Python Forum
pandas column percentile
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
pandas column percentile
#1
Dear All,
I have a time series in a dataframe that contain vaues for each day for many years. How can I compute the average of values above a certain percentile for, say, each month.
Thank you
Reply
#2
Please show your code, so we know what you're working with.
Reply
#3
(Jul-22-2022, 10:08 AM)Larz60+ Wrote: Please show your code, so we know what you're working with.

data = pd.read_csv('timeseries_data.csv',parse_dates={'Timestamp':[2]},index_col='Timestamp',dayfirst=True,na_values=['-'],skipfooter=1,engine='python')
data
Timestamp
1975-01-02 0.0
1975-01-03 0.5
1975-01-04 0.1
1975-01-05 0.0
1975-01-06 0.0
...
2021-03-27 0.0
2021-03-28 0.0
2021-03-29 0.0
2021-03-30 0.0
2021-03-31 0.0
What I need is to find the 90th percentile for each month and average data above the 90th percentile. This is similar to taking the maximum value in each month, which I could do. However, I feel taking an average of highest values of each month may be better than a single maximum value.
I tried
dataN=data[data >= np.percentile(data,0.9)]
dataN.groupby(pd.Grouper(freq='M')).mean()

but seems to be stupid and gets the following error

TypeError: Invalid comparison between dtype=float64 and Timestamp

Any help is greatly appreciated
Nuncio
Reply
#4
I hate dates. Ok that off my chest -
I would create new columns based on the timestamp for year, month, and date, make those integers.
Would then use groupby on the month column rather than trying to use the timestamp.
Add 'em up, calculate 90th percentile, then select the records that match 90th percentile or above and calculate the average of that subset.

Just my idea.
J
Reply
#5
see: https://pandas.pydata.org/pandas-docs/st...etime.html

this will create a sorted date/time index example:
import pandas as pd


timedata = [
    "2021-03-28",
    "2021-03-29",
    "1975-01-02",
    "1975-01-03",
    "1975-01-04",
    "1975-01-05",
    "1975-01-06",
    "2021-03-27",
    "2021-03-30",
    "2021-03-31"
]

dates = pd.to_datetime(timedata, format="%Y-%m-%d").sort_values()
print(dates)
results:
Output:
DatetimeIndex(['1975-01-02', '1975-01-03', '1975-01-04', '1975-01-05', '1975-01-06', '2021-03-27', '2021-03-28', '2021-03-29', '2021-03-30', '2021-03-31'], dtype='datetime64[ns]', freq=None)
Reply
#6
(Jul-24-2022, 12:41 PM)jefsummers Wrote: I hate dates. Ok that off my chest -
I would create new columns based on the timestamp for year, month, and date, make those integers.
Would then use groupby on the month column rather than trying to use the timestamp.
Add 'em up, calculate 90th percentile, then select the records that match 90th percentile or above and calculate the average of that subset.

Just my idea.
J
Hi Jeff,
I could group the data by months as follows

dataP=dataH[var1].groupby(pd.Grouper(freq='M'))
dataQ=dataP.quantile(0.9)
but when I filter the data for values greter than 90 percentiles
dataX=dataP[dataP >= dataQ]

get the following error

ValueError Traceback (most recent call last)
<ipython-input-128-92cfb00c3619> in <module>
3 #dataP.reset_index()
4 dataQ=dataP.quantile(0.9)
----> 5 dataX=dataP[dataP >= dataQ]
6 #for val in dataP:
7 # if (val > dataP.quantile(0.9)):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
62 other = item_from_zerodim(other)
63
---> 64 return method(self, other)
65
66 return new_method

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py in wrapper(self, other)
524 rvalues = extract_array(other, extract_numpy=True)
525
--> 526 res_values = comparison_op(lvalues, rvalues, op)
527
528 return _construct_result(self, res_values, index=self.index, name=res_name)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
251 method = getattr(lvalues, op_name)
252 with np.errstate(all="ignore"):
--> 253 res_values = method(rvalues)
254
255 if res_values is NotImplemented:

ValueError: operands could not be broadcast together with shapes (555,) (555,2)

Thank you for the help
Nuncio
Reply
#7
Pandas quantile returns a dataframe or series. So, dataP >= dataQ is comparing two dataframes. You need to find the value that is the breakpoint and compare to that.
Reply
#8
(Aug-04-2022, 02:25 PM)jefsummers Wrote: Pandas quantile returns a dataframe or series. So, dataP >= dataQ is comparing two dataframes. You need to find the value that is the breakpoint and compare to that.
Hi
I could not do it with groupby method, instead I used a loop as below

from datetime import date, timedelta, datetime
from dateutil.relativedelta import relativedelta

start_date = date(1975, 1, 1)
end_date2=date(2020,12,31)
delta=relativedelta(months=1)

while start_date < end_date2:
# print(start_date.strftime("%Y-%m-%d"))
end_date = (start_date+relativedelta(months=1))-timedelta(days=1)
dataX=dataP[start_date:end_date][dataP[start_date:end_date] >= dataP[start_date:end_date].quantile(0.90)]
print(start_date.strftime("%Y-%m-%d"))
print(end_date.strftime("%Y-%m-%d"))
print(dataX.mean())
start_date = start_date+relativedelta(months=1)

this gives the mean, bt are there any other elegant method to do this?
Thank you
nuncio
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  HTML Decoder pandas dataframe column mbrown009 3 1,057 Sep-29-2023, 05:56 PM
Last Post: deanhystad
  pandas: Compute the % of the unique values in a column JaneTan 1 1,788 Oct-25-2021, 07:55 PM
Last Post: jefsummers
  Pandas Data frame column condition check based on length of the value aditi06 1 2,701 Jul-28-2021, 11:08 AM
Last Post: jefsummers
  How to move each team row to a new column. Pandas vladiwnl 0 1,724 Jun-13-2021, 08:10 AM
Last Post: vladiwnl
  iretate over columns in df and calculate euclidean distance with one column in pandas Pit292 0 3,311 May-09-2021, 06:46 PM
Last Post: Pit292
Question Pandas - Creating additional column in dataframe from another column Azureaus 2 2,976 Jan-11-2021, 09:53 PM
Last Post: Azureaus
  Pandas: summing columns conditional on the column labels ddd2332 0 2,132 Sep-10-2020, 05:58 PM
Last Post: ddd2332
  Pandas DataFrame and unmatched column sritsv19 0 3,031 Jul-07-2020, 12:52 PM
Last Post: sritsv19
  Pandas - Dynamic column aggregation based on another column theroadbacktonature 0 3,067 Apr-17-2020, 04:54 PM
Last Post: theroadbacktonature
  Add column to CSV using Pandas nsadams87xx 2 2,814 Apr-15-2020, 08:41 PM
Last Post: snippsat

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020