Python Forum - pandas column percentile

Dear All,
I have a time series in a dataframe that contain vaues for each day for many years. How can I compute the average of values above a certain percentile for, say, each month.
Thank you

Please show your code, so we know what you're working with.

(Jul-22-2022, 10:08 AM)Larz60+ Wrote: [ -> ]Please show your code, so we know what you're working with.

data = pd.read_csv('timeseries_data.csv',parse_dates={'Timestamp':[2]},index_col='Timestamp',dayfirst=True,na_values=['-'],skipfooter=1,engine='python')
data
Timestamp
1975-01-02 0.0
1975-01-03 0.5
1975-01-04 0.1
1975-01-05 0.0
1975-01-06 0.0
...
2021-03-27 0.0
2021-03-28 0.0
2021-03-29 0.0
2021-03-30 0.0
2021-03-31 0.0
What I need is to find the 90th percentile for each month and average data above the 90th percentile. This is similar to taking the maximum value in each month, which I could do. However, I feel taking an average of highest values of each month may be better than a single maximum value.
I tried
dataN=data[data >= np.percentile(data,0.9)]
dataN.groupby(pd.Grouper(freq='M')).mean()

but seems to be stupid and gets the following error

TypeError: Invalid comparison between dtype=float64 and Timestamp

Any help is greatly appreciated
Nuncio

I hate dates. Ok that off my chest -
I would create new columns based on the timestamp for year, month, and date, make those integers.
Would then use groupby on the month column rather than trying to use the timestamp.
Add 'em up, calculate 90th percentile, then select the records that match 90th percentile or above and calculate the average of that subset.

Just my idea.
J

see: https://pandas.pydata.org/pandas-docs/st...etime.html

this will create a sorted date/time index example:

import pandas as pd


timedata = [
    "2021-03-28",
    "2021-03-29",
    "1975-01-02",
    "1975-01-03",
    "1975-01-04",
    "1975-01-05",
    "1975-01-06",
    "2021-03-27",
    "2021-03-30",
    "2021-03-31"
]

dates = pd.to_datetime(timedata, format="%Y-%m-%d").sort_values()
print(dates)

results:

Output:DatetimeIndex(['1975-01-02', '1975-01-03', '1975-01-04', '1975-01-05',
               '1975-01-06', '2021-03-27', '2021-03-28', '2021-03-29',
               '2021-03-30', '2021-03-31'],
              dtype='datetime64[ns]', freq=None)

(Jul-24-2022, 12:41 PM)jefsummers Wrote: [ -> ]I hate dates. Ok that off my chest -
I would create new columns based on the timestamp for year, month, and date, make those integers.
Would then use groupby on the month column rather than trying to use the timestamp.
Add 'em up, calculate 90th percentile, then select the records that match 90th percentile or above and calculate the average of that subset.

Just my idea.
J

Hi Jeff,
I could group the data by months as follows

dataP=dataH[var1].groupby(pd.Grouper(freq='M'))
dataQ=dataP.quantile(0.9)
but when I filter the data for values greter than 90 percentiles
dataX=dataP[dataP >= dataQ]

get the following error

ValueError Traceback (most recent call last)
<ipython-input-128-92cfb00c3619> in <module>
3 #dataP.reset_index()
4 dataQ=dataP.quantile(0.9)
----> 5 dataX=dataP[dataP >= dataQ]
6 #for val in dataP:
7 # if (val > dataP.quantile(0.9)):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
62 other = item_from_zerodim(other)
63
---> 64 return method(self, other)
65
66 return new_method

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py in wrapper(self, other)
524 rvalues = extract_array(other, extract_numpy=True)
525
--> 526 res_values = comparison_op(lvalues, rvalues, op)
527
528 return _construct_result(self, res_values, index=self.index, name=res_name)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
251 method = getattr(lvalues, op_name)
252 with np.errstate(all="ignore"):
--> 253 res_values = method(rvalues)
254
255 if res_values is NotImplemented:

ValueError: operands could not be broadcast together with shapes (555,) (555,2)

Thank you for the help
Nuncio

Pandas quantile returns a dataframe or series. So, dataP >= dataQ is comparing two dataframes. You need to find the value that is the breakpoint and compare to that.

(Aug-04-2022, 02:25 PM)jefsummers Wrote: [ -> ]Pandas quantile returns a dataframe or series. So, dataP >= dataQ is comparing two dataframes. You need to find the value that is the breakpoint and compare to that.

Hi
I could not do it with groupby method, instead I used a loop as below

from datetime import date, timedelta, datetime
from dateutil.relativedelta import relativedelta

start_date = date(1975, 1, 1)
end_date2=date(2020,12,31)
delta=relativedelta(months=1)

while start_date < end_date2:
# print(start_date.strftime("%Y-%m-%d"))
end_date = (start_date+relativedelta(months=1))-timedelta(days=1)
dataX=dataP[start_date:end_date][dataP[start_date:end_date] >= dataP[start_date:end_date].quantile(0.90)]
print(start_date.strftime("%Y-%m-%d"))
print(end_date.strftime("%Y-%m-%d"))
print(dataX.mean())
start_date = start_date+relativedelta(months=1)

this gives the mean, bt are there any other elegant method to do this?
Thank you
nuncio