pandas column percentile

nuncio · Jul-22-2022, 06:29 AM

Dear All,
I have a time series in a dataframe that contain vaues for each day for many years. How can I compute the average of values above a certain percentile for, say, each month.
Thank you

**Larz60+** · Jul-22-2022, 10:08 AM

Please show your code, so we know what you're working with.

nuncio · Jul-22-2022, 10:54 AM

(Jul-22-2022, 10:08 AM)Larz60+ Wrote: Please show your code, so we know what you're working with.

data = pd.read_csv('timeseries_data.csv',parse_dates={'Timestamp':[2]},index_col='Timestamp',dayfirst=True,na_values=['-'],skipfooter=1,engine='python')
data
Timestamp
1975-01-02 0.0
1975-01-03 0.5
1975-01-04 0.1
1975-01-05 0.0
1975-01-06 0.0
...
2021-03-27 0.0
2021-03-28 0.0
2021-03-29 0.0
2021-03-30 0.0
2021-03-31 0.0
What I need is to find the 90th percentile for each month and average data above the 90th percentile. This is similar to taking the maximum value in each month, which I could do. However, I feel taking an average of highest values of each month may be better than a single maximum value.
I tried
dataN=data[data >= np.percentile(data,0.9)]
dataN.groupby(pd.Grouper(freq='M')).mean()

but seems to be stupid and gets the following error

TypeError: Invalid comparison between dtype=float64 and Timestamp

Any help is greatly appreciated
Nuncio

jefsummers · Jul-24-2022, 12:41 PM

I hate dates. Ok that off my chest -
I would create new columns based on the timestamp for year, month, and date, make those integers.
Would then use groupby on the month column rather than trying to use the timestamp.
Add 'em up, calculate 90th percentile, then select the records that match 90th percentile or above and calculate the average of that subset.

Just my idea.
J

**Larz60+** · Jul-24-2022, 01:18 PM

see: https://pandas.pydata.org/pandas-docs/st...etime.html

this will create a sorted date/time index example:

import pandas as pd


timedata = [
    "2021-03-28",
    "2021-03-29",
    "1975-01-02",
    "1975-01-03",
    "1975-01-04",
    "1975-01-05",
    "1975-01-06",
    "2021-03-27",
    "2021-03-30",
    "2021-03-31"
]

dates = pd.to_datetime(timedata, format="%Y-%m-%d").sort_values()
print(dates)

results:

Output:DatetimeIndex(['1975-01-02', '1975-01-03', '1975-01-04', '1975-01-05',
               '1975-01-06', '2021-03-27', '2021-03-28', '2021-03-29',
               '2021-03-30', '2021-03-31'],
              dtype='datetime64[ns]', freq=None)

nuncio · Aug-04-2022, 10:52 AM

(Jul-24-2022, 12:41 PM)jefsummers Wrote: I hate dates. Ok that off my chest -
I would create new columns based on the timestamp for year, month, and date, make those integers.
Would then use groupby on the month column rather than trying to use the timestamp.
Add 'em up, calculate 90th percentile, then select the records that match 90th percentile or above and calculate the average of that subset.

Just my idea.
J

Hi Jeff,
I could group the data by months as follows

dataP=dataH[var1].groupby(pd.Grouper(freq='M'))
dataQ=dataP.quantile(0.9)
but when I filter the data for values greter than 90 percentiles
dataX=dataP[dataP >= dataQ]

get the following error

ValueError Traceback (most recent call last)
<ipython-input-128-92cfb00c3619> in <module>
3 #dataP.reset_index()
4 dataQ=dataP.quantile(0.9)
----> 5 dataX=dataP[dataP >= dataQ]
6 #for val in dataP:
7 # if (val > dataP.quantile(0.9)):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
62 other = item_from_zerodim(other)
63
---> 64 return method(self, other)
65
66 return new_method

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py in wrapper(self, other)
524 rvalues = extract_array(other, extract_numpy=True)
525
--> 526 res_values = comparison_op(lvalues, rvalues, op)
527
528 return _construct_result(self, res_values, index=self.index, name=res_name)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
251 method = getattr(lvalues, op_name)
252 with np.errstate(all="ignore"):
--> 253 res_values = method(rvalues)
254
255 if res_values is NotImplemented:

ValueError: operands could not be broadcast together with shapes (555,) (555,2)

Thank you for the help
Nuncio

jefsummers · Aug-04-2022, 02:25 PM

Pandas quantile returns a dataframe or series. So, dataP >= dataQ is comparing two dataframes. You need to find the value that is the breakpoint and compare to that.

nuncio · Aug-10-2022, 04:41 AM

(Aug-04-2022, 02:25 PM)jefsummers Wrote: Pandas quantile returns a dataframe or series. So, dataP >= dataQ is comparing two dataframes. You need to find the value that is the breakpoint and compare to that.

Hi
I could not do it with groupby method, instead I used a loop as below

from datetime import date, timedelta, datetime
from dateutil.relativedelta import relativedelta

start_date = date(1975, 1, 1)
end_date2=date(2020,12,31)
delta=relativedelta(months=1)

while start_date < end_date2:
# print(start_date.strftime("%Y-%m-%d"))
end_date = (start_date+relativedelta(months=1))-timedelta(days=1)
dataX=dataP[start_date:end_date][dataP[start_date:end_date] >= dataP[start_date:end_date].quantile(0.90)]
print(start_date.strftime("%Y-%m-%d"))
print(end_date.strftime("%Y-%m-%d"))
print(dataX.mean())
start_date = start_date+relativedelta(months=1)

this gives the mean, bt are there any other elegant method to do this?
Thank you
nuncio

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Find duplicates in a pandas dataframe list column on other rows	Calab	2	2,591	Sep-18-2024, 07:38 PM Last Post: Calab
	Find strings by index from a list of indexes in a different Pandas dataframe column	Calab	3	1,858	Aug-26-2024, 04:52 PM Last Post: Calab
	HTML Decoder pandas dataframe column	mbrown009	3	2,914	Sep-29-2023, 05:56 PM Last Post: deanhystad
	pandas: Compute the % of the unique values in a column	JaneTan	1	2,538	Oct-25-2021, 07:55 PM Last Post: jefsummers
	Pandas Data frame column condition check based on length of the value	aditi06	1	3,960	Jul-28-2021, 11:08 AM Last Post: jefsummers
	How to move each team row to a new column. Pandas	vladiwnl	0	2,208	Jun-13-2021, 08:10 AM Last Post: vladiwnl
	iretate over columns in df and calculate euclidean distance with one column in pandas	Pit292	0	4,494	May-09-2021, 06:46 PM Last Post: Pit292
	Pandas - Creating additional column in dataframe from another column	Azureaus	2	4,052	Jan-11-2021, 09:53 PM Last Post: Azureaus
	Pandas: summing columns conditional on the column labels	ddd2332	0	2,700	Sep-10-2020, 05:58 PM Last Post: ddd2332
	Pandas DataFrame and unmatched column	sritsv19	0	3,787	Jul-07-2020, 12:52 PM Last Post: sritsv19

pandas column percentile

User Panel Messages

Announcements