Python Forum

Calculating median value from time data series
I need to calculate the median of an array of time values. I can manage it with numeric values, but with datetime values it is a real headache. Can someone explain how to do this? Some data format conversion seems to be needed, but I can't figure out which one.
Also, which of the two (numpy or pandas) is the more appropriate and efficient way to calculate a median?

>>> import pandas as pd
>>> import numpy as np
CREATE DATAFRAMES
>>> df1 = pd.DataFrame({'Value': [1, 2, 3]})
>>> df2 = pd.DataFrame({'Value': ['02:00:00', '03:00:00', '04:00:00']})
NUMPY NUMERIC MEDIAN
>>> numpy_numeric_median = np.median(df1)
>>> print(numpy_numeric_median)
2.0
PANDAS NUMERIC MEDIAN
>>> pandas_numeric_median = df1['Value'].median()
>>> print(pandas_numeric_median)
2.0
NUMPY TIME MEDIAN
>>> numpy_time_median = np.median(df2)
TypeError: unsupported operand type(s) for /: 'str' and 'int'

>>> df2_datetime_format = np.array(pd.to_datetime(df2['Value']), dtype=np.datetime64)
array(['2018-08-21T02:00:00.000000000', '2018-08-21T03:00:00.000000000', '2018-08-21T04:00:00.000000000'], dtype='datetime64[ns]')
>>> numpy_time_median = np.median(df2_datetime_format)
TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')
PANDAS TIME MEDIAN
>>> pandas_time_median = df2['Value'].median()
TypeError: could not convert string to float: '04:00:00'

>>> df2_datetime_format = pd.to_datetime(df2['Value'])
0   2018-08-21 02:00:00
1   2018-08-21 03:00:00
2   2018-08-21 04:00:00
Name: Value, dtype: datetime64[ns]

>>> pandas_time_median = df2_datetime_format['Value'].median()
TypeError: an integer is required

>>> pandas_time_median = df2_datetime_format.median()
TypeError: reduction operation 'median' not allowed for this dtype
I don't know whether there is a more convenient way to do this, but sorting the values and taking the middle one (or the two middle ones for an even-length sequence) is straightforward.

>>> length = len(df2_datetime_format)
>>> length
3
>>> df2_datetime_format
0   2018-08-22 02:00:00
1   2018-08-22 03:00:00
2   2018-08-22 04:00:00
Name: Value, dtype: datetime64[ns]
>>> sorted(df2_datetime_format)[length//2]
Timestamp('2018-08-22 03:00:00')
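The lookup above covers the odd-length case. Here is a minimal sketch of the same sorting idea that also handles an even number of values. The function name median_timestamp is hypothetical, and taking the point halfway between the two middle timestamps is an assumption about how the even case should be resolved.

```python
import pandas as pd


def median_timestamp(values):
    """Median of datetime-like values via sorting (hypothetical helper)."""
    # Convert to Timestamps and sort; iterating a datetime64 Series yields Timestamps.
    ordered = sorted(pd.to_datetime(pd.Series(list(values))))
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd length: the single middle element is the median.
        return ordered[mid]
    # Even length (assumption): take the point halfway between the two middles.
    lower, upper = ordered[mid - 1], ordered[mid]
    return lower + (upper - lower) / 2
```

Timestamps cannot be added to each other, which is why the even case adds half of the difference (a Timedelta) to the lower value instead of averaging the two directly.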
Not sure if that is sufficient for your needs.
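Another option, since the values in df2 are times of day with no date attached, is to treat them as durations since midnight. The sketch below assumes the strings are in HH:MM:SS form and uses pd.to_timedelta, because pandas can take the median of a timedelta Series. The function name median_time_of_day is hypothetical.

```python
import pandas as pd


def median_time_of_day(values):
    """Median of 'HH:MM:SS' strings, returned as a pandas Timedelta (hypothetical helper)."""
    # Interpret each string as an offset from midnight (assumes HH:MM:SS input).
    durations = pd.to_timedelta(pd.Series(list(values)))
    # Unlike the datetime64 attempts above, median works on timedelta64 data.
    return durations.median()


df2 = pd.DataFrame({'Value': ['02:00:00', '03:00:00', '04:00:00']})
pandas_time_median = median_time_of_day(df2['Value'])
```

The result is a Timedelta rather than a string, so it would still need formatting if the original HH:MM:SS text form is wanted.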