Python Forum
Getting the hourly average of a time series dataset
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Getting the hourly average of a time series dataset
#1
Hello guys, I have a large dataframe of 10000 rows and 2 columns .The first column is the dates(years,months,days,hours,minutes) as index. My dates column doesnt have a stable frequency. I am trying to get the average value of 2nd column of every 1 hour .
Any ideas on how can I iterate this problem for the whole time series?
My df looks like this and I want to get the average value from 20:00 to 21:00 and so on.

time value
2019-03-21 20:03:30.351000 22.6
2019-03-21 20:33:30.392000 22.7
2019-03-21 20:38:30.428000 22.6
2019-03-21 20:43:30.363000 22.7
2019-03-21 21:13:30.382000 22.6
2019-03-21 22:33:34.762000 22.5
2019-03-21 22:38:34.696000 22.6
Reply
#2
Something like this maybe?:

number_of_entries = 0
total = 0
For time in column:
    number_of_entries += 1
    total += time
average = total/number_of_entries
    
Reply
#3
(Nov-24-2020, 03:38 PM)michael1789 Wrote: Something like this maybe?:

number_of_entries = 0
total = 0
For time in column:
    number_of_entries += 1
    total += time
average = total/number_of_entries
    

No I am trying to get the average of every hour. For example if 4 values are in the range of 1 hour, I am trying to get the average of them, if I have 2 values in the next hour I am trying to get the average of those 2. Also, how does this code iterates the datetime index every 1 hour?
Reply
#4
You may want to use df.period.hour() to extract the hour from your datetime, then define a function to get the values for a given hour on a particular date and compute the average. If you can show us what you have tried we can help you if you get stuck.

Read about Periods here: https://pandas.pydata.org/pandas-docs/st...eriod.html
Reply
#5
count = 0
for year, z in df.groupby(df.index.map(lambda t: t.year)):
       for month, x  in df.groupby(df.index.map(lambda t: t.month)):
               for day, i in df.groupby(df.index.map(lambda t: t.day)):
                      for hour, j  in df.groupby(df.index.map(lambda t: t.hour)):
                          count = count + 1
This is what I was trying to do, problem is that it doesnt work as I thought it would be.
I want to check for example for the hour 14:00:00 of day 1 from my dataset and see how many values are in that range.Then get the average of them and store them in a new dataframe.Then keep doing this for every day and every hour in my dataset.
Reply
#6
(Nov-25-2020, 03:01 AM)palladium Wrote: You may want to use df.period.hour() to extract the hour from your datetime, then define a function to get the values for a given hour on a particular date and compute the average. If you can show us what you have tried we can help you if you get stuck.

Read about Periods here: https://pandas.pydata.org/pandas-docs/st...eriod.html


count = 0
for year, z in df.groupby(df.index.map(lambda t: t.year)):
       for month, x  in df.groupby(df.index.map(lambda t: t.month)):
               for day, i in df.groupby(df.index.map(lambda t: t.day)):
                      for hour, j  in df.groupby(df.index.map(lambda t: t.hour)):
                          count = count + 1
This is what I was trying to do, problem is that it doesnt work as I thought it would be.
I want to check for example for the hour 14:00:00 of day 1 from my dataset and see how many values are in that range.Then get the average of them and store them in a new dataframe.Then keep doing this for every day and every hour in my dataset.
Reply
#7
import pandas as pd
from datetime import datetime
import numpy as np

def get_hour(string):
    """Gets hour from datestring"""
    hr = datetime.fromisoformat(string).hour
    return hr

def get_date(string):
    """Gets date from datestring"""
    date = datetime.strptime(string, '%Y-%m-%d %H:%M:%S.%f')
    return str(date.date())

def get_average(df):
    average = df['value'].sum()/len(df['value'])
    return average

df = pd.read_csv('pandas.csv', header = 0, names = ['time', 'value'])
print(df)
hr = [get_hour(i) for i in df['time']]
date = [get_date(i) for i in df['time']]

df['hour'] = hr  #I'm sure there is a way to vectorize this part
# df['hour'] = datetime.fromisoformat(df['time']).hour # does not work - need help here
                                                    # because something like this would be faster
df['date'] = date
df.drop('time', axis = 1, inplace = True) #cleanup

unique_dates = np.unique(df['date']) #returns list

for u in unique_dates: #this probably can be sped up too with large datasets - help!
    unique_df = df[df['date'] == u] #returns df containing only u
    unique_hr = np.unique(unique_df['hour']) #returns list
    for h in unique_hr:
        unique_hr_df = unique_df[unique_df['hour'] == h]
        average = get_average(unique_hr_df)
        print('Date:', u, 'Hour:', h, "Average:", average)
 
Gives the following output:
Output:
time value 0 2019-03-21 20:03:30.351000 22.6 1 2019-03-21 20:33:30.392000 22.7 2 2019-03-21 20:38:30.428000 22.6 3 2019-03-23 20:43:30.363000 22.7 4 2019-03-23 21:13:30.382000 22.6 5 2019-03-23 22:33:34.762000 22.5 6 2019-03-23 22:38:34.696000 22.6 Date: 2019-03-21 Hour: 20 Average: 22.633333333333336 Date: 2019-03-23 Hour: 20 Average: 22.7 Date: 2019-03-23 Hour: 21 Average: 22.6 Date: 2019-03-23 Hour: 22 Average: 22.55
Note I changed your original dataset so that it has 2 different dates, for illustrative purposes.

I am sure some parts of the code can be sped up - see comments in code above. If some pandas guru can give me some tips that would be appreciated.
Reply
#8
Hello again and thanks for trying to help me.I've tried what you are doing and seems good to me, but i get this error:
ValueError: Invalid isoformat string: '2019-03-21 14:00:00.011000 '
Note this is the first date of my .csv so this means it doesnt get through dates at all.
Any ideas on how to fix this?
Reply
#9
I plugged in the problematic string you pointed out into my csv file and it worked for me.

Can you post the entire traceback?

Also, check that your csv file is formatted like your OP post, but with commas separating the time and the value. My csv looks like this in Notepad:

time, value
2019-03-21 20:03:30.351000, 22.6
2019-03-21 20:33:30.392000, 22.7
2019-03-21 20:38:30.428000, 22.6
2019-03-23 20:43:30.363000, 22.7
2019-03-23 21:13:30.382000, 22.6
2019-03-23 22:33:34.762000, 22.5
2019-03-23 22:38:34.696000, 22.6
2019-03-21 14:00:00.011000, 22.7
Otherwise, you may need to tweak the code that gets the datetime info from the string, i.e. the get_date and get_hour function. Readup the datetime docs and see what works for you.
Reply
#10
Traceback (most recent call last):
  File "C:/Users/..../PycharmProjects/...../average_per_hour.py", line 25, in <module>
    hr = [get_hour(i) for i in df['time']]
  File "C:/Users/...../PycharmProjects/...../average_per_hour.py", line 25, in <listcomp>
    hr = [get_hour(i) for i in df['time']]
  File "C:/Users/...../PycharmProjects/...../average_per_hour.py", line 8, in get_hour
    hr = datetime.fromisoformat(string).hour
ValueError: Invalid isoformat string: '2019-03-21 14:00:00.011000 '
This is the traceback. My .csv file in notepad looks like this(same as yours)
time,value
2019-03-21 14:00:00.011000 ,22.4
2019-03-21 15:00:00.008000 ,22.8
2019-03-21 16:00:00.016000 ,23.0
2019-03-21 17:00:00.012000 ,22.9
2019-03-21 18:00:00.011000 ,22.8
2019-03-21 19:00:00.010000 ,22.7
2019-03-21 20:03:30.351000 ,22.6
2019-03-21 20:33:30.392000 ,22.7
2019-03-21 20:38:30.428000 ,22.6
2019-03-21 20:43:30.363000 ,22.7
2019-03-21 21:13:30.382000 ,22.6
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Date Time Series Help...Please spra8560 2 312 Feb-01-2024, 01:38 PM
Last Post: spra8560
  Print names in x-axis of a time-series values hobbyist 4 1,177 Apr-22-2023, 09:29 PM
Last Post: deanhystad
  Time series JasminQuinn 0 1,005 Apr-22-2022, 10:33 PM
Last Post: JasminQuinn
  How to read rainfall time series and insert missing data points MadsM 4 2,123 Jan-06-2022, 10:39 AM
Last Post: amdi40
  Plotting A Time Series With Shaded Recession Bars adamszymanski 1 3,119 Jan-24-2021, 09:08 PM
Last Post: nealc
  Tableau Time Series Prediction using Python Integration tobimarsh43 0 1,888 Jul-24-2020, 10:38 AM
Last Post: tobimarsh43
  Bode plot from time series experiment data discus 4 7,209 Jun-20-2020, 07:46 AM
Last Post: discus
  Convert quarterly time series to monthly time series donnertrud 1 5,102 May-22-2020, 10:16 AM
Last Post: pyzyx3qwerty
  Parsing Date/Time from Metar Reports with 6 hourly weather information Lawrence 0 2,289 May-03-2020, 08:15 PM
Last Post: Lawrence
  Time series manipulation SinPy 0 1,396 Apr-01-2020, 12:48 PM
Last Post: SinPy

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020