Posts: 101
Threads: 48
Joined: May 2018
Hi everyone,
I am working on cleaning a data set which has amounts entered for each day. When a date is missing for a site ref, there is simply no row for it. Example data below:
import pandas as pd
data = {
'Date': ['1/12/2002', '2/12/2002', '3/12/2002', '1/12/2002', '6/01/2003', '9/01/2003', '5/01/2003', '6/01/2003', '7/01/2003'],
'Site ref': [1, 1, 1, 2, 1, 1, 2, 2, 2],
'amount': [5, 6, 7, 3, 2, 8, 7, 9, 1]
}
df = pd.DataFrame(data)

I'd like to add rows for each site ref where dates are missing and then fill in the amount based on an average of the previous 7 days. It gets challenging because some gaps are longer than 7 days, so those amounts end up being averages of averages. I'd like the dates to run up to the latest date in the dataframe, in this example 7/01/2003.
Does anyone know how to do this?
Real data below:
https://opendata-nzta.opendata.arcgis.co...Table=true
I appreciate any help.
Posts: 1,091
Threads: 143
Joined: Jul 2017
Quote:Question: How to iterate over rows in a DataFrame in Pandas?
Answer: DON'T*!
Quote:Obligatory disclaimer from the documentation
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:
Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing, …
When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply() instead of iterating over the values. See the docs on function application.
If you need to do iterative manipulations on the values but performance is important, consider writing the inner loop with cython or numba. See the enhancing performance section for some examples of this approach.
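As a tiny illustration of the difference (a made-up example, nothing to do with your data):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# slow: visits every row in Python
totals = []
for _, row in df.iterrows():
    totals.append(row['a'] + row['b'])
df['total'] = totals

# vectorized: one whole-column operation, no explicit loop
df['total'] = df['a'] + df['b']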
What exactly would you like to do with the data? Your link has 5,446,602 rows, starting with 12/31/2017, 12:00 PM
You want a df with a dates column where the dates are continuous and sequential, starting from '1/12/2002'?
There are a lot of dates missing from '1/12/2002' to '7/01/2003'?
Then you want to fill in the missing data with averages of what? How would that be representative of actual events?
Maybe you can give a clear, small example of what you want to do?
Posts: 101
Threads: 48
Joined: May 2018
Basically, I want to get the sums of the traffic count by site reference and vehicle class for each month. As part of the data cleaning I filter out site references that have less than 60 months of traffic activity, and I then filter out site references that have been duplicated (not many, only about 8 have this issue). I also filter out 2017-12 because the counts are weird, so the data starts from Jan 2018. After all these filters the dataframe is probably still too big for a loop, so I'm keen to hear any ideas about how to go about this.
For a site reference with no traffic, it should have 0. When there is no data for a day (i.e. the row is missing), it is assumed the meter was down for that date. I want to fill in the data for when the meter was down by taking the average of the last 7 days for that site reference by vehicle class (heavy, light) and using that amount.
Does that clarify? I can share all my code if that is unclear.
Thanks
Posts: 1,091
Threads: 143
Joined: Jul 2017
Quote:For site reference with no traffic, it should have 0.
Pandas has fillna(). This will replace all NaN with zero everywhere:

df.fillna(0)

60 months = 5 years = 5 * 365 days + 1 or 2, depending on the number of leap years.
2018 to 2024 is only about 6 1/2 years. How many sites have continuous sequential dates for more than 5 years?
Won't you be filtering out almost everything?
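You could check that quickly, something like this (a rough sketch; the path is wherever you saved the csv, column names are from its header):

import pandas as pd

df = pd.read_csv('TMS_Telemetry_Sites_4135033762041494366.csv', parse_dates=['Start Date'])

# for each site: how many distinct days were recorded vs how many days the span covers
per_site = df.groupby('Site Reference')['Start Date'].agg(
    days_recorded=lambda s: s.dt.normalize().nunique(),
    span_days=lambda s: (s.max() - s.min()).days + 1,
)
# sites with no missing days at all over more than 5 years
complete = per_site[(per_site['days_recorded'] == per_site['span_days'])
                    & (per_site['span_days'] > 5 * 365)]
print(len(complete))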
Posts: 101
Threads: 48
Joined: May 2018
I know about fillna, but remember there are no rows for the missing dates. fillna is for when you have a column with a NaN value that you want to replace with a 0.
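The rows would have to exist first before there is anything to fill; a tiny made-up example of what I mean:

import pandas as pd

s = pd.Series([5, 6, 9], index=pd.to_datetime(['2002-12-01', '2002-12-02', '2002-12-04']))
# reindexing against the full range creates the missing day as a NaN row,
# which fillna (or an average) could then replace
full = s.reindex(pd.date_range('2002-12-01', '2002-12-04'))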
The filtering for the 60 months is done by counting the dates by site reference. The Date variable takes the raw date and converts it to month and year. A site reference has to have no amounts for an entire month for that month to count as missing. Below is some of the code to clarify:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#https://opendata-nzta.opendata.arcgis.com/datasets/NZTA::tms-daily-traffic-counts-api/explore?showTable=true
path = r'C:\Users\SGrah\OneDrive\Documents\Python Scripts\Data\Economics data\TMS_Telemetry_Sites_4135033762041494366.csv'
df = pd.read_csv(path, header=0, usecols=[0,1,2,3,4,5,6,7,8,9], index_col=[0])
df2 = df.copy()
df2['Start Date'] = pd.to_datetime(df2['Start Date'])
df2['Day'] = df2['Start Date'].dt.day
df2['Month'] = df2['Start Date'].dt.month
df2['Year'] = df2['Start Date'].dt.year
# build a month-level date; zero-pad the month so '%Y%m' always gets two digits
df2['Date'] = pd.to_datetime(df2['Year'].astype(str) +
    df2['Month'].astype(str).str.zfill(2), format='%Y%m')
df2.drop(['Day', 'Month', 'Year'], axis=1, inplace=True)
# Pivot the DataFrame
pivot_df = df2.pivot_table(index=['Date', 'Site Description'], columns='Class Weight', values='Traffic Count', aggfunc='sum', fill_value=0)
pivot_df.columns = [f'{col}_class' for col in pivot_df.columns]
# Reset index to get Date and id back as columns
pivot_df = pivot_df.reset_index()
df3 = pivot_df
df3['count'] = df3.groupby(['Site Description'])['Date'].transform('count')
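The 60-month filter itself is then applied on that count, something like:

df3 = df3[df3['count'] >= 60]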
Does that give a better idea of how the count filter works? Filtering this way leaves about 30% of the original data.
Thanks
Posts: 1,091
Threads: 143
Joined: Jul 2017
Seems to me it might be a good idea to save all entries for each Site Alias as its own csv. (I am not used to dealing with enormous quantities of data!)
I know you said you don't want dates from 2017, but just as an example which does not have millions of rows, get all Site Alias 39 rows:
import pandas as pd
path2csv = '/home/pedro/myPython/pandas/csv_files/TMS_Telemetry_Sites_4135033762041494366.csv'
savepath = '/home/pedro/myPython/pandas/csv_files/'
df = pd.read_csv(path2csv, header=0, usecols=[0,1,2,3,4,5,6,7,8,9], index_col=[0])
df.shape # returns (5446602, 9)
df.columns
"""
Index(['Start Date', 'Site Alias', 'Region Name', 'Site Reference',
'Class Weight', 'Site Description', 'Lane Number', 'Flow Direction',
'Traffic Count'],
dtype='object')"""
# get 1 site by Site Alias
df39 = df[df['Site Alias'] == 39].copy()
df39.shape # returns (8760, 9)
df39.to_csv(savepath + '_df39.csv', encoding='utf-8')

Now, what exactly do you want to do with df39?
If you look, at least some dates are out of order; some 2022 dates come after 2024:
Output: df39
Start Date Site Alias ... Flow Direction Traffic Count
OBJECTID ...
1 12/31/2017 12:00:00 PM 39 ... 2 2468.0
2 12/31/2017 12:00:00 PM 39 ... 2 73.0
3 12/31/2017 12:00:00 PM 39 ... 1 74.0
4 12/31/2017 12:00:00 PM 39 ... 1 3806.0
1125 1/1/2018 12:00:00 PM 39 ... 2 101.5
... ... ... ... ... ...
5413355 5/13/2024 12:00:00 AM 39 ... 1 362.5
5431250 11/28/2022 12:00:00 AM 39 ... 2 476.0
5433078 11/28/2022 12:00:00 AM 39 ... 1 3255.5
5436538 11/28/2022 12:00:00 AM 39 ... 1 368.5
5436651 11/28/2022 12:00:00 AM 39 ... 2 3342.0
Posts: 101
Threads: 48
Joined: May 2018
I've been told to use site reference so it might be better to filter to one of those.
I'm not too sure why the ordering of the start date is that way, but it could be that the site aliases are the machines used to record the traffic, so when they take one and move it to a new location it gets a new site reference and site description.
But what I want to do is take a site reference and, when there is a date missing, use the average of the traffic count over the last 7 days to fill in the missing traffic count for each class weight. All other fields would remain the same. To make things easier I remove all the fields I don't need and put the class weights into columns, with the values in those columns being the traffic counts (as per the code above).
Thanks
Posts: 1,091
Threads: 143
Joined: Jul 2017
There are different numbers of distinct Site Alias and Site Reference values. Not sure why they are not the same! Data problem?
ids = set(df['Site Alias'])
len(ids) # returns 1727
ids = set(df['Site Reference'])
len(ids) # returns 1775

If you want to work by Site Reference, you could save 1775 different csvs and work on them individually. That might be better than adding rows to the 5,446,602 rows you already have! But I don't suppose Pandas minds!
It is easier for me to imagine when I am only dealing with 1 Site Reference!
Sort them to get the dates you have in order:
df39_sorted_by_date = df39.sort_values(by='Start Date', ascending=True)

You still have to distinguish / sort by Light and Heavy and Lane Number, because these have the same dates.
Then do the wizardry of finding the non-sequential dates, insert rows until the sequence is complete, then do your averaging stuff?
If you work on much smaller csvs, the task does not seem so daunting to me!
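Maybe something along these lines for one Site Reference and one class weight (an untested sketch, so treat it as an idea rather than a recipe):

import pandas as pd

# a daily Series of traffic counts for one Site Reference and one class weight,
# indexed by date, with some days missing
s = pd.Series([5.0, 6.0, 7.0, 9.0],
              index=pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-12']))

# create a row for every day between the first and last recorded date
full = s.reindex(pd.date_range(s.index.min(), s.index.max()))

# fill each missing day with the mean of the previous 7 days; filled values feed
# later windows, so gaps longer than 7 days become averages of averages
for i in range(len(full)):
    if pd.isna(full.iloc[i]):
        full.iloc[i] = full.iloc[max(0, i - 7):i].mean()

A loop is fine at that scale, because each Site Reference only has a few thousand days.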
Posts: 1,091
Threads: 143
Joined: Jul 2017
Back again, this interested me!
I never work on such big data, so I split the original csv into 1775 smaller csvs, according to the column Site Reference. It took more than 40 minutes on my laptop. That was 1775 loops through 5 1/2 million rows!
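Roughly like this (my paths; there are no doubt faster ways to do the split):

import pandas as pd

path2csv = '/home/pedro/myPython/pandas/csv_files/TMS_Telemetry_Sites_4135033762041494366.csv'
savepath = '/home/pedro/myPython/pandas/csv_files/TMS/'
df = pd.read_csv(path2csv, header=0, usecols=[0,1,2,3,4,5,6,7,8,9], index_col=[0])

# one pass through the big frame per Site Reference: grab its rows, save them as their own csv
for ref in set(df['Site Reference']):
    df[df['Site Reference'] == ref].to_csv(savepath + f'{ref}.csv', encoding='utf-8')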
1 of those smaller files is for Site Reference 01B00015: 01B00015.csv
So now I am working on 01B00015.csv as an example, an example which you probably don't want, because it is small, only 416 rows, so just over 1 year of data. But the principle would remain the same for all Site Reference you want to examine:
Make a df, all_dates_df, which contains all the dates associated with Site Reference 01B00015, from the earliest date to the latest date and every single day in between.
Combine the pivot_df with all_dates_df where the dates are the same. Then you can easily see where data is missing and think about what you want to do in that case: add average data, I believe you said. Basically cook the books!
I would not like to add more rows to the 5 and 1/2 million you already have, so I would save the amalgamated data for each Site Reference as its own file.
Goes like this:
import pandas as pd
savepath = '/home/pedro/myPython/pandas/csv_files/TMS/'
df = pd.read_csv(savepath + '01B00015.csv')
df.shape # (416, 10)
# spaces in column names can cause problems, get rid of them
cols = list(df.columns)
newcols = [col.replace(' ', '_') for col in cols]
df.rename(columns=dict(zip(cols, newcols)), inplace=True)
# make sure column Start_Date is datetime
df.Start_Date.dtype # returns dtype('O')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
# sort by date ascending
df = df.sort_values(by='Start_Date', ascending=True)
# get the earliest and latest dates for making all_dates_df
# only 2 rows
df2 = df[0::len(df)-1 if len(df) > 1 else 1]
start = df2.iloc[0]['Start_Date'] # like Timestamp('2018-04-04 12:00:00')
end = df2.iloc[1]['Start_Date'] # like Timestamp('2024-03-06 00:00:00')
# pivot df on Class_Weight
pivot_df = df.pivot_table(index=['Start_Date'], columns='Class_Weight', values='Traffic_Count', aggfunc='sum', fill_value=0)
pivot_df = pivot_df.reset_index()
pivot_df.shape # (124, 3)
pivotcols = pivot_df.columns
all_dates_df = pd.DataFrame(columns=pivotcols)
all_dates_df['Start_Date'] = pd.date_range(start=start, end=end)
all_dates_df.shape # shows (2163, 4)
all_dates_df.Start_Date.dtype # dtype('<M8[ns]')
pivot_df.Start_Date.dtype # dtype('<M8[ns]')
# now put the pivot data into all_dates_df where the dates are the same
all_dates_df = all_dates_df.set_index('Start_Date').combine_first(pivot_df.set_index('Start_Date')).reset_index()
all_dates_df.to_csv(savepath + '01B00015_combined_data_by_date.csv', sep=',', encoding='utf-8', index=False)

Still might need some tweaks, maybe remove the time from the datetime, that is irrelevant here, I believe.
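For what it's worth, the time could be stripped with normalize(); doing it straight after the to_datetime conversion would also stop a 12:00:00 PM stamp and a 12:00:00 AM stamp on the same day being treated as two different dates:

df['Start_Date'] = df['Start_Date'].dt.normalize()  # keeps the date, drops the time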
Posts: 101
Threads: 48
Joined: May 2018
I worked this issue out. It took a lot of work, and it is messy and probably not the best way to do it, but it worked. If you think there is value in me posting it here for people to see in the future, let me know and I can type it out. It will be a bit of work though.
Regards