Loop over dataframe to fill in missing rows
#1
Hi everyone,

I am working on cleaning a data set that has amounts entered for each day. When a date is missing for a site ref, there is simply no row for it. Example in the table below:

import pandas as pd
 
data = {
    'Date': ['1/12/2002', '2/12/2002', '3/12/2002', '1/12/2002', '6/01/2003', '9/01/2003', '5/01/2003', '6/01/2003', '7/01/2003'],
    'Site ref': [1, 1, 1, 2, 1, 1, 2, 2, 2],
    'amount': [5, 6, 7, 3, 2, 8, 7, 9, 1]
}
df = pd.DataFrame(data)
I'd like to add rows for each site ref where dates are missing and then fill in the amount based on an average of the last 7 days. It gets challenging because some gaps are longer than 7 days, so those amounts will be based on averages of averages. I'd like the dates to run up to the latest date in the dataframe, 7/01/2003 in this example.
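
To make the gaps concrete, this is how I've been listing the missing dates per site (assuming the dates are day/month/year):

# List the missing dates per site, up to the latest date in the frame
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
full_range = pd.date_range(df['Date'].min(), df['Date'].max(), freq='D')
for site, grp in df.groupby('Site ref'):
    missing = full_range.difference(grp['Date'])
    print(site, list(missing.strftime('%d/%m/%Y')))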

Does anyone know how to do this?

Real data below:
https://opendata-nzta.opendata.arcgis.com/datasets/NZTA::tms-daily-traffic-counts-api/explore?showTable=true

I appreciate any help.
#2
Quote:Question: How to iterate over rows in a DataFrame in Pandas?

Answer: DON'T*!
Quote:Obligatory disclaimer from the documentation

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:

Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing, …
When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply() instead of iterating over the values. See the docs on function application.
If you need to do iterative manipulations on the values but performance is important, consider writing the inner loop with cython or numba. See the enhancing performance section for some examples of this approach.
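
As a quick illustration of the vectorized point, using the little frame from your post (the names here are just for demonstration):

import pandas as pd

df = pd.DataFrame({'Site ref': [1, 1, 2], 'amount': [5, 6, 3]})

# Loop style (slow on big frames): sum 'amount' per site by hand
totals = {}
for _, row in df.iterrows():
    totals[row['Site ref']] = totals.get(row['Site ref'], 0) + row['amount']

# Vectorized style: one groupby call does the same work internally in C
totals_vec = df.groupby('Site ref')['amount'].sum()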

What exactly would you like to do with the data? Your link has 5,446,602 rows, starting with 12/31/2017, 12:00 PM

You want a df with a dates column where the dates are continuous and sequential, starting from '1/12/2002'?

There are a lot of dates missing from '1/12/2002' to '7/01/2003'?

Then you want to fill in the missing data with averages of what? How would that be representative of actual events?

Maybe you can give a clear, small example of what you want to do?
#3
Basically, I want to get the sums of the traffic count by site reference and vehicle class for each month. As part of the data cleaning I filter out site references that have less than 60 months of traffic activity, and I then filter out site references that have been duplicated (not many, only about 8 have this issue). I also filter out 2017-12 because the counts are weird, so the data starts from Jan 2018. After all these filters the dataframe is probably still too big for a loop, so I'm keen to hear any ideas about how to go about this.

For a site reference with no traffic, it should have 0. When there is no data for a day (i.e. the row is missing), it is assumed the meter was down for that date. I want to fill in the data for when the meter was down by taking the average of the last 7 days for that site reference by vehicle class (heavy, light) and using that amount.

Does that clarify? I can share all my code if that is unclear.

Thanks
#4
Quote:For site reference with no traffic, it should have 0.

Pandas has fillna()

This will replace all NaN with zero everywhere (note that fillna returns a new DataFrame unless you pass inplace=True):

df = df.fillna(0)
60 months = 5 years = 5 * 365 days, plus 1 or 2 days depending on the number of leap years.

2018 to mid-2024 is only about 6 1/2 years. How many sites have continuous sequential dates for more than 5 years?

Won't you be filtering out almost everything?
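
One way to check, as a sketch (column names are taken from the CSV, and df is the full frame loaded from it):

import pandas as pd

# Count distinct year-months per site and see how many reach 60
df['Start Date'] = pd.to_datetime(df['Start Date'])
months_per_site = (df.assign(month=df['Start Date'].dt.to_period('M'))
                     .groupby('Site Reference')['month'].nunique())
print((months_per_site >= 60).sum(), 'sites have at least 60 months of data')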
#5
I know about fillna, but remember there are no rows for the missing dates. fillna is for when you have a column with a NaN value you want to replace with a 0.

The filtering for the 60 months is done by counting the dates by site reference. The Date variable takes the raw date and converts it to month and year. A site reference only has a missing month if it has no amounts at all for that month. Below is some of the code to clarify:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# https://opendata-nzta.opendata.arcgis.com/datasets/NZTA::tms-daily-traffic-counts-api/explore?showTable=true
path = r'C:\Users\SGrah\OneDrive\Documents\Python Scripts\Data\Economics data\TMS_Telemetry_Sites_4135033762041494366.csv'

df = pd.read_csv(path, header=0, usecols=[0,1,2,3,4,5,6,7,8,9], index_col=[0])

df2 = df.copy()

df2['Start Date'] = pd.to_datetime(df2['Start Date'])

# Collapse each date to the first day of its month (year-month granularity)
df2['Date'] = df2['Start Date'].dt.to_period('M').dt.to_timestamp()

# Pivot the DataFrame: one column per Class Weight, values are summed Traffic Count
pivot_df = df2.pivot_table(index=['Date', 'Site Description'], columns='Class Weight',
                           values='Traffic Count', aggfunc='sum', fill_value=0)
pivot_df.columns = [f'{col}_class' for col in pivot_df.columns]
# Reset index to get Date and Site Description back as columns
pivot_df = pivot_df.reset_index()

df3 = pivot_df

# Number of months with any data, per site
df3['count'] = df3.groupby(['Site Description'])['Date'].transform('count')
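
The filter on that count is then just:

# Keep only site references with at least 60 months of traffic activity
df3 = df3[df3['count'] >= 60]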


Does that give a better idea of how the count filter works? Filtering this way leaves about 30% of the original data.

Thanks
#6
Seems to me it might be a good idea to save all entries for each Site Alias as its own csv. (I am not used to dealing with enormous quantities of data!)

I know you said you don't want dates from 2017, but just as an example which does not have millions of rows, get all Site Alias 39 rows:

import pandas as pd

path2csv = '/home/pedro/myPython/pandas/csv_files/TMS_Telemetry_Sites_4135033762041494366.csv'
savepath = '/home/pedro/myPython/pandas/csv_files/'

df = pd.read_csv(path2csv, header=0, usecols=[0,1,2,3,4,5,6,7,8,9], index_col=[0])
df.shape # returns (5446602, 9)
df.columns
"""
Index(['Start Date', 'Site Alias', 'Region Name', 'Site Reference',
       'Class Weight', 'Site Description', 'Lane Number', 'Flow Direction',
       'Traffic Count'],
      dtype='object')"""
# get 1 site by Site Alias
df39 = df[df['Site Alias'] == 39].copy()
df39.shape # returns (8760, 9)
df39.to_csv(savepath + '_df39.csv', encoding='utf-8')
Now, what exactly do you want to do with df39?

If you look, at least some dates are the wrong way round; some 2022 dates come after 2024:

Output:
df39
                      Start Date  Site Alias  ...  Flow Direction  Traffic Count
OBJECTID                                      ...
1         12/31/2017 12:00:00 PM          39  ...               2         2468.0
2         12/31/2017 12:00:00 PM          39  ...               2           73.0
3         12/31/2017 12:00:00 PM          39  ...               1           74.0
4         12/31/2017 12:00:00 PM          39  ...               1         3806.0
1125        1/1/2018 12:00:00 PM          39  ...               2          101.5
...                          ...         ...  ...             ...            ...
5413355    5/13/2024 12:00:00 AM          39  ...               1          362.5
5431250   11/28/2022 12:00:00 AM          39  ...               2          476.0
5433078   11/28/2022 12:00:00 AM          39  ...               1         3255.5
5436538   11/28/2022 12:00:00 AM          39  ...               1          368.5
5436651   11/28/2022 12:00:00 AM          39  ...               2         3342.0
#7
I've been told to use site reference so it might be better to filter to one of those.

I'm not too sure why the ordering of the start date is that way, but it could be that the site aliases are the machines used to record the traffic, so when they take one and change its location it gets a new site reference and site description.

But what I want to do is take a site reference and, when there is a date missing, use the average of the traffic count over the last 7 days to fill in the missing traffic count for each class weight. All other fields would remain the same. To make things easier I remove all the fields I don't need and pivot the class weights into columns, with the values in those columns being the traffic counts (as per the code above).

Thanks
#8
There are these numbers of different Site Alias and Site Reference values. Not sure why they are not the same! Data problem?

ids = set(df['Site Alias'])
len(ids) # returns 1727
ids = set(df['Site Reference'])
len(ids) # returns 1775
If you want to work by Site Reference, you could save 1775 different csvs and work on them individually. That might be better than adding rows to the 5,446,602 rows you already have! But I don't suppose Pandas minds!
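
Something like this would write them all out (a sketch; the file names are just illustrative):

# One CSV per Site Reference
for site_ref, group in df.groupby('Site Reference'):
    group.to_csv(f'{savepath}site_{site_ref}.csv', encoding='utf-8')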

It is easier for me to imagine when I am only dealing with 1 Site Reference!

Sort them to get the dates you have in order (convert 'Start Date' to real datetimes first, otherwise the sort is alphabetical):

df39['Start Date'] = pd.to_datetime(df39['Start Date'])
df39_sorted_by_date = df39.sort_values(by='Start Date', ascending=True).copy()
You still have to distinguish / sort by Light and Heavy and Lane Number, because these have the same dates.

Then do the wizardry of finding the non-sequential dates, inserting rows until the sequence is complete, and then do your averaging?
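
A rough sketch of that last step, assuming you first build a daily frame (called daily_df here, hypothetical) with one row per site per day, and that your pivot produced 'Heavy_class' and 'Light_class' columns. Because each filled value can feed the next average, a small inner loop per site seems hard to avoid:

import pandas as pd

def fill_site(group, value_cols=('Heavy_class', 'Light_class')):
    # Reindex to a full daily range so every missing day becomes a NaN row
    full = pd.date_range(group['Date'].min(), group['Date'].max(), freq='D')
    group = group.set_index('Date').reindex(full)
    group['Site Description'] = group['Site Description'].ffill()
    for col in value_cols:
        vals = group[col].to_list()
        # Walk forward so filled values can feed later averages
        # (the 'averages of averages' you described)
        for i, v in enumerate(vals):
            if pd.isna(v):
                window = [x for x in vals[max(0, i - 7):i] if pd.notna(x)]
                vals[i] = sum(window) / len(window) if window else 0
        group[col] = vals
    return group.rename_axis('Date').reset_index()

filled = (daily_df.groupby('Site Description', group_keys=False)
                  .apply(fill_site))

If you want every site to run to the overall latest date rather than its own last reading, use daily_df['Date'].max() as the end of the date_range instead.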

If you work on much smaller csvs, the task does not seem so daunting to me!