Working with dates in pandas

Scott · Jul-21-2024, 10:08 PM

Hi everyone,

I am having a confusing issue in pandas which I'll highlight in the code below. You'll need to update the path to your own one. You can access the data from the http link.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#https://opendata-nzta.opendata.arcgis.com/datasets/NZTA::tms-daily-traffic-counts-api/explore?showTable=true
path = r'C:\Users\SGrah\OneDrive\Documents\Python Scripts\Data\Economics data\TMS_Telemetry_Sites_4135033762041494366.csv' 

df = pd.read_csv(path, header=0, usecols=[0,1,2,3,4,5,6,7,8,9], index_col=[0])
df.head()

df2=df
df2['Start Date'] =pd.to_datetime(df['Start Date'])

df2['Day'] = df2['Start Date'].dt.day
df2['Month'] = df2['Start Date'].dt.month
df2['Year'] = df2['Start Date'].dt.year

#First loc which shows dates
df2.loc[(df2['Site Reference'] == '01N30420') & (df2['Traffic Count'].isin([687.5,779.0]))].head(2)

df2['Date'] = pd.to_datetime(df['Year'].astype(str) +
                                        df2['Month'].astype(str), format='%Y%m')

df2['Start Date2'] = pd.to_datetime(df['Year'].astype(str) +
                                        df2['Month'].astype(str) +
                                        df2['Day'].astype(str), format='%Y%m%d')

#df2.drop(['Day','Month','Year'], axis=1, inplace=True)
df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m-%d')
df2['Start Date2'] = pd.to_datetime(df2['Start Date2'], format='%Y-%m-%d')
df2.head()

#second loc which shows the date and month being switched causing a duplicate.
df2.loc[(df2['Site Reference'] == '01N30420') & (df2['Traffic Count'].isin([687.5,779.0]))].head(2)

Does anyone know why this issue is occurring? I need to be able to set the date correctly to stop this double up.

Thanks for any help

**deanhystad** · (This post was last modified: Jul-21-2024, 11:27 PM by deanhystad.)

df2 = df does not make a new dataframe. Use df2 = df.copy(). The default separator for a CSV file is a comma.

I am unclear about what the problem is. Is it when you do this?

df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m-%d')
df2['Start Date2'] = pd.to_datetime(df2['Start Date2'], format='%Y-%m-%d')

df2['Date'] and df2['Start Date2'] are already datetime objects. They are not strings. Calling pd.datetime() for these columns does not make any sense.

Scott · Jul-21-2024, 11:52 PM

Yes that is where the issue is happening.

The issue is if you look at start_date2 on the loc filter it has mixed up the month and day for one of the rows causing it to duplicate.

I probably called pd.datetime() because it was the only way I knew how to change the format when I looked online.

**deanhystad** · Jul-22-2024, 03:29 AM

Can you post an example where it is getting the date wrong? What date causes the problem? I don't see a problem looking at the start of the data.

import pandas as pd
from io import StringIO

data = """OBJECTID,Start Date,Site Alias,Region Name,Site Reference,Class Weight,Site Description,Lane Number,Flow Direction,Traffic Count
1,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Light,200 m Nth of Bell Rd,2,2,2468
2,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Heavy,200 m Nth of Bell Rd,2,2,73
3,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Heavy,200 m Nth of Bell Rd,1,1,74
4,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Light,200 m Nth of Bell Rd,1,1,3806"""


df = pd.read_csv(StringIO(data), parse_dates=['Start Date'], index_col="OBJECTID")
df['Day'] = df['Start Date'].dt.day
df['Month'] = df['Start Date'].dt.month
df['Year'] = df['Start Date'].dt.year
df['Start Date2'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
print(df[['Start Date', 'Start Date2']])

Output:                  Start Date Start Date2
OBJECTID
1        2017-12-31 12:00:00  2017-12-31
2        2017-12-31 12:00:00  2017-12-31
3        2017-12-31 12:00:00  2017-12-31
4        2017-12-31 12:00:00  2017-12-31

Or maybe I don't know what I'm looking for.

Scott · Jul-22-2024, 07:06 AM

You won't see the issue if you just run the code on a small subset of the data. It only does the error on some records.

run this code to see the issue:

#second loc which shows the date and month being switched causing a duplicate.
df2.loc[(df2['Site Reference'] == '01N30420') & (df2['Traffic Count'].isin([687.5,779.0]))].head(2)

Pedroski55 · Jul-22-2024, 07:24 AM

This was that 600MB csv file with 5.5 million rows I remember it from another question!

I split the big file into 1775 separate csvs according to 'Site Reference', to make it easier to handle. Took 40 minutes to do that on this computer!

Now if I read in 1 of the csvs:

import pandas as pd

savepath = '/home/pedro/myPython/pandas/csv_files/TMS/'
df = pd.read_csv(savepath + '01B00015.csv')
df.shape # (416, 10)
# gaps in column names can cause problems get rid of them
cols = list(df.columns)
newcols = [col.replace(' ', '_') for col in cols]
for i in range(len(cols)):    
    df.rename(columns={cols[i]:newcols[i]}, inplace=True)

# make sure column Start_Date is datetime
df.Start_Date.dtype # returns dtype('O')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df.Start_Date.dtype # returns dtype('<M8[ns]')
# sort by date ascending
df = df.sort_values(by='Start_Date', ascending=True)

Pedroski55 · Jul-22-2024, 09:58 AM

If I import the offending csv: 01N30420.csv

import pandas as pd

savepath = '/home/pedro/myPython/pandas/csv_files/TMS/'
#df = pd.read_csv(savepath + '01B00015.csv')
df = pd.read_csv(savepath + '01N30420.csv')
df.shape # (3314, 10)
# gaps in column names can cause problems get rid of them
cols = list(df.columns)
newcols = [col.replace(' ', '_') for col in cols]
for i in range(len(cols)):    
    df.rename(columns={cols[i]:newcols[i]}, inplace=True)

# make sure column Start_Date is datetime
df.Start_Date.dtype # returns dtype('O')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df.Start_Date.dtype # returns dtype('<M8[ns]')
# sort by date ascending
df = df.sort_values(by='Start_Date', ascending=True)

df.loc[(df['Traffic_Count'].isin([687.5,779.0]))]

Gives:

Output:      OBJECTID          Start_Date  ...  Flow_Direction Traffic_Count
22       19124 2018-01-11 12:00:00  ...               3         779.0
608     648670 2018-11-01 12:00:00  ...               3         687.5
883     987559 2019-04-11 12:00:00  ...               3         779.0
2749   3107228 2021-11-03 12:00:00  ...               3         779.0

[4 rows x 10 columns]

**Larz60+** · Jul-22-2024, 10:29 AM

Use chunks:
see pandas.read_csv
read section beginning with iterator

**deanhystad** · (This post was last modified: Jul-23-2024, 01:00 AM by deanhystad.)

Your date string is ambiguous. 2018111 can be day = 1, month = 11 or day = 11, month = 1. You could add separators or you could build Start Date2 differently. If you want to convert a datetime object to a dateobject I suggest using functions built into pandas for doing this.

import pandas as pd
from io import StringIO

data = StringIO(
"""OBJECTID,Start Date
1,2018-01-11 12:00:00
2,2018-11-01 12:00:00
3,2019-04-11 12:00:00
4,2021-11-03 12:00:00""")
df = pd.read_csv(data, parse_dates=["Start Date"], index_col="OBJECTID")
df["y"] = df["Start Date"].dt.year
df["m"] = df["Start Date"].dt.month
df["d"] = df["Start Date"].dt.day
df["Start Date 2"] = pd.to_datetime(df["y"].astype(str) + df["m"].astype(str) + df["d"].astype(str), format="%Y%m%d")
df["Start Date 3"] = df["Start Date"].dt.date
print(df)

Output:                  Start Date     y   m   d Start Date 2 Start Date 3
OBJECTID
1        2018-01-11 12:00:00  2018   1  11   2018-11-01   2018-01-11 
2        2018-11-01 12:00:00  2018  11   1   2018-11-01   2018-11-01 
3        2019-04-11 12:00:00  2019   4  11   2019-04-11   2019-04-11 
4        2021-11-03 12:00:00  2021  11   3   2021-11-03   2021-11-03

DataFrame.dt provides access to datetime-like properties for items in a series.
https://pandas.pydata.org/docs/reference...es.dt.html

dt.date extracts just the date information. Scroll down to the Time/date components section.
https://pandas.pydata.org/pandas-docs/st...eries.html

Next time you post, try to find a more convenient way for others to run your code and see the problem. You'll get quicker and better responses if others are not forced to download a 600 megabyte file before they can help. For example, this code demonstrates your problem.

import pandas as pd
from datetime import datetime

df = pd.DataFrame({"Start Date": (datetime(year=2018, month=1, day=11), datetime(year=2018, month=11, day=1))})
df["y"] = df["Start Date"].dt.year
df["m"] = df["Start Date"].dt.month
df["d"] = df["Start Date"].dt.day
df["Start Date2"] = pd.to_datetime(df["y"].astype(str) + df["m"].astype(str) + df["d"].astype(str), format="%Y%m%d")
print(df)

Output:  Start Date     y   m   d Start Date2
0 2018-01-11  2018   1  11  2018-11-01
1 2018-11-01  2018  11   1  2018-11-01

For completeness here's an example that uses separators to avoid confusion.

import pandas as pd
from datetime import datetime

df = pd.DataFrame({"Start Date": (datetime(year=2018, month=1, day=11), datetime(year=2018, month=11, day=1))})
df["y"] = df["Start Date"].dt.year.astype(str)
df["m"] = df["Start Date"].dt.month.astype(str)
df["d"] = df["Start Date"].dt.day.astype(str)
df["datestr"] = df[["y", "m", "d"]].agg("-".join, axis=1)
df["Start Date2"] = pd.to_datetime(df["datestr"], format="%Y-%m-%d")
print(df)

Output:  Start Date     y   m   d    datestr Start Date2
0 2018-01-11  2018   1  11  2018-1-11  2018-01-11
1 2018-11-01  2018  11   1  2018-11-1  2018-11-01

Scott · Jul-23-2024, 01:53 AM

Hey mate,

Thanks so much, it makes sense now. I was going crazy looking at it and not being able to figure out wtf was happening.

I appreciate the feedback, but I did not know the issue so could not have posted this example. df = pd.DataFrame({"Start Date": (datetime(year=2018, month=1, day=11), datetime(year=2018, month=11, day=1))})

Thanks again.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Pandas module not working	Hass	2	2,966	Apr-14-2022, 03:13 PM Last Post: snippsat
	Working with dates	Scott	1	2,796	Apr-22-2019, 12:20 AM Last Post: Larz60+

Working with dates in pandas

User Panel Messages

Announcements