Python Forum
Working with dates in pandas
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Working with dates in pandas
#1
Hi everyone,

I am having a confusing issue in pandas which I'll highlight in the code below. You'll need to update the path to your own one. You can access the data from the http link.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#https://opendata-nzta.opendata.arcgis.com/datasets/NZTA::tms-daily-traffic-counts-api/explore?showTable=true
path = r'C:\Users\SGrah\OneDrive\Documents\Python Scripts\Data\Economics data\TMS_Telemetry_Sites_4135033762041494366.csv' 

df = pd.read_csv(path, header=0, usecols=[0,1,2,3,4,5,6,7,8,9], index_col=[0])
df.head()

df2=df
df2['Start Date'] =pd.to_datetime(df['Start Date'])

df2['Day'] = df2['Start Date'].dt.day
df2['Month'] = df2['Start Date'].dt.month
df2['Year'] = df2['Start Date'].dt.year

#First loc which shows dates
df2.loc[(df2['Site Reference'] == '01N30420') & (df2['Traffic Count'].isin([687.5,779.0]))].head(2)

df2['Date'] = pd.to_datetime(df['Year'].astype(str) +
                                        df2['Month'].astype(str), format='%Y%m')

df2['Start Date2'] = pd.to_datetime(df['Year'].astype(str) +
                                        df2['Month'].astype(str) +
                                        df2['Day'].astype(str), format='%Y%m%d')

#df2.drop(['Day','Month','Year'], axis=1, inplace=True)
df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m-%d')
df2['Start Date2'] = pd.to_datetime(df2['Start Date2'], format='%Y-%m-%d')
df2.head()

#second loc which shows the date and month being switched causing a duplicate.
df2.loc[(df2['Site Reference'] == '01N30420') & (df2['Traffic Count'].isin([687.5,779.0]))].head(2)
Does anyone know why this issue is occurring? I need to be able to set the date correctly to stop this double up.

Thanks for any help
Reply
#2
df2 = df does not make a new dataframe. Use df2 = df.copy(). The default separator for a CSV file is a comma.

I am unclear about what the problem is. Is it when you do this?
df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m-%d')
df2['Start Date2'] = pd.to_datetime(df2['Start Date2'], format='%Y-%m-%d')
df2['Date'] and df2['Start Date2'] are already datetime objects. They are not strings. Calling pd.datetime() for these columns does not make any sense.
Reply
#3
Yes that is where the issue is happening.

The issue is if you look at start_date2 on the loc filter it has mixed up the month and day for one of the rows causing it to duplicate.

I probably called pd.datetime() because it was the only way I knew how to change the format when I looked online.
Reply
#4
Can you post an example where it is getting the date wrong? What date causes the problem? I don't see a problem looking at the start of the data.
import pandas as pd
from io import StringIO

data = """OBJECTID,Start Date,Site Alias,Region Name,Site Reference,Class Weight,Site Description,Lane Number,Flow Direction,Traffic Count
1,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Light,200 m Nth of Bell Rd,2,2,2468
2,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Heavy,200 m Nth of Bell Rd,2,2,73
3,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Heavy,200 m Nth of Bell Rd,1,1,74
4,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Light,200 m Nth of Bell Rd,1,1,3806"""


df = pd.read_csv(StringIO(data), parse_dates=['Start Date'], index_col="OBJECTID")
df['Day'] = df['Start Date'].dt.day
df['Month'] = df['Start Date'].dt.month
df['Year'] = df['Start Date'].dt.year
df['Start Date2'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
print(df[['Start Date', 'Start Date2']])
Output:
Start Date Start Date2 OBJECTID 1 2017-12-31 12:00:00 2017-12-31 2 2017-12-31 12:00:00 2017-12-31 3 2017-12-31 12:00:00 2017-12-31 4 2017-12-31 12:00:00 2017-12-31
Or maybe I don't know what I'm looking for.
Reply
#5
You won't see the issue if you just run the code on a small subset of the data. It only does the error on some records.

run this code to see the issue:

#second loc which shows the date and month being switched causing a duplicate.
df2.loc[(df2['Site Reference'] == '01N30420') & (df2['Traffic Count'].isin([687.5,779.0]))].head(2)
Reply
#6
This was that 600MB csv file with 5.5 million rows I remember it from another question!

I split the big file into 1775 separate csvs according to 'Site Reference', to make it easier to handle. Took 40 minutes to do that on this computer!

Now if I read in 1 of the csvs:

import pandas as pd

savepath = '/home/pedro/myPython/pandas/csv_files/TMS/'
df = pd.read_csv(savepath + '01B00015.csv')
df.shape # (416, 10)
# gaps in column names can cause problems get rid of them
cols = list(df.columns)
newcols = [col.replace(' ', '_') for col in cols]
for i in range(len(cols)):    
    df.rename(columns={cols[i]:newcols[i]}, inplace=True)

# make sure column Start_Date is datetime
df.Start_Date.dtype # returns dtype('O')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df.Start_Date.dtype # returns dtype('<M8[ns]')
# sort by date ascending
df = df.sort_values(by='Start_Date', ascending=True)
Reply
#7
If I import the offending csv: 01N30420.csv


import pandas as pd

savepath = '/home/pedro/myPython/pandas/csv_files/TMS/'
#df = pd.read_csv(savepath + '01B00015.csv')
df = pd.read_csv(savepath + '01N30420.csv')
df.shape # (3314, 10)
# gaps in column names can cause problems get rid of them
cols = list(df.columns)
newcols = [col.replace(' ', '_') for col in cols]
for i in range(len(cols)):    
    df.rename(columns={cols[i]:newcols[i]}, inplace=True)

# make sure column Start_Date is datetime
df.Start_Date.dtype # returns dtype('O')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df.Start_Date.dtype # returns dtype('<M8[ns]')
# sort by date ascending
df = df.sort_values(by='Start_Date', ascending=True)

df.loc[(df['Traffic_Count'].isin([687.5,779.0]))]
Gives:

Output:
OBJECTID Start_Date ... Flow_Direction Traffic_Count 22 19124 2018-01-11 12:00:00 ... 3 779.0 608 648670 2018-11-01 12:00:00 ... 3 687.5 883 987559 2019-04-11 12:00:00 ... 3 779.0 2749 3107228 2021-11-03 12:00:00 ... 3 779.0 [4 rows x 10 columns]
Reply
#8
Use chunks:
see pandas.read_csv
read section beginning with iterator
Reply
#9
Your date string is ambiguous. 2018111 can be day = 1, month = 11 or day = 11, month = 1. You could add separators or you could build Start Date2 differently. If you want to convert a datetime object to a dateobject I suggest using functions built into pandas for doing this.
import pandas as pd
from io import StringIO

data = StringIO(
"""OBJECTID,Start Date
1,2018-01-11 12:00:00
2,2018-11-01 12:00:00
3,2019-04-11 12:00:00
4,2021-11-03 12:00:00""")
df = pd.read_csv(data, parse_dates=["Start Date"], index_col="OBJECTID")
df["y"] = df["Start Date"].dt.year
df["m"] = df["Start Date"].dt.month
df["d"] = df["Start Date"].dt.day
df["Start Date 2"] = pd.to_datetime(df["y"].astype(str) + df["m"].astype(str) + df["d"].astype(str), format="%Y%m%d")
df["Start Date 3"] = df["Start Date"].dt.date
print(df)
Output:
Start Date y m d Start Date 2 Start Date 3 OBJECTID 1 2018-01-11 12:00:00 2018 1 11 2018-11-01 2018-01-11 2 2018-11-01 12:00:00 2018 11 1 2018-11-01 2018-11-01 3 2019-04-11 12:00:00 2019 4 11 2019-04-11 2019-04-11 4 2021-11-03 12:00:00 2021 11 3 2021-11-03 2021-11-03
DataFrame.dt provides access to datetime-like properties for items in a series.
https://pandas.pydata.org/docs/reference...es.dt.html

dt.date extracts just the date information. Scroll down to the Time/date components section.
https://pandas.pydata.org/pandas-docs/st...eries.html

Next time you post, try to find a more convenient way for others to run your code and see the problem. You'll get quicker and better responses if others are not forced to download a 600 megabyte file before they can help. For example, this code demonstrates your problem.
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"Start Date": (datetime(year=2018, month=1, day=11), datetime(year=2018, month=11, day=1))})
df["y"] = df["Start Date"].dt.year
df["m"] = df["Start Date"].dt.month
df["d"] = df["Start Date"].dt.day
df["Start Date2"] = pd.to_datetime(df["y"].astype(str) + df["m"].astype(str) + df["d"].astype(str), format="%Y%m%d")
print(df)
Output:
Start Date y m d Start Date2 0 2018-01-11 2018 1 11 2018-11-01 1 2018-11-01 2018 11 1 2018-11-01
For completeness here's an example that uses separators to avoid confusion.
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"Start Date": (datetime(year=2018, month=1, day=11), datetime(year=2018, month=11, day=1))})
df["y"] = df["Start Date"].dt.year.astype(str)
df["m"] = df["Start Date"].dt.month.astype(str)
df["d"] = df["Start Date"].dt.day.astype(str)
df["datestr"] = df[["y", "m", "d"]].agg("-".join, axis=1)
df["Start Date2"] = pd.to_datetime(df["datestr"], format="%Y-%m-%d")
print(df)
Output:
Start Date y m d datestr Start Date2 0 2018-01-11 2018 1 11 2018-1-11 2018-01-11 1 2018-11-01 2018 11 1 2018-11-1 2018-11-01
Reply
#10
Hey mate,

Thanks so much, it makes sense now. I was going crazy looking at it and not being able to figure out wtf was happening.

I appreciate the feedback, but I did not know the issue so could not have posted this example. df = pd.DataFrame({"Start Date": (datetime(year=2018, month=1, day=11), datetime(year=2018, month=11, day=1))})

Thanks again.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Pandas module not working Hass 2 2,966 Apr-14-2022, 03:13 PM
Last Post: snippsat
  Working with dates Scott 1 2,796 Apr-22-2019, 12:20 AM
Last Post: Larz60+

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020