Posts: 101
Threads: 48
Joined: May 2018
Hi everyone,
I am having a confusing issue in pandas which I'll highlight in the code below. You'll need to update the path to your own one. You can access the data from the http link.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#https://opendata-nzta.opendata.arcgis.com/datasets/NZTA::tms-daily-traffic-counts-api/explore?showTable=true
path = r'C:\Users\SGrah\OneDrive\Documents\Python Scripts\Data\Economics data\TMS_Telemetry_Sites_4135033762041494366.csv'
df = pd.read_csv(path, header=0, usecols=[0,1,2,3,4,5,6,7,8,9], index_col=[0])
df.head()
df2=df
df2['Start Date'] =pd.to_datetime(df['Start Date'])
df2['Day'] = df2['Start Date'].dt.day
df2['Month'] = df2['Start Date'].dt.month
df2['Year'] = df2['Start Date'].dt.year
#First loc which shows dates
df2.loc[(df2['Site Reference'] == '01N30420') & (df2['Traffic Count'].isin([687.5,779.0]))].head(2)
df2['Date'] = pd.to_datetime(df['Year'].astype(str) +
df2['Month'].astype(str), format='%Y%m')
df2['Start Date2'] = pd.to_datetime(df['Year'].astype(str) +
df2['Month'].astype(str) +
df2['Day'].astype(str), format='%Y%m%d')
#df2.drop(['Day','Month','Year'], axis=1, inplace=True)
df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m-%d')
df2['Start Date2'] = pd.to_datetime(df2['Start Date2'], format='%Y-%m-%d')
df2.head()
#second loc which shows the date and month being switched causing a duplicate.
df2.loc[(df2['Site Reference'] == '01N30420') & (df2['Traffic Count'].isin([687.5,779.0]))].head(2) Does anyone know why this issue is occurring? I need to be able to set the date correctly to stop this double up.
Thanks for any help
Posts: 6,779
Threads: 20
Joined: Feb 2020
Jul-21-2024, 11:27 PM
(This post was last modified: Jul-21-2024, 11:27 PM by deanhystad.)
df2 = df does not make a new dataframe. Use df2 = df.copy(). The default separator for a CSV file is a comma.
I am unclear about what the problem is. Is it when you do this?
df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m-%d')
df2['Start Date2'] = pd.to_datetime(df2['Start Date2'], format='%Y-%m-%d') df2['Date'] and df2['Start Date2'] are already datetime objects. They are not strings. Calling pd.datetime() for these columns does not make any sense.
Posts: 101
Threads: 48
Joined: May 2018
Yes that is where the issue is happening.
The issue is if you look at start_date2 on the loc filter it has mixed up the month and day for one of the rows causing it to duplicate.
I probably called pd.datetime() because it was the only way I knew how to change the format when I looked online.
Posts: 6,779
Threads: 20
Joined: Feb 2020
Can you post an example where it is getting the date wrong? What date causes the problem? I don't see a problem looking at the start of the data.
import pandas as pd
from io import StringIO
data = """OBJECTID,Start Date,Site Alias,Region Name,Site Reference,Class Weight,Site Description,Lane Number,Flow Direction,Traffic Count
1,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Light,200 m Nth of Bell Rd,2,2,2468
2,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Heavy,200 m Nth of Bell Rd,2,2,73
3,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Heavy,200 m Nth of Bell Rd,1,1,74
4,12/31/2017 12:00:00 PM,39,05 - Gisborne,00200444,Light,200 m Nth of Bell Rd,1,1,3806"""
df = pd.read_csv(StringIO(data), parse_dates=['Start Date'], index_col="OBJECTID")
df['Day'] = df['Start Date'].dt.day
df['Month'] = df['Start Date'].dt.month
df['Year'] = df['Start Date'].dt.year
df['Start Date2'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
print(df[['Start Date', 'Start Date2']]) Output: Start Date Start Date2
OBJECTID
1 2017-12-31 12:00:00 2017-12-31
2 2017-12-31 12:00:00 2017-12-31
3 2017-12-31 12:00:00 2017-12-31
4 2017-12-31 12:00:00 2017-12-31
Or maybe I don't know what I'm looking for.
Posts: 101
Threads: 48
Joined: May 2018
You won't see the issue if you just run the code on a small subset of the data. It only does the error on some records.
run this code to see the issue:
#second loc which shows the date and month being switched causing a duplicate.
df2.loc[(df2['Site Reference'] == '01N30420') & (df2['Traffic Count'].isin([687.5,779.0]))].head(2)
Posts: 1,090
Threads: 143
Joined: Jul 2017
This was that 600MB csv file with 5.5 million rows I remember it from another question!
I split the big file into 1775 separate csvs according to 'Site Reference', to make it easier to handle. Took 40 minutes to do that on this computer!
Now if I read in 1 of the csvs:
import pandas as pd
savepath = '/home/pedro/myPython/pandas/csv_files/TMS/'
df = pd.read_csv(savepath + '01B00015.csv')
df.shape # (416, 10)
# gaps in column names can cause problems get rid of them
cols = list(df.columns)
newcols = [col.replace(' ', '_') for col in cols]
for i in range(len(cols)):
df.rename(columns={cols[i]:newcols[i]}, inplace=True)
# make sure column Start_Date is datetime
df.Start_Date.dtype # returns dtype('O')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df.Start_Date.dtype # returns dtype('<M8[ns]')
# sort by date ascending
df = df.sort_values(by='Start_Date', ascending=True)
Posts: 1,090
Threads: 143
Joined: Jul 2017
If I import the offending csv: 01N30420.csv
import pandas as pd
savepath = '/home/pedro/myPython/pandas/csv_files/TMS/'
#df = pd.read_csv(savepath + '01B00015.csv')
df = pd.read_csv(savepath + '01N30420.csv')
df.shape # (3314, 10)
# gaps in column names can cause problems get rid of them
cols = list(df.columns)
newcols = [col.replace(' ', '_') for col in cols]
for i in range(len(cols)):
df.rename(columns={cols[i]:newcols[i]}, inplace=True)
# make sure column Start_Date is datetime
df.Start_Date.dtype # returns dtype('O')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df.Start_Date.dtype # returns dtype('<M8[ns]')
# sort by date ascending
df = df.sort_values(by='Start_Date', ascending=True)
df.loc[(df['Traffic_Count'].isin([687.5,779.0]))] Gives:
Output: OBJECTID Start_Date ... Flow_Direction Traffic_Count
22 19124 2018-01-11 12:00:00 ... 3 779.0
608 648670 2018-11-01 12:00:00 ... 3 687.5
883 987559 2019-04-11 12:00:00 ... 3 779.0
2749 3107228 2021-11-03 12:00:00 ... 3 779.0
[4 rows x 10 columns]
Posts: 12,025
Threads: 484
Joined: Sep 2016
Use chunks:
see pandas.read_csv
read section beginning with iterator
Posts: 6,779
Threads: 20
Joined: Feb 2020
Jul-22-2024, 08:13 PM
(This post was last modified: Jul-23-2024, 01:00 AM by deanhystad.)
Your date string is ambiguous. 2018111 can be day = 1, month = 11 or day = 11, month = 1. You could add separators or you could build Start Date2 differently. If you want to convert a datetime object to a dateobject I suggest using functions built into pandas for doing this.
import pandas as pd
from io import StringIO
data = StringIO(
"""OBJECTID,Start Date
1,2018-01-11 12:00:00
2,2018-11-01 12:00:00
3,2019-04-11 12:00:00
4,2021-11-03 12:00:00""")
df = pd.read_csv(data, parse_dates=["Start Date"], index_col="OBJECTID")
df["y"] = df["Start Date"].dt.year
df["m"] = df["Start Date"].dt.month
df["d"] = df["Start Date"].dt.day
df["Start Date 2"] = pd.to_datetime(df["y"].astype(str) + df["m"].astype(str) + df["d"].astype(str), format="%Y%m%d")
df["Start Date 3"] = df["Start Date"].dt.date
print(df) Output: Start Date y m d Start Date 2 Start Date 3
OBJECTID
1 2018-01-11 12:00:00 2018 1 11 2018-11-01 2018-01-11
2 2018-11-01 12:00:00 2018 11 1 2018-11-01 2018-11-01
3 2019-04-11 12:00:00 2019 4 11 2019-04-11 2019-04-11
4 2021-11-03 12:00:00 2021 11 3 2021-11-03 2021-11-03
DataFrame.dt provides access to datetime-like properties for items in a series.
https://pandas.pydata.org/docs/reference...es.dt.html
dt.date extracts just the date information. Scroll down to the Time/date components section.
https://pandas.pydata.org/pandas-docs/st...eries.html
Next time you post, try to find a more convenient way for others to run your code and see the problem. You'll get quicker and better responses if others are not forced to download a 600 megabyte file before they can help. For example, this code demonstrates your problem.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({"Start Date": (datetime(year=2018, month=1, day=11), datetime(year=2018, month=11, day=1))})
df["y"] = df["Start Date"].dt.year
df["m"] = df["Start Date"].dt.month
df["d"] = df["Start Date"].dt.day
df["Start Date2"] = pd.to_datetime(df["y"].astype(str) + df["m"].astype(str) + df["d"].astype(str), format="%Y%m%d")
print(df) Output: Start Date y m d Start Date2
0 2018-01-11 2018 1 11 2018-11-01
1 2018-11-01 2018 11 1 2018-11-01
For completeness here's an example that uses separators to avoid confusion.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({"Start Date": (datetime(year=2018, month=1, day=11), datetime(year=2018, month=11, day=1))})
df["y"] = df["Start Date"].dt.year.astype(str)
df["m"] = df["Start Date"].dt.month.astype(str)
df["d"] = df["Start Date"].dt.day.astype(str)
df["datestr"] = df[["y", "m", "d"]].agg("-".join, axis=1)
df["Start Date2"] = pd.to_datetime(df["datestr"], format="%Y-%m-%d")
print(df) Output: Start Date y m d datestr Start Date2
0 2018-01-11 2018 1 11 2018-1-11 2018-01-11
1 2018-11-01 2018 11 1 2018-11-1 2018-11-01
Posts: 101
Threads: 48
Joined: May 2018
Hey mate,
Thanks so much, it makes sense now. I was going crazy looking at it and not being able to figure out wtf was happening.
I appreciate the feedback, but I did not know the issue so could not have posted this example. df = pd.DataFrame({"Start Date": (datetime(year=2018, month=1, day=11), datetime(year=2018, month=11, day=1))})
Thanks again.
|