Python Forum

Full Version: pandas restricting csv read to certain rows
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Dear Pandas Experts,

I am tryig to extract data from a .csv file that contains columns called CarId, IssueDate

import pandas as pd
    train = pd.read_csv('train.csv', index_col=False, encoding="ISO-8859-1")
The issue date is of format "mm/dd/yyyy".
I want to get only those rows that have a year between 2012 and 2016.

Can someone help with that? I have no idea how to make this efficient i.e
maybe filtering before all the data is extracted.
You should look here:
https://stackoverflow.com/questions/1746...nize-dates
https://stackoverflow.com/questions/2937...-two-dates

import pandas as pd


df = pd.read_csv('dates.csv', delimiter=';', parse_dates=['date'])
print(df.dtypes)
mask =(df['date'] > pd.Timestamp(2012,1,1)) & (df['date'] < pd.Timestamp(2016,1,1))
print(df[mask])
Hi Dead_Eye,

Many thanks for your reply.
I tried what you suggested but even though there are years in range, nothing gets extracted.

    mask =(pd.DatetimeIndex(train_df['ticket_issued_date']).year > 2012) & (pd.DatetimeIndex(train_df['ticket_issued_date']).year < 2016)
    print(train_df[mask]) # is empty

    train_df['yearcolumn'] = pd.DatetimeIndex(train_df['ticket_issued_date']).year
    print(train_df['yearcolumn'].unique())
    #output [2004 2005 2006 2007 1938 1963 1988 2008 2009 2010 2011]
Can someone help with this? I wonder why the condition filters out all rows.
(Dec-12-2017, 01:22 PM)metalray Wrote: [ -> ]Can someone help with this? I wonder why the condition filters out all rows.
As always with Pandas question is simpler to answer question if provide sample input data that can be run.

Here generate dates from 2013 to 2017.
Take out dates from 2015 to 2016.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((60,3)))
df['date'] = pd.date_range('2013-1-1', periods=60, freq='M')
mask = (df['date'] >= '09-01-2015') & (df['date'] <= '11-30-2016')
print(df.loc[mask])
Output:
           0         1         2       date 32  0.141371  0.146690  0.569466 2015-09-30 33  0.737064  0.233668  0.873330 2015-10-31 34  0.436252  0.494779  0.774684 2015-11-30 35  0.972824  0.590000  0.590881 2015-12-31 36  0.755075  0.409778  0.429689 2016-01-31 37  0.854687  0.807047  0.369351 2016-02-29 38  0.032468  0.362019  0.601016 2016-03-31 39  0.675107  0.063326  0.681630 2016-04-30 40  0.746973  0.055325  0.998878 2016-05-31 41  0.175564  0.967490  0.873183 2016-06-30 42  0.036599  0.011311  0.980002 2016-07-31 43  0.827632  0.246903  0.251752 2016-08-31 44  0.084068  0.974440  0.013576 2016-09-30 45  0.414893  0.853023  0.154171 2016-10-31 46  0.950556  0.973976  0.890843 2016-11-30
Got it. Thanks!