High level I'm trying to compare two sets of data 100k approx and identify matching records in the second file that have a later date. I can import Csv and lookup using sets but can't figure out how to consider the latest date. What is the best way to approach this? E.g
File 1
[email protected], 1/1/17
[email protected] 2/3/16
File 2
[email protected]
[email protected] 9/9/17
Record
[email protected] 9/9/17 identified.
Thanks.
what have you tried so far?
show where you think the problems are located.
The easiest way to compare dates, is to have date objects and compare them.
>>> import time
>>> start = time.strptime('2/3/16', '%m/%d/%y')
>>> end = time.strptime('9/9/17', '%m/%d/%y')
>>> start < end
True
Cheers, got it, having a moment. Will try again tonight.
I got this working, thanks:
import csv
import datetime
csv1 = ('c:\data\opt\csv1.csv')
csv2 = ('c:\data\opt\csv2.csv')
source = csv.reader(open(csv1,'rb'))
lookup = csv.reader(open(csv2,'rb'))
for data in lookup:
lkeml = data[0]
lkdt = datetime.datetime.strptime(data[1],"%d/%m/%y")
for data1 in source:
srceml = data1[0]
srcdt = datetime.datetime.strptime(data1[1],"%d/%m/%y")
if lkeml in srceml and lkdt > srcdt:
print lkeml+"," +str(lkdt)
It identifies matches but the date includes time stamp even though it's not included in the formatting. I've had a look but can't find a specific method just for date, what am I missing?
source value:
[email protected], 2017/02/02
example output:
Output:
[email protected],2017-02-02 00:00:00
Thanks
If the default formatting isn't what you want, then use the date object's
strftime()
method to let it know how you want it formatted:
>>> import datetime
>>> source = "2017/02/02"
>>> date = datetime.datetime.strptime(source, "%Y/%m/%d")
>>> str(date)
'2017-02-02 00:00:00'
>>> date.strftime("%Y-%m-%d")
'2017-02-02'
str
ptime:
parses a date from a string, and
str
ftime:
formats that date into a string.
The names are kind of terrible, but they're copied over almost exactly from the underlying C. There's a few third party date libraries that make it nicer to work with, but for your use case, this is probably Good Enough.
Got it thanks, your explanation is much clearer than doc ;)
Sometimes, the problem with docs is that they're written by people who already know what it does and how it works, so phrasing it in a way that helps someone new to it is hard for them :p
I've started using this method for some rather large data sets to compare; 1st list 178k rows, 2nd 1.5million.
Performance is pretty bad, is there a better way of approaching this? I've started looking at Numpy but would some basic steps like list sort help?
You could store the lookup file values in a dictionary like so:
import csv
import datetime
csv1 = ('c:\data\opt\csv1.csv')
csv2 = ('c:\data\opt\csv2.csv')
source = csv.reader(open(csv1,'rb'))
lookup = csv.reader(open(csv2,'rb'))
datefmt = "%d/%m/%y"
def infilter(reader):
strptime = datetime.datetime.strptime
for key, stm in reader:
yield (key, strptime(stm, datefmt))
def make_lookdict(seq):
lookdict = {}
for key, tm in seq:
if key in lookdict and lookdict[key] >= tm:
continue
else:
lookdict[key] = tm
return lookdict
def selected(lookdict, seq):
for key, tm in seq:
if key in lookdict and lookdict[key] < tm:
yield (key, tm)
look = make_lookdict(infilter(lookup))
for key, tm in selected(look, infilter(source)):
print(key, tm.strftime(datefmt))