Posts: 7
Threads: 1
Joined: Dec 2017
Dec-02-2017, 09:03 PM
(This post was last modified: Dec-02-2017, 09:06 PM by MJUk.)
High level I'm trying to compare two sets of data 100k approx and identify matching records in the second file that have a later date. I can import Csv and lookup using sets but can't figure out how to consider the latest date. What is the best way to approach this? E.g
File 1
[email protected], 1/1/17
[email protected] 2/3/16
File 2
[email protected]
[email protected] 9/9/17
Record [email protected] 9/9/17 identified.
Thanks.
Posts: 11,872
Threads: 474
Joined: Sep 2016
Dec-03-2017, 01:34 AM
(This post was last modified: Dec-03-2017, 01:34 AM by Larz60+.)
what have you tried so far?
show where you think the problems are located.
Posts: 3,458
Threads: 101
Joined: Sep 2016
The easiest way to compare dates, is to have date objects and compare them. >>> import time
>>> start = time.strptime('2/3/16', '%m/%d/%y')
>>> end = time.strptime('9/9/17', '%m/%d/%y')
>>> start < end
True
Posts: 7
Threads: 1
Joined: Dec 2017
Cheers, got it, having a moment. Will try again tonight.
Posts: 7
Threads: 1
Joined: Dec 2017
Dec-04-2017, 01:04 PM
(This post was last modified: Jan-03-2018, 10:49 PM by snippsat.)
I got this working, thanks:
import csv
import datetime
csv1 = ('c:\data\opt\csv1.csv')
csv2 = ('c:\data\opt\csv2.csv')
source = csv.reader(open(csv1,'rb'))
lookup = csv.reader(open(csv2,'rb'))
for data in lookup:
lkeml = data[0]
lkdt = datetime.datetime.strptime(data[1],"%d/%m/%y")
for data1 in source:
srceml = data1[0]
srcdt = datetime.datetime.strptime(data1[1],"%d/%m/%y")
if lkeml in srceml and lkdt > srcdt:
print lkeml+"," +str(lkdt) It identifies matches but the date includes time stamp even though it's not included in the formatting. I've had a look but can't find a specific method just for date, what am I missing?
source value:
[email protected], 2017/02/02 example output:
Output: [email protected],2017-02-02 00:00:00
Thanks
Posts: 3,458
Threads: 101
Joined: Sep 2016
Dec-04-2017, 05:09 PM
(This post was last modified: Dec-04-2017, 05:09 PM by nilamo.)
If the default formatting isn't what you want, then use the date object's strftime() method to let it know how you want it formatted:
>>> import datetime
>>> source = "2017/02/02"
>>> date = datetime.datetime.strptime(source, "%Y/%m/%d")
>>> str(date)
'2017-02-02 00:00:00'
>>> date.strftime("%Y-%m-%d")
'2017-02-02' str ptime: parses a date from a string, and
str ftime: formats that date into a string.
The names are kind of terrible, but they're copied over almost exactly from the underlying C. There's a few third party date libraries that make it nicer to work with, but for your use case, this is probably Good Enough.
Posts: 7
Threads: 1
Joined: Dec 2017
Got it thanks, your explanation is much clearer than doc ;)
Posts: 3,458
Threads: 101
Joined: Sep 2016
Sometimes, the problem with docs is that they're written by people who already know what it does and how it works, so phrasing it in a way that helps someone new to it is hard for them :p
Posts: 7
Threads: 1
Joined: Dec 2017
I've started using this method for some rather large data sets to compare; 1st list 178k rows, 2nd 1.5million.
Performance is pretty bad, is there a better way of approaching this? I've started looking at Numpy but would some basic steps like list sort help?
Posts: 4,488
Threads: 69
Joined: Jan 2018
Jan-03-2018, 11:20 PM
(This post was last modified: Jan-03-2018, 11:20 PM by Gribouillis.)
You could store the lookup file values in a dictionary like so:
import csv
import datetime
csv1 = ('c:\data\opt\csv1.csv')
csv2 = ('c:\data\opt\csv2.csv')
source = csv.reader(open(csv1,'rb'))
lookup = csv.reader(open(csv2,'rb'))
datefmt = "%d/%m/%y"
def infilter(reader):
strptime = datetime.datetime.strptime
for key, stm in reader:
yield (key, strptime(stm, datefmt))
def make_lookdict(seq):
lookdict = {}
for key, tm in seq:
if key in lookdict and lookdict[key] >= tm:
continue
else:
lookdict[key] = tm
return lookdict
def selected(lookdict, seq):
for key, tm in seq:
if key in lookdict and lookdict[key] < tm:
yield (key, tm)
look = make_lookdict(infilter(lookup))
for key, tm in selected(look, infilter(source)):
print(key, tm.strftime(datefmt))
|