Python Forum

Full Version: Compare 2 Csv data sets, identify record with latest date
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2


High level I'm trying to compare two sets of data 100k approx and identify matching records in the second file that have a later date. I can import Csv and lookup using sets but can't figure out how to consider the latest date. What is the best way to approach this? E.g

File 1
[email protected], 1/1/17
[email protected] 2/3/16

File 2
[email protected]
[email protected] 9/9/17

Record [email protected] 9/9/17 identified.

Thanks.
what have you tried so far?
show where you think the problems are located.
The easiest way to compare dates, is to have date objects and compare them.
>>> import time
>>> start = time.strptime('2/3/16', '%m/%d/%y')
>>> end = time.strptime('9/9/17', '%m/%d/%y')
>>> start < end
True
Cheers, got it, having a moment. Will try again tonight.
I got this working, thanks:
import csv
import datetime

csv1 = ('c:\data\opt\csv1.csv')
csv2 = ('c:\data\opt\csv2.csv')
source = csv.reader(open(csv1,'rb'))
lookup = csv.reader(open(csv2,'rb'))

for data in lookup:
    lkeml = data[0]
    lkdt = datetime.datetime.strptime(data[1],"%d/%m/%y")
    for data1 in source:
        srceml = data1[0]
        srcdt = datetime.datetime.strptime(data1[1],"%d/%m/%y")
        if lkeml in srceml and lkdt > srcdt:
            print lkeml+"," +str(lkdt)
It identifies matches but the date includes time stamp even though it's not included in the formatting. I've had a look but can't find a specific method just for date, what am I missing?

source value:
[email protected], 2017/02/02
example output:
Output:
[email protected],2017-02-02 00:00:00
Thanks
If the default formatting isn't what you want, then use the date object's strftime() method to let it know how you want it formatted:
>>> import datetime
>>> source = "2017/02/02"
>>> date = datetime.datetime.strptime(source, "%Y/%m/%d")
>>> str(date)
'2017-02-02 00:00:00'
>>> date.strftime("%Y-%m-%d")
'2017-02-02'
strptime: parses a date from a string, and
strftime: formats that date into a string.

The names are kind of terrible, but they're copied over almost exactly from the underlying C.  There's a few third party date libraries that make it nicer to work with, but for your use case, this is probably Good Enough™.
Got it thanks, your explanation is much clearer than doc ;)
Sometimes, the problem with docs is that they're written by people who already know what it does and how it works, so phrasing it in a way that helps someone new to it is hard for them :p
I've started using this method for some rather large data sets to compare; 1st list 178k rows, 2nd 1.5million.

Performance is pretty bad, is there a better way of approaching this? I've started looking at Numpy but would some basic steps like list sort help?
You could store the lookup file values in a dictionary like so:

import csv
import datetime

csv1 = ('c:\data\opt\csv1.csv')
csv2 = ('c:\data\opt\csv2.csv')
source = csv.reader(open(csv1,'rb'))
lookup = csv.reader(open(csv2,'rb'))

datefmt = "%d/%m/%y"

def infilter(reader):
    strptime = datetime.datetime.strptime
    for key, stm in reader:
        yield (key, strptime(stm, datefmt))

def make_lookdict(seq):
    lookdict = {}
    for key, tm in seq:
        if key in lookdict and lookdict[key] >= tm:
            continue
        else:
            lookdict[key] = tm
    return lookdict

def selected(lookdict, seq):
    for key, tm in seq:
        if key in lookdict and lookdict[key] < tm:
            yield (key, tm)

look = make_lookdict(infilter(lookup))
for key, tm in selected(look, infilter(source)):
    print(key, tm.strftime(datefmt))
Pages: 1 2