Python Forum
Compare 2 Csv data sets, identify record with latest date
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Compare 2 Csv data sets, identify record with latest date
#1


High level I'm trying to compare two sets of data 100k approx and identify matching records in the second file that have a later date. I can import Csv and lookup using sets but can't figure out how to consider the latest date. What is the best way to approach this? E.g

File 1
[email protected], 1/1/17
[email protected] 2/3/16

File 2
[email protected]
[email protected] 9/9/17

Record [email protected] 9/9/17 identified.

Thanks.
Reply
#2
what have you tried so far?
show where you think the problems are located.
Reply
#3
The easiest way to compare dates, is to have date objects and compare them.
>>> import time
>>> start = time.strptime('2/3/16', '%m/%d/%y')
>>> end = time.strptime('9/9/17', '%m/%d/%y')
>>> start < end
True
Reply
#4
Cheers, got it, having a moment. Will try again tonight.
Reply
#5
I got this working, thanks:
import csv
import datetime

csv1 = ('c:\data\opt\csv1.csv')
csv2 = ('c:\data\opt\csv2.csv')
source = csv.reader(open(csv1,'rb'))
lookup = csv.reader(open(csv2,'rb'))

for data in lookup:
    lkeml = data[0]
    lkdt = datetime.datetime.strptime(data[1],"%d/%m/%y")
    for data1 in source:
        srceml = data1[0]
        srcdt = datetime.datetime.strptime(data1[1],"%d/%m/%y")
        if lkeml in srceml and lkdt > srcdt:
            print lkeml+"," +str(lkdt)
It identifies matches but the date includes time stamp even though it's not included in the formatting. I've had a look but can't find a specific method just for date, what am I missing?

source value:
[email protected], 2017/02/02
example output:
Output:
[email protected],2017-02-02 00:00:00
Thanks
Reply
#6
If the default formatting isn't what you want, then use the date object's strftime() method to let it know how you want it formatted:
>>> import datetime
>>> source = "2017/02/02"
>>> date = datetime.datetime.strptime(source, "%Y/%m/%d")
>>> str(date)
'2017-02-02 00:00:00'
>>> date.strftime("%Y-%m-%d")
'2017-02-02'
strptime: parses a date from a string, and
strftime: formats that date into a string.

The names are kind of terrible, but they're copied over almost exactly from the underlying C.  There's a few third party date libraries that make it nicer to work with, but for your use case, this is probably Good Enough™.
Reply
#7
Got it thanks, your explanation is much clearer than doc ;)
Reply
#8
Sometimes, the problem with docs is that they're written by people who already know what it does and how it works, so phrasing it in a way that helps someone new to it is hard for them :p
Reply
#9
I've started using this method for some rather large data sets to compare; 1st list 178k rows, 2nd 1.5million.

Performance is pretty bad, is there a better way of approaching this? I've started looking at Numpy but would some basic steps like list sort help?
Reply
#10
You could store the lookup file values in a dictionary like so:

import csv
import datetime

csv1 = ('c:\data\opt\csv1.csv')
csv2 = ('c:\data\opt\csv2.csv')
source = csv.reader(open(csv1,'rb'))
lookup = csv.reader(open(csv2,'rb'))

datefmt = "%d/%m/%y"

def infilter(reader):
    strptime = datetime.datetime.strptime
    for key, stm in reader:
        yield (key, strptime(stm, datefmt))

def make_lookdict(seq):
    lookdict = {}
    for key, tm in seq:
        if key in lookdict and lookdict[key] >= tm:
            continue
        else:
            lookdict[key] = tm
    return lookdict

def selected(lookdict, seq):
    for key, tm in seq:
        if key in lookdict and lookdict[key] < tm:
            yield (key, tm)

look = make_lookdict(infilter(lookup))
for key, tm in selected(look, infilter(source)):
    print(key, tm.strftime(datefmt))
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  How to compare boxplot bw many data frames? vanphuht91 5 1,016 Jun-01-2023, 11:51 AM
Last Post: Larz60+
  Training a model to identify specific SMS types and extract relevant data? lord_of_cinder 0 975 Oct-10-2022, 04:35 AM
Last Post: lord_of_cinder
  replace sets of values in an array without using loops paul18fr 7 1,716 Jun-20-2022, 08:15 PM
Last Post: paul18fr
  Data sets comparison Fraetos 0 1,415 Sep-14-2021, 06:45 AM
Last Post: Fraetos
  Mann Whitney U-test on several data sets rybina 2 2,094 Jan-05-2021, 03:08 PM
Last Post: rybina
  Least-squares fit multiple data sets multiverse22 1 2,254 Jun-06-2020, 01:38 AM
Last Post: Larz60+
  How Do I Only Get the Year from Date and Isolate Data for Year? WhatsupSmiley 2 2,176 Apr-14-2020, 11:45 AM
Last Post: snippsat
  replace nan values by mean group by date.year, date.month wissam1974 5 8,449 Feb-19-2020, 06:25 PM
Last Post: AnkitGupta
  I am trying to change the value of an element in a record array ingu 1 2,171 Jan-14-2020, 01:30 PM
Last Post: perfringo
  Clustering for imbalanced data sets dervast 0 1,612 Sep-25-2019, 06:34 AM
Last Post: dervast

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020