Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
How to match two CSV files
#1
I'm new to Python from VisualBasic, so excuse my basic question.

I am trying to match two CSV files, based on the data in columns P1-P5:

CSV#1:
Header Row1 = DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5
Row2 = data1, data2, data3, 1,2,3,4,5
etc...

CSV#2:
Header Row1 = ColName1, ColName2, ColName3, P1, P2, P3, P4, P5
Row2 = ea1, ea2, ea3, 3,5,6,2,1
Row3 = db1, db2, db3, 1,2,3,4,5
etc...

I would like:

1) To match each row in CSV1 against it's corresponding data in columns P1-P5 in CSV2

2) Combined these matched rows to form a new row in a new file, CSV#3:

eg:-
CSV#3:
Header Row1 = DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5, ColName1, ColName2, ColName3, P1, P2, P3, P4, P5
Row1 = data1, data2, data3, 1,2,3,4,5, db1, db2, db3, 1,2,3,4,5
etc...

I am lost how to do this!

Is there a way to do this as quick as possible, bearing in mind the CSV files are large (sometimes up to 99999 rows).

I found a module called "pandas" but it is beyond my current understanding.

Thank you very much :)
Reply
#2
read your files with pandas and then compare df1 and df2

pandas.DataFrame.equals

import pandas as pd

csv1 = "/path/to/csv1.csv"
csv2 = "/path/to/csv2.csv"

f1 = open(csv1, 'r')
with f1:
    df1 = pd.read_csv(f1, delimiter = ',', header=None)

f2 = open(csv2, 'r')
with f2:
    df2 = pd.read_csv(f2, delimiter = ',', header=None)

or use

from pandas.testing import assert_frame_equal
c = assert_frame_equal(df1, df2, check_dtype=False)
print(c)
Reply
#3
Thanks for your reply.

I used the code as mentioned (Python 2.7.15+ on ubuntu) and got this:

Error:
Traceback (most recent call last): File "a.py", line 30, in <module> c = assert_frame_equal(df1, df2, check_dtype=False) File "/usr/local/lib/python2.7/dist-packages/pandas/util/testing.py", line 1502, in assert_frame_equal obj='DataFrame.iloc[:, {idx}]'.format(idx=i)) File "/usr/local/lib/python2.7/dist-packages/pandas/util/testing.py", line 1336, in assert_series_equal obj='{obj}'.format(obj=obj)) File "pandas/_libs/testing.pyx", line 66, in pandas._libs.testing.assert_almost_equal File "pandas/_libs/testing.pyx", line 180, in pandas._libs.testing.assert_almost_equal File "/usr/local/lib/python2.7/dist-packages/pandas/util/testing.py", line 1094, in raise_assert_detail raise AssertionError(msg) AssertionError: DataFrame.iloc[:, 0] are different DataFrame.iloc[:, 0] values are different (100.0 %) [left]: [DataCol1, data1, data1] [right]: [ColName1, ea1, db1]

BTW - I'm not tied to using Pandas. If there is another simpler way, that's fine
Reply
#4
Do you want match by row numbers i.e first data row in first file with first data row in second file and so on?
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#5
Match P1-P5 in the first row in CSV1 to P1-P5 in *any* row in CSV2
Then:
Match P1-P5 in the second row in CSV1 to P1-P5 in *any* row in CSV2

...and so on until all rows in CSV1 have been gone thru
Reply
#6
Are the column values in one row unique? Are all rows unique within one file?

EDIT: I see two relatively simple approaches: column values as dictionary keys or set. But in order to evaluate whether these datastructures are applicable one must understand the data in files.
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#7
An alternative approach - upload both files in database (two tables) and query with JOIN ON the respective columns that should be equal to match records
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#8
I made at home an example, but it's over complicated for you.
I guess you won't understand the half of the code.

One thought afterwards was, to use the DictReader and merge the OrderedDictes per row of each file.
Then duplicated keys will disappear.

My first approach was to use a mask to select the fields for each file to merge.
After work I'll add the code.
My code examples are always for Python >=3.6.0
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#9
import io # fake file
import sys
import csv
from contextlib import ExitStack
from itertools import (
    chain,
    compress,
    repeat,
)


def combine_rows(rows, masks):
    return tuple(
        chain.from_iterable(
            compress(row, mask)
            for row, mask
            in zip(rows, masks)
        )
    )


def automatic_mask_gen(headers):
    iterator = iter(headers)
    selected = [*next(iterator)]
    masks = [tuple(repeat(1, len(selected[0])))]
    for header in iterator:
        mask = []
        for topic in header:
            if topic in selected:
                mask.append(0)
            else:
                mask.append(1)
                selected.append(topic)
        masks.append(tuple(mask))
    return masks


def automatic_sort_gen(flat_cols):
    return tuple(flat_cols.index(fls) for fls in sorted(flat_cols))


def sort_with_mask(cols, mask):
    return tuple(cols[idx] for idx in mask)


def strip_whitespace(rows):
    return [[col.strip() for col in cols] for cols in rows]


def combine_same_fields(csv_files, masks=None, delimiter=',', strip=True, autosort=True):
    if masks is not None and len(csv_files) != len(masks):
        raise ValueError('csv_files must be the same amout as masks.')
    with ExitStack() as stack:
        fds = [
            stack.enter_context(open(file)) if not isinstance(file, io.StringIO) else file
            for file in csv_files
            ]
        csv_iterators = [csv.reader(fd, delimiter=delimiter) for fd in fds]
        try:
            headers = [next(iterator) for iterator in csv_iterators]
        except StopIteration:
            raise Exception('Not able to get the header of one of the csv-files', file=sys.stdout)
        if strip:
            headers = strip_whitespace(headers)
        if masks is None:
            masks = automatic_mask_gen(headers)
        if autosort:
            combined_headers = combine_rows(headers, masks)
            sort_mask = automatic_sort_gen(combined_headers)
            yield sort_with_mask(combined_headers, sort_mask)
        else:
            yield combine_rows(headers, masks)
        for rows in zip(*csv_iterators):
            if strip:
                rows = strip_whitespace(rows)
            if autosort:
                yield sort_with_mask(combine_rows(rows, masks), sort_mask)
            else:
                yield combine_rows(rows, masks)

CSV_1 = io.StringIO("""
DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5
data1, data2, data3, 1,2,3,4,5
""".strip()) # fake file 1


CSV_2 = io.StringIO("""
ColName1, ColName2, ColName3, P1, P2, P3, P4, P5
ea1, ea2, ea3, 3,5,6,2,1
db1, db2, db3, 1,2,3,4,5
""".strip()) # fake file 2

# you can have more

# mask for 2 csv data files
masks = [(1,1,1, 1,1,1,1,1), (1,1,1, 0,0,0,0,0)]
# this is the resulting generator object
iterator = combine_same_fields([CSV_1, CSV_2], masks=None)
# open a output file in write mode, create the csv.writer
# object and iterate over the iterator which is the generator.
# For each combined row yielded from iterator,
# is written by the writer object with the method writerow
with open('output.csv', 'w') as fd:
    writer = csv.writer(fd, delimiter=',')
    for row in iterator:
        writer.writerow(row)

# done


# show stored data
# small example with dict csv reader
with open('output.csv') as fd:
    print(fd.read())
    fd.seek(0) # back to start of file
    # example with dict reader
    reader = csv.DictReader(fd)
    print('Fields of reader:', reader.fieldnames)
    for row in reader:
        print(row)
Reading my own code days later Huh Big Grin Big Grin
My code examples are always for Python >=3.6.0
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#10
Hey DeaD_EyE , thanks for that code. I shall give it a go :)
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Look for match in two files and print out in the first file Batistuta 0 418 Mar-03-2020, 02:27 PM
Last Post: Batistuta
  Open and read multiple text files and match words kozaizsvemira 2 2,547 Sep-11-2019, 12:58 PM
Last Post: kozaizsvemira
  Compare two large CSV files for a match Python_Newbie9 3 2,682 Apr-22-2019, 08:49 PM
Last Post: ichabod801
  Match CSV files for difference Cuz 4 1,309 Dec-18-2018, 02:16 PM
Last Post: Cuz

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020