How to match two CSV files

timlamont · Sep-29-2019, 08:25 AM

I'm new to Python from VisualBasic, so excuse my basic question.

I am trying to match two CSV files, based on the data in columns P1-P5:

CSV#1:
Header Row1 = DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5
Row2 = data1, data2, data3, 1,2,3,4,5
etc...

CSV#2:
Header Row1 = ColName1, ColName2, ColName3, P1, P2, P3, P4, P5
Row2 = ea1, ea2, ea3, 3,5,6,2,1
Row3 = db1, db2, db3, 1,2,3,4,5
etc...

I would like:

1) To match each row in CSV1 against it's corresponding data in columns P1-P5 in CSV2

2) Combined these matched rows to form a new row in a new file, CSV#3:

eg:-
CSV#3:
Header Row1 = DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5, ColName1, ColName2, ColName3, P1, P2, P3, P4, P5
Row1 = data1, data2, data3, 1,2,3,4,5, db1, db2, db3, 1,2,3,4,5
etc...

I am lost how to do this!

Is there a way to do this as quick as possible, bearing in mind the CSV files are large (sometimes up to 99999 rows).

I found a module called "pandas" but it is beyond my current understanding.

Thank you very much :)

Axel_Erfurt · (This post was last modified: Sep-29-2019, 09:29 AM by Axel_Erfurt.)

read your files with pandas and then compare df1 and df2

pandas.DataFrame.equals

import pandas as pd

csv1 = "/path/to/csv1.csv"
csv2 = "/path/to/csv2.csv"

f1 = open(csv1, 'r')
with f1:
    df1 = pd.read_csv(f1, delimiter = ',', header=None)

f2 = open(csv2, 'r')
with f2:
    df2 = pd.read_csv(f2, delimiter = ',', header=None)

or use

from pandas.testing import assert_frame_equal
c = assert_frame_equal(df1, df2, check_dtype=False)
print(c)

timlamont · (This post was last modified: Sep-29-2019, 10:01 AM by timlamont.)

Thanks for your reply.

I used the code as mentioned (Python 2.7.15+ on ubuntu) and got this:

Error:Traceback (most recent call last):
  File "a.py", line 30, in <module>
    c = assert_frame_equal(df1, df2, check_dtype=False)
  File "/usr/local/lib/python2.7/dist-packages/pandas/util/testing.py", line 1502, in assert_frame_equal
    obj='DataFrame.iloc[:, {idx}]'.format(idx=i))
  File "/usr/local/lib/python2.7/dist-packages/pandas/util/testing.py", line 1336, in assert_series_equal
    obj='{obj}'.format(obj=obj))
  File "pandas/_libs/testing.pyx", line 66, in pandas._libs.testing.assert_almost_equal
  File "pandas/_libs/testing.pyx", line 180, in pandas._libs.testing.assert_almost_equal
  File "/usr/local/lib/python2.7/dist-packages/pandas/util/testing.py", line 1094, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame.iloc[:, 0] are different

DataFrame.iloc[:, 0] values are different (100.0 %)
[left]:  [DataCol1, data1, data1]
[right]: [ColName1, ea1, db1]

BTW - I'm not tied to using Pandas. If there is another simpler way, that's fine

**perfringo** · Sep-29-2019, 10:21 AM

Do you want match by row numbers i.e first data row in first file with first data row in second file and so on?

timlamont · Sep-29-2019, 10:39 AM

Match P1-P5 in the first row in CSV1 to P1-P5 in *any* row in CSV2
Then:
Match P1-P5 in the second row in CSV1 to P1-P5 in *any* row in CSV2

...and so on until all rows in CSV1 have been gone thru

**perfringo** · (This post was last modified: Sep-30-2019, 05:02 AM by perfringo.)

Are the column values in one row unique? Are all rows unique within one file?

EDIT: I see two relatively simple approaches: column values as dictionary keys or set. But in order to evaluate whether these datastructures are applicable one must understand the data in files.

**buran** · (This post was last modified: Sep-30-2019, 05:47 AM by buran.)

An alternative approach - upload both files in database (two tables) and query with JOIN ON the respective columns that should be equal to match records

DeaD_EyE · Sep-30-2019, 07:16 AM

I made at home an example, but it's over complicated for you.
I guess you won't understand the half of the code.

One thought afterwards was, to use the DictReader and merge the OrderedDictes per row of each file.
Then duplicated keys will disappear.

My first approach was to use a mask to select the fields for each file to merge.
After work I'll add the code.

DeaD_EyE · (This post was last modified: Oct-01-2019, 05:11 PM by DeaD_EyE.)

import io # fake file
import sys
import csv
from contextlib import ExitStack
from itertools import (
    chain,
    compress,
    repeat,
)


def combine_rows(rows, masks):
    return tuple(
        chain.from_iterable(
            compress(row, mask)
            for row, mask
            in zip(rows, masks)
        )
    )


def automatic_mask_gen(headers):
    iterator = iter(headers)
    selected = [*next(iterator)]
    masks = [tuple(repeat(1, len(selected[0])))]
    for header in iterator:
        mask = []
        for topic in header:
            if topic in selected:
                mask.append(0)
            else:
                mask.append(1)
                selected.append(topic)
        masks.append(tuple(mask))
    return masks


def automatic_sort_gen(flat_cols):
    return tuple(flat_cols.index(fls) for fls in sorted(flat_cols))


def sort_with_mask(cols, mask):
    return tuple(cols[idx] for idx in mask)


def strip_whitespace(rows):
    return [[col.strip() for col in cols] for cols in rows]


def combine_same_fields(csv_files, masks=None, delimiter=',', strip=True, autosort=True):
    if masks is not None and len(csv_files) != len(masks):
        raise ValueError('csv_files must be the same amout as masks.')
    with ExitStack() as stack:
        fds = [
            stack.enter_context(open(file)) if not isinstance(file, io.StringIO) else file
            for file in csv_files
            ]
        csv_iterators = [csv.reader(fd, delimiter=delimiter) for fd in fds]
        try:
            headers = [next(iterator) for iterator in csv_iterators]
        except StopIteration:
            raise Exception('Not able to get the header of one of the csv-files', file=sys.stdout)
        if strip:
            headers = strip_whitespace(headers)
        if masks is None:
            masks = automatic_mask_gen(headers)
        if autosort:
            combined_headers = combine_rows(headers, masks)
            sort_mask = automatic_sort_gen(combined_headers)
            yield sort_with_mask(combined_headers, sort_mask)
        else:
            yield combine_rows(headers, masks)
        for rows in zip(*csv_iterators):
            if strip:
                rows = strip_whitespace(rows)
            if autosort:
                yield sort_with_mask(combine_rows(rows, masks), sort_mask)
            else:
                yield combine_rows(rows, masks)

CSV_1 = io.StringIO("""
DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5
data1, data2, data3, 1,2,3,4,5
""".strip()) # fake file 1


CSV_2 = io.StringIO("""
ColName1, ColName2, ColName3, P1, P2, P3, P4, P5
ea1, ea2, ea3, 3,5,6,2,1
db1, db2, db3, 1,2,3,4,5
""".strip()) # fake file 2

# you can have more

# mask for 2 csv data files
masks = [(1,1,1, 1,1,1,1,1), (1,1,1, 0,0,0,0,0)]
# this is the resulting generator object
iterator = combine_same_fields([CSV_1, CSV_2], masks=None)
# open a output file in write mode, create the csv.writer
# object and iterate over the iterator which is the generator.
# For each combined row yielded from iterator,
# is written by the writer object with the method writerow
with open('output.csv', 'w') as fd:
    writer = csv.writer(fd, delimiter=',')
    for row in iterator:
        writer.writerow(row)

# done


# show stored data
# small example with dict csv reader
with open('output.csv') as fd:
    print(fd.read())
    fd.seek(0) # back to start of file
    # example with dict reader
    reader = csv.DictReader(fd)
    print('Fields of reader:', reader.fieldnames)
    for row in reader:
        print(row)

Reading my own code days later Huh

timlamont · Oct-01-2019, 05:54 PM

Hey DeaD_EyE , thanks for that code. I shall give it a go :)

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Move Files based on partial Match	mohamedsalih12	2	819	Sep-20-2023, 07:38 PM Last Post: snippsat
	Open and read multiple text files and match words	kozaizsvemira	3	6,747	Jul-07-2021, 11:27 AM Last Post: Larz60+
	Look for match in two files and print out in the first file	Batistuta	0	1,588	Mar-03-2020, 02:27 PM Last Post: Batistuta
	Compare two large CSV files for a match	Python_Newbie9	3	5,799	Apr-22-2019, 08:49 PM Last Post: ichabod801
	Match CSV files for difference	Cuz	4	3,531	Dec-18-2018, 02:16 PM Last Post: Cuz

How to match two CSV files

User Panel Messages

Announcements