Posts: 4
Threads: 1
Joined: Sep 2019
I'm new to Python, coming from Visual Basic, so excuse my basic question.
I am trying to match two CSV files, based on the data in columns P1-P5:
CSV#1:
Header Row1 = DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5
Row2 = data1, data2, data3, 1,2,3,4,5
etc...
CSV#2:
Header Row1 = ColName1, ColName2, ColName3, P1, P2, P3, P4, P5
Row2 = ea1, ea2, ea3, 3,5,6,2,1
Row3 = db1, db2, db3, 1,2,3,4,5
etc...
I would like:
1) To match each row in CSV1 against its corresponding data in columns P1-P5 in CSV2
2) Combine these matched rows to form a new row in a new file, CSV#3:
eg:-
CSV#3:
Header Row1 = DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5, ColName1, ColName2, ColName3, P1, P2, P3, P4, P5
Row1 = data1, data2, data3, 1,2,3,4,5, db1, db2, db3, 1,2,3,4,5
etc...
I am lost as to how to do this!
Is there a way to do this as quickly as possible, bearing in mind the CSV files are large (sometimes up to 99999 rows)?
I found a module called "pandas" but it is beyond my current understanding.
Thank you very much :)
Posts: 1,023
Threads: 16
Joined: Dec 2016
Sep-29-2019, 09:24 AM
(This post was last modified: Sep-29-2019, 09:29 AM by Axel_Erfurt.)
Read your files with pandas and then compare df1 and df2:
pandas.DataFrame.equals
import pandas as pd

csv1 = "/path/to/csv1.csv"
csv2 = "/path/to/csv2.csv"

with open(csv1, 'r') as f1:
    df1 = pd.read_csv(f1, delimiter=',', header=None)

with open(csv2, 'r') as f2:
    df2 = pd.read_csv(f2, delimiter=',', header=None)
or use

from pandas.testing import assert_frame_equal

# raises AssertionError if the frames differ; returns None on success
c = assert_frame_equal(df1, df2, check_dtype=False)
print(c)
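For the actual row matching the OP asks about (rather than whole-frame equality), pandas can also merge on the key columns. A minimal sketch, assuming both files keep their header rows and the key columns are named P1-P5 in both; the `io.StringIO` objects stand in for the real files:

```python
import io

import pandas as pd

# small stand-ins for the real csv files
csv1 = io.StringIO(
    "DataCol1,DataCol2,DataCol3,P1,P2,P3,P4,P5\n"
    "data1,data2,data3,1,2,3,4,5\n"
)
csv2 = io.StringIO(
    "ColName1,ColName2,ColName3,P1,P2,P3,P4,P5\n"
    "ea1,ea2,ea3,3,5,6,2,1\n"
    "db1,db2,db3,1,2,3,4,5\n"
)

df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

# inner join: keep only rows whose P1-P5 values occur in both files
merged = df1.merge(df2, on=['P1', 'P2', 'P3', 'P4', 'P5'])
print(merged)
# merged.to_csv('csv3.csv', index=False) would write CSV#3
```

merge does a hash join internally, so this should stay fast even at 99999 rows.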
Posts: 4
Threads: 1
Joined: Sep 2019
Sep-29-2019, 09:53 AM
(This post was last modified: Sep-29-2019, 10:01 AM by timlamont.)
Thanks for your reply.
I used the code as suggested (Python 2.7.15+ on Ubuntu) and got this:
Error: Traceback (most recent call last):
  File "a.py", line 30, in <module>
    c = assert_frame_equal(df1, df2, check_dtype=False)
  File "/usr/local/lib/python2.7/dist-packages/pandas/util/testing.py", line 1502, in assert_frame_equal
    obj='DataFrame.iloc[:, {idx}]'.format(idx=i))
  File "/usr/local/lib/python2.7/dist-packages/pandas/util/testing.py", line 1336, in assert_series_equal
    obj='{obj}'.format(obj=obj))
  File "pandas/_libs/testing.pyx", line 66, in pandas._libs.testing.assert_almost_equal
  File "pandas/_libs/testing.pyx", line 180, in pandas._libs.testing.assert_almost_equal
  File "/usr/local/lib/python2.7/dist-packages/pandas/util/testing.py", line 1094, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame.iloc[:, 0] are different
DataFrame.iloc[:, 0] values are different (100.0 %)
[left]:  [DataCol1, data1, data1]
[right]: [ColName1, ea1, db1]
BTW - I'm not tied to using Pandas. If there is another simpler way, that's fine
Posts: 1,950
Threads: 8
Joined: Jun 2018
Do you want to match by row number, i.e. the first data row in the first file with the first data row in the second file, and so on?
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 4
Threads: 1
Joined: Sep 2019
Match P1-P5 in the first row in CSV1 to P1-P5 in *any* row in CSV2
Then:
Match P1-P5 in the second row in CSV1 to P1-P5 in *any* row in CSV2
...and so on until all rows in CSV1 have been gone through
Posts: 1,950
Threads: 8
Joined: Jun 2018
Sep-30-2019, 05:02 AM
(This post was last modified: Sep-30-2019, 05:02 AM by perfringo.)
Are the column values in one row unique? Are all rows unique within one file?
EDIT: I see two relatively simple approaches: column values as dictionary keys, or a set. But in order to evaluate whether these data structures are applicable, one must understand the data in the files.
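The dictionary-key idea can be sketched with the plain csv module, assuming the P1-P5 values are the last five columns of each file and that each P1-P5 combination occurs at most once in CSV2; the `io.StringIO` objects stand in for the real files:

```python
import csv
import io

# stand-ins for the two files; real code would use open(...)
csv1 = io.StringIO(
    "DataCol1,DataCol2,DataCol3,P1,P2,P3,P4,P5\n"
    "data1,data2,data3,1,2,3,4,5\n"
)
csv2 = io.StringIO(
    "ColName1,ColName2,ColName3,P1,P2,P3,P4,P5\n"
    "ea1,ea2,ea3,3,5,6,2,1\n"
    "db1,db2,db3,1,2,3,4,5\n"
)

reader2 = csv.reader(csv2)
next(reader2)  # skip header
# index every CSV2 row by its P1-P5 values (the last five columns)
lookup = {tuple(row[-5:]): row for row in reader2}

reader1 = csv.reader(csv1)
next(reader1)  # skip header
matched = []
for row in reader1:
    key = tuple(row[-5:])
    if key in lookup:  # O(1) per row instead of rescanning CSV2
        matched.append(row + lookup[key])
print(matched)
```

This makes one pass over each file, so even 99999 rows per file should only take a moment.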
Posts: 8,151
Threads: 160
Joined: Sep 2016
Sep-30-2019, 05:46 AM
(This post was last modified: Sep-30-2019, 05:47 AM by buran.)
An alternative approach: load both files into a database (two tables) and query with a JOIN ON the respective columns that should be equal, to match the records.
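A sketch of that idea with the stdlib sqlite3 module; the table and column names here are invented for the example:

```python
import sqlite3

con = sqlite3.connect(':memory:')  # a file path instead would persist the database
cur = con.cursor()
cur.execute("CREATE TABLE t1 (d1, d2, d3, p1, p2, p3, p4, p5)")
cur.execute("CREATE TABLE t2 (c1, c2, c3, p1, p2, p3, p4, p5)")

# in real code the rows would come from csv.reader via executemany
cur.execute("INSERT INTO t1 VALUES ('data1','data2','data3',1,2,3,4,5)")
cur.execute("INSERT INTO t2 VALUES ('ea1','ea2','ea3',3,5,6,2,1)")
cur.execute("INSERT INTO t2 VALUES ('db1','db2','db3',1,2,3,4,5)")

# JOIN ON the five columns that must be equal to match records
rows = cur.execute(
    "SELECT * FROM t1 JOIN t2 "
    "ON t1.p1=t2.p1 AND t1.p2=t2.p2 AND t1.p3=t2.p3 "
    "AND t1.p4=t2.p4 AND t1.p5=t2.p5"
).fetchall()
print(rows)
```

An index on the join columns would keep this quick for large files, and the matched rows can be written back out with csv.writer.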
Posts: 2,120
Threads: 10
Joined: May 2017
I made an example at home, but it may be overcomplicated for you; I guess you won't understand half of the code.
One afterthought was to use DictReader and merge the OrderedDicts per row of each file; duplicated keys would then disappear.
My first approach was to use a mask to select, for each file, the fields to merge.
After work I'll add the code.
Posts: 2,120
Threads: 10
Joined: May 2017
Oct-01-2019, 05:11 PM
(This post was last modified: Oct-01-2019, 05:11 PM by DeaD_EyE.)
import io  # fake file
import csv
from contextlib import ExitStack
from itertools import (
    chain,
    compress,
    repeat,
)


def combine_rows(rows, masks):
    return tuple(
        chain.from_iterable(
            compress(row, mask)
            for row, mask in zip(rows, masks)
        )
    )


def automatic_mask_gen(headers):
    iterator = iter(headers)
    selected = [*next(iterator)]
    masks = [tuple(repeat(1, len(selected)))]
    for header in iterator:
        mask = []
        for topic in header:
            if topic in selected:
                mask.append(0)
            else:
                mask.append(1)
                selected.append(topic)
        masks.append(tuple(mask))
    return masks


def automatic_sort_gen(flat_cols):
    return tuple(flat_cols.index(fls) for fls in sorted(flat_cols))


def sort_with_mask(cols, mask):
    return tuple(cols[idx] for idx in mask)


def strip_whitespace(rows):
    return [[col.strip() for col in cols] for cols in rows]


def combine_same_fields(csv_files, masks=None, delimiter=',', strip=True, autosort=True):
    if masks is not None and len(csv_files) != len(masks):
        raise ValueError('csv_files must be the same amount as masks.')
    with ExitStack() as stack:
        fds = [
            stack.enter_context(open(file)) if not isinstance(file, io.StringIO) else file
            for file in csv_files
        ]
        csv_iterators = [csv.reader(fd, delimiter=delimiter) for fd in fds]
        try:
            headers = [next(iterator) for iterator in csv_iterators]
        except StopIteration:
            raise Exception('Not able to get the header of one of the csv-files')
        if strip:
            headers = strip_whitespace(headers)
        if masks is None:
            masks = automatic_mask_gen(headers)
        if autosort:
            combined_headers = combine_rows(headers, masks)
            sort_mask = automatic_sort_gen(combined_headers)
            yield sort_with_mask(combined_headers, sort_mask)
        else:
            yield combine_rows(headers, masks)
        for rows in zip(*csv_iterators):
            if strip:
                rows = strip_whitespace(rows)
            if autosort:
                yield sort_with_mask(combine_rows(rows, masks), sort_mask)
            else:
                yield combine_rows(rows, masks)


CSV_1 = io.StringIO("""
DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5
data1, data2, data3, 1,2,3,4,5
""".strip())  # fake file 1

CSV_2 = io.StringIO("""
ColName1, ColName2, ColName3, P1, P2, P3, P4, P5
ea1, ea2, ea3, 3,5,6,2,1
db1, db2, db3, 1,2,3,4,5
""".strip())  # fake file 2
# you can have more

# mask for 2 csv data files
masks = [(1,1,1, 1,1,1,1,1), (1,1,1, 0,0,0,0,0)]

# this is the resulting generator object
iterator = combine_same_fields([CSV_1, CSV_2], masks=None)

# open an output file in write mode, create the csv.writer
# object and iterate over the generator; each combined row
# it yields is written with the writer's writerow method
with open('output.csv', 'w') as fd:
    writer = csv.writer(fd, delimiter=',')
    for row in iterator:
        writer.writerow(row)
# done

# show stored data, then a small example with csv.DictReader
with open('output.csv') as fd:
    print(fd.read())
    fd.seek(0)  # back to start of file
    reader = csv.DictReader(fd)
    print('Fields of reader:', reader.fieldnames)
    for row in reader:
        print(row)

Reading my own code days later
Posts: 4
Threads: 1
Joined: Sep 2019
Hey DeaD_EyE , thanks for that code. I shall give it a go :)