![]() |
How to match two CSV files - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: How to match two CSV files (/thread-21417.html) |
How to match two CSV files - timlamont - Sep-29-2019 I'm new to Python from VisualBasic, so excuse my basic question. I am trying to match two CSV files, based on the data in columns P1-P5: CSV#1: Header Row1 = DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5 Row2 = data1, data2, data3, 1,2,3,4,5 etc... CSV#2: Header Row1 = ColName1, ColName2, ColName3, P1, P2, P3, P4, P5 Row2 = ea1, ea2, ea3, 3,5,6,2,1 Row3 = db1, db2, db3, 1,2,3,4,5 etc... I would like: 1) To match each row in CSV1 against it's corresponding data in columns P1-P5 in CSV2 2) Combined these matched rows to form a new row in a new file, CSV#3: eg:- CSV#3: Header Row1 = DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5, ColName1, ColName2, ColName3, P1, P2, P3, P4, P5 Row1 = data1, data2, data3, 1,2,3,4,5, db1, db2, db3, 1,2,3,4,5 etc... I am lost how to do this! Is there a way to do this as quick as possible, bearing in mind the CSV files are large (sometimes up to 99999 rows). I found a module called "pandas" but it is beyond my current understanding. Thank you very much :) RE: How to match two CSV files - Axel_Erfurt - Sep-29-2019 read your files with pandas and then compare df1 and df2 pandas.DataFrame.equals import pandas as pd csv1 = "/path/to/csv1.csv" csv2 = "/path/to/csv2.csv" f1 = open(csv1, 'r') with f1: df1 = pd.read_csv(f1, delimiter = ',', header=None) f2 = open(csv2, 'r') with f2: df2 = pd.read_csv(f2, delimiter = ',', header=None) or use from pandas.testing import assert_frame_equal c = assert_frame_equal(df1, df2, check_dtype=False) print(c) RE: How to match two CSV files - timlamont - Sep-29-2019 Thanks for your reply. I used the code as mentioned (Python 2.7.15+ on ubuntu) and got this:
BTW - I'm not tied to using Pandas. If there is another simpler way, that's fine RE: How to match two CSV files - perfringo - Sep-29-2019 Do you want match by row numbers i.e first data row in first file with first data row in second file and so on? RE: How to match two CSV files - timlamont - Sep-29-2019 Match P1-P5 in the first row in CSV1 to P1-P5 in *any* row in CSV2 Then: Match P1-P5 in the second row in CSV1 to P1-P5 in *any* row in CSV2 ...and so on until all rows in CSV1 have been gone thru RE: How to match two CSV files - perfringo - Sep-30-2019 Are the column values in one row unique? Are all rows unique within one file? EDIT: I see two relatively simple approaches: column values as dictionary keys or set. But in order to evaluate whether these datastructures are applicable one must understand the data in files. RE: How to match two CSV files - buran - Sep-30-2019 An alternative approach - upload both files in database (two tables) and query with JOIN ON the respective columns that should be equal to match records RE: How to match two CSV files - DeaD_EyE - Sep-30-2019 I made at home an example, but it's over complicated for you. I guess you won't understand the half of the code. One thought afterwards was, to use the DictReader and merge the OrderedDictes per row of each file. Then duplicated keys will disappear. My first approach was to use a mask to select the fields for each file to merge. After work I'll add the code. RE: How to match two CSV files - DeaD_EyE - Oct-01-2019 import io # fake file import sys import csv from contextlib import ExitStack from itertools import ( chain, compress, repeat, ) def combine_rows(rows, masks): return tuple( chain.from_iterable( compress(row, mask) for row, mask in zip(rows, masks) ) ) def automatic_mask_gen(headers): iterator = iter(headers) selected = [*next(iterator)] masks = [tuple(repeat(1, len(selected[0])))] for header in iterator: mask = [] for topic in header: if topic in selected: mask.append(0) else: mask.append(1) selected.append(topic) masks.append(tuple(mask)) return masks def automatic_sort_gen(flat_cols): return tuple(flat_cols.index(fls) for fls in sorted(flat_cols)) def sort_with_mask(cols, mask): return tuple(cols[idx] for idx in mask) def strip_whitespace(rows): return [[col.strip() for col in cols] for cols in rows] def combine_same_fields(csv_files, masks=None, delimiter=',', strip=True, autosort=True): if masks is not None and len(csv_files) != len(masks): raise ValueError('csv_files must be the same amout as masks.') with ExitStack() as stack: fds = [ stack.enter_context(open(file)) if not isinstance(file, io.StringIO) else file for file in csv_files ] csv_iterators = [csv.reader(fd, delimiter=delimiter) for fd in fds] try: headers = [next(iterator) for iterator in csv_iterators] except StopIteration: raise Exception('Not able to get the header of one of the csv-files', file=sys.stdout) if strip: headers = strip_whitespace(headers) if masks is None: masks = automatic_mask_gen(headers) if autosort: combined_headers = combine_rows(headers, masks) sort_mask = automatic_sort_gen(combined_headers) yield sort_with_mask(combined_headers, sort_mask) else: yield combine_rows(headers, masks) for rows in zip(*csv_iterators): if strip: rows = strip_whitespace(rows) if autosort: yield sort_with_mask(combine_rows(rows, masks), sort_mask) else: yield combine_rows(rows, masks) CSV_1 = io.StringIO(""" DataCol1, DataCol2, DataCol3, P1, P2, P3, P4, P5 data1, data2, data3, 1,2,3,4,5 """.strip()) # fake file 1 CSV_2 = io.StringIO(""" ColName1, ColName2, ColName3, P1, P2, P3, P4, P5 ea1, ea2, ea3, 3,5,6,2,1 db1, db2, db3, 1,2,3,4,5 """.strip()) # fake file 2 # you can have more # mask for 2 csv data files masks = [(1,1,1, 1,1,1,1,1), (1,1,1, 0,0,0,0,0)] # this is the resulting generator object iterator = combine_same_fields([CSV_1, CSV_2], masks=None) # open a output file in write mode, create the csv.writer # object and iterate over the iterator which is the generator. # For each combined row yielded from iterator, # is written by the writer object with the method writerow with open('output.csv', 'w') as fd: writer = csv.writer(fd, delimiter=',') for row in iterator: writer.writerow(row) # done # show stored data # small example with dict csv reader with open('output.csv') as fd: print(fd.read()) fd.seek(0) # back to start of file # example with dict reader reader = csv.DictReader(fd) print('Fields of reader:', reader.fieldnames) for row in reader: print(row)Reading my own code days later ![]() ![]() ![]() RE: How to match two CSV files - timlamont - Oct-01-2019 Hey DeaD_EyE , thanks for that code. I shall give it a go :) |