Python Forum

I have a lot of csv files with data and about 30 possible known headers, Header1 to Header30. The number of rows can be practically anything, from 1 to thousands. Not all the headers are present in each file. Seperator is a semicolon.

An example is Header1;Header3;Header8;Header10;Header11;Header17;Header18

Some of the headers have to be combined to one with a new headername.

The columns with Header1 to Header3, if one of them exists, should always be combined in column NewHeaderA devided by spaces.
The column with Header4 becomes column NewHeaderB
The column with Header5 becomes column NewHeaderC
The columns with Header6 to Header14, if one of them exists, should always be combined in column NewHeaderD devided by spaces.
And so on in more combinations for the rest of the headers.

So far I've made a script that reads a csv and writes out another file with all the same headers or part of the existing headers in a different order. After that I'm stuck. Some help would be appreciated.

Due to some system limitations, pandas or numpy are not installed. Just plain Python 3.7.6

import sys
inFile = sys.argv[1]
outFile = sys.argv[2]

# open csv file
csvfile = open(inFile, "r" )
reader = csv.DictReader(csvfile, delimiter=';')

# open output file
outfile = open(outFile, "w" )

# fieldnames=reader.fieldnames  --> uses all columns
fieldnames = ["Header1","Header4","Header5","Header8","Header19","test"] # --> uses the specific columns in the order provided, test becomes an emty column

# write the output to a file
writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {} 
for n in writer.fieldnames:
    headers[n] = n
writer.writerow(headers)
for row in reader:
    writer.writerow(row)

csvfile.close()
outfile.close()

What you could do is manipulate the rows at input by defining a 'merging plan' depending on the input headers. For example suppose that the input file has the headers ['header2', 'header5', 'header3', 'spam', 'header10', 'header13']. Then the merging plan would be: combine header2 and header3 to make a column newheaderA, change the column header5 into a column newheaderC, leave the spam column unchanged and combine header10 and header13 into a newheaderD. This plan can be represented by the python list

[('newheaderA', ['header2', 'header3']), ('newheaderC', ['header5']), ('spam', ['spam']), ('newheaderD', ['header10', 'header13'])]

.

The following code shows how one can automatically compute the merging plan from the input headers and how one can transform the input rows according to this plan. After that you can output the new rows as if they were the actual input rows, which you already know how to do.

I'm using the function more_itertools.unique_everseen(). If you don't want to import more_itertools, you can simply copy the implementation of unique_everseen that is given at the end of the official documentation page of module itertools.

from more_itertools import unique_everseen

rules = [
    ('newheaderA', ['header1', 'header2', 'header3']),
    ('newheaderB', ['header4']),
    ('newheaderC', ['header5']),
    ('newheaderD', ['header6', 'header7', 'header8',
                    'header9', 'header10', 'header11',
                    'header12', 'header13', 'header14']),
]

inverse_rules = { old: new for new, olds in rules for old in olds}
drules = dict(rules)

def merging_plan(headers):
    headers = list(headers)
    news = list(unique_everseen(inverse_rules.get(h, h) for h in headers))
    s = set(headers)
    plan = []
    for new in news:
        plan.append((new, [old for old in drules.get(new, [new]) if old in s]))
    return plan

def merge(plan, row):
    return {k: ' '.join(row[x] for x in v) for k, v in plan}

def main():
    # compute the merging plan for a given sequence of input headers
    headers = ['header2', 'header5', 'header3', 'spam', 'header10', 'header13']
    plan = merging_plan(headers)
    print(plan)
    
    # compute the merged row corresponding to an input row
    r = {'header2': 'v2', 'header5': 'v5',
         'header3': 'v3', 'spam': 'vspam',
         'header10': 'v10', 'header13': 'v13',}
    print(merge(plan, r))

if __name__ == '__main__':
    main()

Output:[('newheaderA', ['header2', 'header3']), ('newheaderC', ['header5']), ('spam', ['spam']), ('newheaderD', ['header10', 'header13'])]
{'spam': 'vspam', 'newheaderD': 'v10 v13', 'newheaderC': 'v5', 'newheaderA': 'v2 v3'}

Hello Gribouillis,
Thanks for your reply, this helped a lot already. Im getting the idea of the replacement plan, exactly what I meant.
I edited my code with yours, and it now reads the headers of my input file and puts them in the plan.
I can't get the merging part to work. De merged columns should be written to the outputfile
(Yes, I'm a python beginner :) maybe to difficult things to start with, but I learn the most of the practice)
On rule 70: shouldn't that be a for loop, to run through all the rows?
On rule 83: I think that the merging should be done here, but how.
Code after this is unchanged

import csv
import sys

#===NEW========================
from more_itertools import unique_everseen
 
rules = [
    ('_A-REFERENTIE_', ['_01-REFERENTIEAANDUIDING_', '_02-REFERENTIE1_', '_03-REFERENTIE2_']),
    ('_B-ORGANISATIE_', ['_04-ORGANISATIE_']),
    ('_C-AFDELING1_', ['_05-AFDELING1_']),
    ('_D-AFDELING2_', ['_06-AFDELING2_']),
    ('_E-COMPLEETNAAM_', ['_07-AANHEF_', '_08-TITELVOOR_', '_09-VOORLETTERS_', '_10-VOORNAAM_', '_11-TUSSENVOEGSEL_', '_12-ACHTERNAAM1_', '_13-ACHTERNAAM2_', '_15-COMPLEETNAAM_', '_14-TITELNA_']),
    ('_F-COMPLEETADRES_', ['_19-COMPLEETADRES1_', '_16-STRAAT_', '_17-HUISNR_', '_18-HUISNRTV_']),
    ('_G-COMPLEETADRES2_', ['_20-COMPLEETADRES2_']),
    ('_H-COMPLEETPOSTPLAATS_', ['_21-POSTCODE_', '_22-POSTPLAATS_', '_23-POSTCODE2_', '_25-COMPLEETPOSTPLAATS_', '_24-LANDREGIO_']),
    ('_I-LAND_', ['_26-LAND_']),
    ('_J-KIX_', ['_28-KIX_']),
    ('_K-AANTAL_', ['_29-AANTAL_']),
    ('_L-VOLGNR_', ['_30-VOLGNR_']),
    ('_M-BUNDEL_', ['_31-BUNDEL_']),
    ('_N-BUNDELNR_', ['_32-BUNDELNR_']),
    ('_O-PALLET_', ['_33-PALLET_']),
    ('_P-PALLNR_', ['_34-PALLNR_']),
    ('_Q-SANDDCODE_', ['_35-SANDDCODE_']),
]

inverse_rules = { old: new for new, olds in rules for old in olds}
drules = dict(rules)

def merging_plan(headers):
    headers = list(headers)
    news = list(unique_everseen(inverse_rules.get(h, h) for h in headers))
    s = set(headers)
    plan = []
    for new in news:
        plan.append((new, [old for old in drules.get(new, [new]) if old in s]))
    return plan
 
def merge(plan, row):
    return {k: ' '.join(row[x] for x in v) for k, v in plan}
 
#===NEW_END========================


#inFile = sys.argv[1]
inFile = 'AdressenTest.csv'
#outFile = sys.argv[2]
outFile = 'AdressenTestOut.csv'

### open csv file
csvfile = open(inFile, "r" )
reader = csv.DictReader(csvfile, delimiter=';')

#===NEW========================
def main():
    ### compute the merging plan for a given sequence of input headers
    headers = reader.fieldnames
    # headers = ['header2', 'header5', 'header3', 'spam', 'header10', 'header13']
    # headers = ['_04-ORGANISATIE_', '_15-COMPLEETNAAM_', '_19-COMPLEETADRES1_', '_17-HUISNR_', '_21-POSTCODE_', '_22-POSTPLAATS_', '_36-NVT_', '_30-VOLGNR_', '_35-SANDDCODE_', '_31-BUNDEL_', '_32-BUNDELNR_', '_33-PALLET_', '_34-PALLNR_']
    plan = merging_plan(headers)
    print(plan)
     
    ### compute the merged row corresponding to an input row
    '''r = {'header2': 'v2', 'header5': 'v5',
         'header3': 'v3', 'spam': 'vspam',
         'header10': 'v10', 'header13': 'v13',}
    print(merge(plan, r))'''
 
if __name__ == '__main__':
    main()
#===NEW_END========================

### open output file
outfile = open(outFile, "w" )
# get a csv writer --  concat(nameFirst, nameLast) as full_name 

fieldnames=reader.fieldnames
# fieldnames = ["SD_VOLGNR","voorletters","achternaam","NewHeaderA","SD_SNDCODE"]


#writer = csv.DictWriter(outfile, delimiter=';', fieldnames=reader.fieldnames)
writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {} 
for n in writer.fieldnames:
    headers[n] = n
writer.writerow(headers)
for row in reader:
    writer.writerow(row)

csvfile.close()
outfile.close()

You need to translate the rows according to the plan, so replace the lines 77 to 88 by

fieldnames = [pair[0] for pair in plan]
 
writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {n: n for n in writer.fieldnames}
writer.writerow(headers)
for row in reader:
    new_row = merge(plan, row)
    writer.writerow(new_row)

Quote: File "/Users/jb/Studie/Python/AdresFileNormalizer5.py", line 79, in <module>
fieldnames = [pair[0] for pair in plan]
NameError: name 'plan' is not defined

Try to understand the code, not simply copy and paste. The plan must be created with

plan = merging_plan(reader.fieldnames)

Sorry, my bad! I tried in a hurry to change my script while being in a busy working environment.
I went out for a while to an emtpy office to concentrate, and now it is working.
I also altered the code to work as an oneliner in the terminal. Tested with about 100 different files and it works for all.
Thanks again. Below my resulting code

import csv
import sys
from more_itertools import unique_everseen
 
rules = [
    ('_A-REFERENTIE_', ['_01-REFERENTIEAANDUIDING_', '_02-REFERENTIE1_', '_03-REFERENTIE2_']),
    ('_B-ORGANISATIE_', ['_04-ORGANISATIE_']),
    ('_C-AFDELING1_', ['_05-AFDELING1_']),
    ('_D-AFDELING2_', ['_06-AFDELING2_']),
    ('_E-COMPLEETNAAM_', ['_07-AANHEF_', '_08-TITELVOOR_', '_09-VOORLETTERS_', '_10-VOORNAAM_', '_11-TUSSENVOEGSEL_', '_12-ACHTERNAAM1_', '_13-ACHTERNAAM2_', '_15-COMPLEETNAAM_', '_14-TITELNA_']),
    ('_F-COMPLEETADRES_', ['_19-COMPLEETADRES1_', '_16-STRAAT_', '_17-HUISNR_', '_18-HUISNRTV_']),
    ('_G-COMPLEETADRES2_', ['_20-COMPLEETADRES2_']),
    ('_H-COMPLEETPOSTPLAATS_', ['_21-POSTCODE_', '_22-POSTPLAATS_', '_23-POSTCODE2_', '_25-COMPLEETPOSTPLAATS_', '_24-LANDREGIO_']),
    ('_I-LAND_', ['_26-LAND_']),
    ('_J-KIX_', ['_28-KIX_']),
    ('_K-AANTAL_', ['_29-AANTAL_']),
    ('_L-VOLGNR_', ['_30-VOLGNR_']),
    ('_M-BUNDEL_', ['_31-BUNDEL_']),
    ('_N-BUNDELNR_', ['_32-BUNDELNR_']),
    ('_O-PALLET_', ['_33-PALLET_']),
    ('_P-PALLNR_', ['_34-PALLNR_']),
    ('_Q-SANDDCODE_', ['_35-SANDDCODE_']),
    ('_R-NVT_', ['_36-NVT_']),
]

inverse_rules = { old: new for new, olds in rules for old in olds}
drules = dict(rules)

def merging_plan(headers):
    headers = list(headers)
    news = list(unique_everseen(inverse_rules.get(h, h) for h in headers))
    s = set(headers)
    plan = []
    for new in news:
        plan.append((new, [old for old in drules.get(new, [new]) if old in s]))
    return plan
 
def merge(plan, row):
    return {k: ' '.join(row[x] for x in v) for k, v in plan}
 
inFile = sys.argv[1]

### open csv file
csvfile = open(inFile, "r" )
reader = csv.DictReader(csvfile, delimiter=';')

def main():
    ### compute the merging plan for a given sequence of input headers
    headers = reader.fieldnames
    plan = merging_plan(headers)
     
if __name__ == '__main__':
    main()

plan = merging_plan(reader.fieldnames)
fieldnames = [pair[0] for pair in plan]

outFile = sys.argv[2]
outfile = open(outFile, "w" )

writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {n: n for n in writer.fieldnames}
writer.writerow(headers)
for row in reader:
    new_row = merge(plan, row)
    writer.writerow(new_row)

csvfile.close()
outfile.close()

jbveenstra

Gribouillis

jbveenstra

Gribouillis

jbveenstra

Gribouillis

jbveenstra