Python script merging some columns to one column with new name

jbveenstra · Feb-11-2020, 09:15 PM

I have a lot of csv files with data and about 30 possible known headers, Header1 to Header30. The number of rows can be practically anything, from 1 to thousands. Not all the headers are present in each file. Seperator is a semicolon.

An example is Header1;Header3;Header8;Header10;Header11;Header17;Header18

Some of the headers have to be combined to one with a new headername.

The columns with Header1 to Header3, if one of them exists, should always be combined in column NewHeaderA devided by spaces.
The column with Header4 becomes column NewHeaderB
The column with Header5 becomes column NewHeaderC
The columns with Header6 to Header14, if one of them exists, should always be combined in column NewHeaderD devided by spaces.
And so on in more combinations for the rest of the headers.

So far I've made a script that reads a csv and writes out another file with all the same headers or part of the existing headers in a different order. After that I'm stuck. Some help would be appreciated.

Due to some system limitations, pandas or numpy are not installed. Just plain Python 3.7.6

import sys
inFile = sys.argv[1]
outFile = sys.argv[2]

# open csv file
csvfile = open(inFile, "r" )
reader = csv.DictReader(csvfile, delimiter=';')

# open output file
outfile = open(outFile, "w" )

# fieldnames=reader.fieldnames  --> uses all columns
fieldnames = ["Header1","Header4","Header5","Header8","Header19","test"] # --> uses the specific columns in the order provided, test becomes an emty column

# write the output to a file
writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {} 
for n in writer.fieldnames:
    headers[n] = n
writer.writerow(headers)
for row in reader:
    writer.writerow(row)

csvfile.close()
outfile.close()

**Gribouillis** · (This post was last modified: Feb-11-2020, 11:31 PM by Gribouillis.)

What you could do is manipulate the rows at input by defining a 'merging plan' depending on the input headers. For example suppose that the input file has the headers ['header2', 'header5', 'header3', 'spam', 'header10', 'header13']. Then the merging plan would be: combine header2 and header3 to make a column newheaderA, change the column header5 into a column newheaderC, leave the spam column unchanged and combine header10 and header13 into a newheaderD. This plan can be represented by the python list

[('newheaderA', ['header2', 'header3']), ('newheaderC', ['header5']), ('spam', ['spam']), ('newheaderD', ['header10', 'header13'])]

.

The following code shows how one can automatically compute the merging plan from the input headers and how one can transform the input rows according to this plan. After that you can output the new rows as if they were the actual input rows, which you already know how to do.

I'm using the function more_itertools.unique_everseen(). If you don't want to import more_itertools, you can simply copy the implementation of unique_everseen that is given at the end of the official documentation page of module itertools.

from more_itertools import unique_everseen

rules = [
    ('newheaderA', ['header1', 'header2', 'header3']),
    ('newheaderB', ['header4']),
    ('newheaderC', ['header5']),
    ('newheaderD', ['header6', 'header7', 'header8',
                    'header9', 'header10', 'header11',
                    'header12', 'header13', 'header14']),
]

inverse_rules = { old: new for new, olds in rules for old in olds}
drules = dict(rules)

def merging_plan(headers):
    headers = list(headers)
    news = list(unique_everseen(inverse_rules.get(h, h) for h in headers))
    s = set(headers)
    plan = []
    for new in news:
        plan.append((new, [old for old in drules.get(new, [new]) if old in s]))
    return plan

def merge(plan, row):
    return {k: ' '.join(row[x] for x in v) for k, v in plan}

def main():
    # compute the merging plan for a given sequence of input headers
    headers = ['header2', 'header5', 'header3', 'spam', 'header10', 'header13']
    plan = merging_plan(headers)
    print(plan)
    
    # compute the merged row corresponding to an input row
    r = {'header2': 'v2', 'header5': 'v5',
         'header3': 'v3', 'spam': 'vspam',
         'header10': 'v10', 'header13': 'v13',}
    print(merge(plan, r))

if __name__ == '__main__':
    main()

Output:[('newheaderA', ['header2', 'header3']), ('newheaderC', ['header5']), ('spam', ['spam']), ('newheaderD', ['header10', 'header13'])]
{'spam': 'vspam', 'newheaderD': 'v10 v13', 'newheaderC': 'v5', 'newheaderA': 'v2 v3'}

jbveenstra · (This post was last modified: Feb-13-2020, 09:57 AM by jbveenstra.)

Hello Gribouillis,
Thanks for your reply, this helped a lot already. Im getting the idea of the replacement plan, exactly what I meant.
I edited my code with yours, and it now reads the headers of my input file and puts them in the plan.
I can't get the merging part to work. De merged columns should be written to the outputfile
(Yes, I'm a python beginner :) maybe to difficult things to start with, but I learn the most of the practice)
On rule 70: shouldn't that be a for loop, to run through all the rows?
On rule 83: I think that the merging should be done here, but how.
Code after this is unchanged

import csv
import sys

#===NEW========================
from more_itertools import unique_everseen
 
rules = [
    ('_A-REFERENTIE_', ['_01-REFERENTIEAANDUIDING_', '_02-REFERENTIE1_', '_03-REFERENTIE2_']),
    ('_B-ORGANISATIE_', ['_04-ORGANISATIE_']),
    ('_C-AFDELING1_', ['_05-AFDELING1_']),
    ('_D-AFDELING2_', ['_06-AFDELING2_']),
    ('_E-COMPLEETNAAM_', ['_07-AANHEF_', '_08-TITELVOOR_', '_09-VOORLETTERS_', '_10-VOORNAAM_', '_11-TUSSENVOEGSEL_', '_12-ACHTERNAAM1_', '_13-ACHTERNAAM2_', '_15-COMPLEETNAAM_', '_14-TITELNA_']),
    ('_F-COMPLEETADRES_', ['_19-COMPLEETADRES1_', '_16-STRAAT_', '_17-HUISNR_', '_18-HUISNRTV_']),
    ('_G-COMPLEETADRES2_', ['_20-COMPLEETADRES2_']),
    ('_H-COMPLEETPOSTPLAATS_', ['_21-POSTCODE_', '_22-POSTPLAATS_', '_23-POSTCODE2_', '_25-COMPLEETPOSTPLAATS_', '_24-LANDREGIO_']),
    ('_I-LAND_', ['_26-LAND_']),
    ('_J-KIX_', ['_28-KIX_']),
    ('_K-AANTAL_', ['_29-AANTAL_']),
    ('_L-VOLGNR_', ['_30-VOLGNR_']),
    ('_M-BUNDEL_', ['_31-BUNDEL_']),
    ('_N-BUNDELNR_', ['_32-BUNDELNR_']),
    ('_O-PALLET_', ['_33-PALLET_']),
    ('_P-PALLNR_', ['_34-PALLNR_']),
    ('_Q-SANDDCODE_', ['_35-SANDDCODE_']),
]

inverse_rules = { old: new for new, olds in rules for old in olds}
drules = dict(rules)

def merging_plan(headers):
    headers = list(headers)
    news = list(unique_everseen(inverse_rules.get(h, h) for h in headers))
    s = set(headers)
    plan = []
    for new in news:
        plan.append((new, [old for old in drules.get(new, [new]) if old in s]))
    return plan
 
def merge(plan, row):
    return {k: ' '.join(row[x] for x in v) for k, v in plan}
 
#===NEW_END========================


#inFile = sys.argv[1]
inFile = 'AdressenTest.csv'
#outFile = sys.argv[2]
outFile = 'AdressenTestOut.csv'

### open csv file
csvfile = open(inFile, "r" )
reader = csv.DictReader(csvfile, delimiter=';')

#===NEW========================
def main():
    ### compute the merging plan for a given sequence of input headers
    headers = reader.fieldnames
    # headers = ['header2', 'header5', 'header3', 'spam', 'header10', 'header13']
    # headers = ['_04-ORGANISATIE_', '_15-COMPLEETNAAM_', '_19-COMPLEETADRES1_', '_17-HUISNR_', '_21-POSTCODE_', '_22-POSTPLAATS_', '_36-NVT_', '_30-VOLGNR_', '_35-SANDDCODE_', '_31-BUNDEL_', '_32-BUNDELNR_', '_33-PALLET_', '_34-PALLNR_']
    plan = merging_plan(headers)
    print(plan)
     
    ### compute the merged row corresponding to an input row
    '''r = {'header2': 'v2', 'header5': 'v5',
         'header3': 'v3', 'spam': 'vspam',
         'header10': 'v10', 'header13': 'v13',}
    print(merge(plan, r))'''
 
if __name__ == '__main__':
    main()
#===NEW_END========================

### open output file
outfile = open(outFile, "w" )
# get a csv writer --  concat(nameFirst, nameLast) as full_name 

fieldnames=reader.fieldnames
# fieldnames = ["SD_VOLGNR","voorletters","achternaam","NewHeaderA","SD_SNDCODE"]


#writer = csv.DictWriter(outfile, delimiter=';', fieldnames=reader.fieldnames)
writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {} 
for n in writer.fieldnames:
    headers[n] = n
writer.writerow(headers)
for row in reader:
    writer.writerow(row)

csvfile.close()
outfile.close()

**Gribouillis** · Feb-13-2020, 10:31 AM

You need to translate the rows according to the plan, so replace the lines 77 to 88 by

fieldnames = [pair[0] for pair in plan]
 
writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {n: n for n in writer.fieldnames}
writer.writerow(headers)
for row in reader:
    new_row = merge(plan, row)
    writer.writerow(new_row)

jbveenstra · Feb-13-2020, 10:58 AM

Quote: File "/Users/jb/Studie/Python/AdresFileNormalizer5.py", line 79, in <module>
fieldnames = [pair[0] for pair in plan]
NameError: name 'plan' is not defined

**Gribouillis** · (This post was last modified: Feb-13-2020, 11:44 AM by Gribouillis.)

Try to understand the code, not simply copy and paste. The plan must be created with

plan = merging_plan(reader.fieldnames)

jbveenstra · Feb-13-2020, 02:01 PM

Sorry, my bad! I tried in a hurry to change my script while being in a busy working environment.
I went out for a while to an emtpy office to concentrate, and now it is working.
I also altered the code to work as an oneliner in the terminal. Tested with about 100 different files and it works for all.
Thanks again. Below my resulting code

import csv
import sys
from more_itertools import unique_everseen
 
rules = [
    ('_A-REFERENTIE_', ['_01-REFERENTIEAANDUIDING_', '_02-REFERENTIE1_', '_03-REFERENTIE2_']),
    ('_B-ORGANISATIE_', ['_04-ORGANISATIE_']),
    ('_C-AFDELING1_', ['_05-AFDELING1_']),
    ('_D-AFDELING2_', ['_06-AFDELING2_']),
    ('_E-COMPLEETNAAM_', ['_07-AANHEF_', '_08-TITELVOOR_', '_09-VOORLETTERS_', '_10-VOORNAAM_', '_11-TUSSENVOEGSEL_', '_12-ACHTERNAAM1_', '_13-ACHTERNAAM2_', '_15-COMPLEETNAAM_', '_14-TITELNA_']),
    ('_F-COMPLEETADRES_', ['_19-COMPLEETADRES1_', '_16-STRAAT_', '_17-HUISNR_', '_18-HUISNRTV_']),
    ('_G-COMPLEETADRES2_', ['_20-COMPLEETADRES2_']),
    ('_H-COMPLEETPOSTPLAATS_', ['_21-POSTCODE_', '_22-POSTPLAATS_', '_23-POSTCODE2_', '_25-COMPLEETPOSTPLAATS_', '_24-LANDREGIO_']),
    ('_I-LAND_', ['_26-LAND_']),
    ('_J-KIX_', ['_28-KIX_']),
    ('_K-AANTAL_', ['_29-AANTAL_']),
    ('_L-VOLGNR_', ['_30-VOLGNR_']),
    ('_M-BUNDEL_', ['_31-BUNDEL_']),
    ('_N-BUNDELNR_', ['_32-BUNDELNR_']),
    ('_O-PALLET_', ['_33-PALLET_']),
    ('_P-PALLNR_', ['_34-PALLNR_']),
    ('_Q-SANDDCODE_', ['_35-SANDDCODE_']),
    ('_R-NVT_', ['_36-NVT_']),
]

inverse_rules = { old: new for new, olds in rules for old in olds}
drules = dict(rules)

def merging_plan(headers):
    headers = list(headers)
    news = list(unique_everseen(inverse_rules.get(h, h) for h in headers))
    s = set(headers)
    plan = []
    for new in news:
        plan.append((new, [old for old in drules.get(new, [new]) if old in s]))
    return plan
 
def merge(plan, row):
    return {k: ' '.join(row[x] for x in v) for k, v in plan}
 
inFile = sys.argv[1]

### open csv file
csvfile = open(inFile, "r" )
reader = csv.DictReader(csvfile, delimiter=';')

def main():
    ### compute the merging plan for a given sequence of input headers
    headers = reader.fieldnames
    plan = merging_plan(headers)
     
if __name__ == '__main__':
    main()

plan = merging_plan(reader.fieldnames)
fieldnames = [pair[0] for pair in plan]

outFile = sys.argv[2]
outfile = open(outFile, "w" )

writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {n: n for n in writer.fieldnames}
writer.writerow(headers)
for row in reader:
    new_row = merge(plan, row)
    writer.writerow(new_row)

csvfile.close()
outfile.close()

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Converting column of values into muliple columns of counts	highland44	0	251	Feb-01-2024, 12:48 AM Last Post: highland44
	Is there a .bat DOS batch script to .py Python Script converter?	pstein	3	3,177	Jun-29-2023, 11:57 AM Last Post: gologica
	J2534 Python Can Bus merging	natezoom	0	708	May-01-2023, 10:37 PM Last Post: natezoom
	Reshaping a single column in to multiple column using Python	sahar	7	2,041	Jun-20-2022, 12:35 PM Last Post: deanhystad
	df column aggregate and group by multiple columns	SriRajesh	0	1,034	May-06-2022, 02:26 PM Last Post: SriRajesh
	Transform 3 Columns into Single Column	DaveG	8	1,862	Apr-04-2022, 08:42 AM Last Post: Pedroski55
	Split single column to multiple columns	SriRajesh	1	1,319	Jan-07-2022, 06:43 PM Last Post: jefsummers
	How to remove a column or two columns in a correlation heatmap?	lulu43366	3	5,196	Sep-30-2021, 03:47 PM Last Post: lulu43366
	Merging spreadsheets with the same columns and extracting rows with matching entries	johnbernard	3	9,377	Aug-19-2021, 03:08 PM Last Post: johnbernard
	Index error - columns vs non-column	Vinny	3	4,910	Aug-09-2021, 04:46 PM Last Post: snippsat

Python script merging some columns to one column with new name

User Panel Messages

Announcements