Python Forum
Python script merging some columns to one column with new name
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Python script merging some columns to one column with new name
#1
I have a lot of csv files with data and about 30 possible known headers, Header1 to Header30. The number of rows can be practically anything, from 1 to thousands. Not all the headers are present in each file. Seperator is a semicolon.

An example is Header1;Header3;Header8;Header10;Header11;Header17;Header18

Some of the headers have to be combined to one with a new headername.

The columns with Header1 to Header3, if one of them exists, should always be combined in column NewHeaderA devided by spaces.
The column with Header4 becomes column NewHeaderB
The column with Header5 becomes column NewHeaderC
The columns with Header6 to Header14, if one of them exists, should always be combined in column NewHeaderD devided by spaces.
And so on in more combinations for the rest of the headers.

So far I've made a script that reads a csv and writes out another file with all the same headers or part of the existing headers in a different order. After that I'm stuck. Some help would be appreciated.

Due to some system limitations, pandas or numpy are not installed. Just plain Python 3.7.6

import sys
inFile = sys.argv[1]
outFile = sys.argv[2]

# open csv file
csvfile = open(inFile, "r" )
reader = csv.DictReader(csvfile, delimiter=';')

# open output file
outfile = open(outFile, "w" )

# fieldnames=reader.fieldnames  --> uses all columns
fieldnames = ["Header1","Header4","Header5","Header8","Header19","test"] # --> uses the specific columns in the order provided, test becomes an emty column

# write the output to a file
writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {} 
for n in writer.fieldnames:
    headers[n] = n
writer.writerow(headers)
for row in reader:
    writer.writerow(row)

csvfile.close()
outfile.close()
Reply
#2
What you could do is manipulate the rows at input by defining a 'merging plan' depending on the input headers. For example suppose that the input file has the headers ['header2', 'header5', 'header3', 'spam', 'header10', 'header13']. Then the merging plan would be: combine header2 and header3 to make a column newheaderA, change the column header5 into a column newheaderC, leave the spam column unchanged and combine header10 and header13 into a newheaderD. This plan can be represented by the python list [('newheaderA', ['header2', 'header3']), ('newheaderC', ['header5']), ('spam', ['spam']), ('newheaderD', ['header10', 'header13'])].

The following code shows how one can automatically compute the merging plan from the input headers and how one can transform the input rows according to this plan. After that you can output the new rows as if they were the actual input rows, which you already know how to do.

I'm using the function more_itertools.unique_everseen(). If you don't want to import more_itertools, you can simply copy the implementation of unique_everseen that is given at the end of the official documentation page of module itertools.
from more_itertools import unique_everseen

rules = [
    ('newheaderA', ['header1', 'header2', 'header3']),
    ('newheaderB', ['header4']),
    ('newheaderC', ['header5']),
    ('newheaderD', ['header6', 'header7', 'header8',
                    'header9', 'header10', 'header11',
                    'header12', 'header13', 'header14']),
]

inverse_rules = { old: new for new, olds in rules for old in olds}
drules = dict(rules)

def merging_plan(headers):
    headers = list(headers)
    news = list(unique_everseen(inverse_rules.get(h, h) for h in headers))
    s = set(headers)
    plan = []
    for new in news:
        plan.append((new, [old for old in drules.get(new, [new]) if old in s]))
    return plan

def merge(plan, row):
    return {k: ' '.join(row[x] for x in v) for k, v in plan}

def main():
    # compute the merging plan for a given sequence of input headers
    headers = ['header2', 'header5', 'header3', 'spam', 'header10', 'header13']
    plan = merging_plan(headers)
    print(plan)
    
    # compute the merged row corresponding to an input row
    r = {'header2': 'v2', 'header5': 'v5',
         'header3': 'v3', 'spam': 'vspam',
         'header10': 'v10', 'header13': 'v13',}
    print(merge(plan, r))

if __name__ == '__main__':
    main()
Output:
[('newheaderA', ['header2', 'header3']), ('newheaderC', ['header5']), ('spam', ['spam']), ('newheaderD', ['header10', 'header13'])] {'spam': 'vspam', 'newheaderD': 'v10 v13', 'newheaderC': 'v5', 'newheaderA': 'v2 v3'}
Reply
#3
Hello Gribouillis,
Thanks for your reply, this helped a lot already. Im getting the idea of the replacement plan, exactly what I meant.
I edited my code with yours, and it now reads the headers of my input file and puts them in the plan.
I can't get the merging part to work. De merged columns should be written to the outputfile
(Yes, I'm a python beginner :) maybe to difficult things to start with, but I learn the most of the practice)
On rule 70: shouldn't that be a for loop, to run through all the rows?
On rule 83: I think that the merging should be done here, but how.
Code after this is unchanged

import csv
import sys

#===NEW========================
from more_itertools import unique_everseen
 
rules = [
    ('_A-REFERENTIE_', ['_01-REFERENTIEAANDUIDING_', '_02-REFERENTIE1_', '_03-REFERENTIE2_']),
    ('_B-ORGANISATIE_', ['_04-ORGANISATIE_']),
    ('_C-AFDELING1_', ['_05-AFDELING1_']),
    ('_D-AFDELING2_', ['_06-AFDELING2_']),
    ('_E-COMPLEETNAAM_', ['_07-AANHEF_', '_08-TITELVOOR_', '_09-VOORLETTERS_', '_10-VOORNAAM_', '_11-TUSSENVOEGSEL_', '_12-ACHTERNAAM1_', '_13-ACHTERNAAM2_', '_15-COMPLEETNAAM_', '_14-TITELNA_']),
    ('_F-COMPLEETADRES_', ['_19-COMPLEETADRES1_', '_16-STRAAT_', '_17-HUISNR_', '_18-HUISNRTV_']),
    ('_G-COMPLEETADRES2_', ['_20-COMPLEETADRES2_']),
    ('_H-COMPLEETPOSTPLAATS_', ['_21-POSTCODE_', '_22-POSTPLAATS_', '_23-POSTCODE2_', '_25-COMPLEETPOSTPLAATS_', '_24-LANDREGIO_']),
    ('_I-LAND_', ['_26-LAND_']),
    ('_J-KIX_', ['_28-KIX_']),
    ('_K-AANTAL_', ['_29-AANTAL_']),
    ('_L-VOLGNR_', ['_30-VOLGNR_']),
    ('_M-BUNDEL_', ['_31-BUNDEL_']),
    ('_N-BUNDELNR_', ['_32-BUNDELNR_']),
    ('_O-PALLET_', ['_33-PALLET_']),
    ('_P-PALLNR_', ['_34-PALLNR_']),
    ('_Q-SANDDCODE_', ['_35-SANDDCODE_']),
]

inverse_rules = { old: new for new, olds in rules for old in olds}
drules = dict(rules)

def merging_plan(headers):
    headers = list(headers)
    news = list(unique_everseen(inverse_rules.get(h, h) for h in headers))
    s = set(headers)
    plan = []
    for new in news:
        plan.append((new, [old for old in drules.get(new, [new]) if old in s]))
    return plan
 
def merge(plan, row):
    return {k: ' '.join(row[x] for x in v) for k, v in plan}
 
#===NEW_END========================


#inFile = sys.argv[1]
inFile = 'AdressenTest.csv'
#outFile = sys.argv[2]
outFile = 'AdressenTestOut.csv'

### open csv file
csvfile = open(inFile, "r" )
reader = csv.DictReader(csvfile, delimiter=';')

#===NEW========================
def main():
    ### compute the merging plan for a given sequence of input headers
    headers = reader.fieldnames
    # headers = ['header2', 'header5', 'header3', 'spam', 'header10', 'header13']
    # headers = ['_04-ORGANISATIE_', '_15-COMPLEETNAAM_', '_19-COMPLEETADRES1_', '_17-HUISNR_', '_21-POSTCODE_', '_22-POSTPLAATS_', '_36-NVT_', '_30-VOLGNR_', '_35-SANDDCODE_', '_31-BUNDEL_', '_32-BUNDELNR_', '_33-PALLET_', '_34-PALLNR_']
    plan = merging_plan(headers)
    print(plan)
     
    ### compute the merged row corresponding to an input row
    '''r = {'header2': 'v2', 'header5': 'v5',
         'header3': 'v3', 'spam': 'vspam',
         'header10': 'v10', 'header13': 'v13',}
    print(merge(plan, r))'''
 
if __name__ == '__main__':
    main()
#===NEW_END========================

### open output file
outfile = open(outFile, "w" )
# get a csv writer --  concat(nameFirst, nameLast) as full_name 

fieldnames=reader.fieldnames
# fieldnames = ["SD_VOLGNR","voorletters","achternaam","NewHeaderA","SD_SNDCODE"]


#writer = csv.DictWriter(outfile, delimiter=';', fieldnames=reader.fieldnames)
writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {} 
for n in writer.fieldnames:
    headers[n] = n
writer.writerow(headers)
for row in reader:
    writer.writerow(row)

csvfile.close()
outfile.close()
Reply
#4
You need to translate the rows according to the plan, so replace the lines 77 to 88 by
fieldnames = [pair[0] for pair in plan]
 
writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {n: n for n in writer.fieldnames}
writer.writerow(headers)
for row in reader:
    new_row = merge(plan, row)
    writer.writerow(new_row)
Reply
#5
Quote: File "/Users/jb/Studie/Python/AdresFileNormalizer5.py", line 79, in <module>
fieldnames = [pair[0] for pair in plan]
NameError: name 'plan' is not defined
Reply
#6
Try to understand the code, not simply copy and paste. The plan must be created with
plan = merging_plan(reader.fieldnames)
Reply
#7
Sorry, my bad! I tried in a hurry to change my script while being in a busy working environment.
I went out for a while to an emtpy office to concentrate, and now it is working.
I also altered the code to work as an oneliner in the terminal. Tested with about 100 different files and it works for all.
Thanks again. Below my resulting code
import csv
import sys
from more_itertools import unique_everseen
 
rules = [
    ('_A-REFERENTIE_', ['_01-REFERENTIEAANDUIDING_', '_02-REFERENTIE1_', '_03-REFERENTIE2_']),
    ('_B-ORGANISATIE_', ['_04-ORGANISATIE_']),
    ('_C-AFDELING1_', ['_05-AFDELING1_']),
    ('_D-AFDELING2_', ['_06-AFDELING2_']),
    ('_E-COMPLEETNAAM_', ['_07-AANHEF_', '_08-TITELVOOR_', '_09-VOORLETTERS_', '_10-VOORNAAM_', '_11-TUSSENVOEGSEL_', '_12-ACHTERNAAM1_', '_13-ACHTERNAAM2_', '_15-COMPLEETNAAM_', '_14-TITELNA_']),
    ('_F-COMPLEETADRES_', ['_19-COMPLEETADRES1_', '_16-STRAAT_', '_17-HUISNR_', '_18-HUISNRTV_']),
    ('_G-COMPLEETADRES2_', ['_20-COMPLEETADRES2_']),
    ('_H-COMPLEETPOSTPLAATS_', ['_21-POSTCODE_', '_22-POSTPLAATS_', '_23-POSTCODE2_', '_25-COMPLEETPOSTPLAATS_', '_24-LANDREGIO_']),
    ('_I-LAND_', ['_26-LAND_']),
    ('_J-KIX_', ['_28-KIX_']),
    ('_K-AANTAL_', ['_29-AANTAL_']),
    ('_L-VOLGNR_', ['_30-VOLGNR_']),
    ('_M-BUNDEL_', ['_31-BUNDEL_']),
    ('_N-BUNDELNR_', ['_32-BUNDELNR_']),
    ('_O-PALLET_', ['_33-PALLET_']),
    ('_P-PALLNR_', ['_34-PALLNR_']),
    ('_Q-SANDDCODE_', ['_35-SANDDCODE_']),
    ('_R-NVT_', ['_36-NVT_']),
]

inverse_rules = { old: new for new, olds in rules for old in olds}
drules = dict(rules)

def merging_plan(headers):
    headers = list(headers)
    news = list(unique_everseen(inverse_rules.get(h, h) for h in headers))
    s = set(headers)
    plan = []
    for new in news:
        plan.append((new, [old for old in drules.get(new, [new]) if old in s]))
    return plan
 
def merge(plan, row):
    return {k: ' '.join(row[x] for x in v) for k, v in plan}
 
inFile = sys.argv[1]

### open csv file
csvfile = open(inFile, "r" )
reader = csv.DictReader(csvfile, delimiter=';')

def main():
    ### compute the merging plan for a given sequence of input headers
    headers = reader.fieldnames
    plan = merging_plan(headers)
     
if __name__ == '__main__':
    main()

plan = merging_plan(reader.fieldnames)
fieldnames = [pair[0] for pair in plan]

outFile = sys.argv[2]
outfile = open(outFile, "w" )

writer = csv.DictWriter(outfile, delimiter=';', fieldnames=fieldnames, extrasaction='ignore')
headers = {n: n for n in writer.fieldnames}
writer.writerow(headers)
for row in reader:
    new_row = merge(plan, row)
    writer.writerow(new_row)

csvfile.close()
outfile.close()
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Converting column of values into muliple columns of counts highland44 0 251 Feb-01-2024, 12:48 AM
Last Post: highland44
  Is there a *.bat DOS batch script to *.py Python Script converter? pstein 3 3,177 Jun-29-2023, 11:57 AM
Last Post: gologica
  J2534 Python Can Bus merging natezoom 0 708 May-01-2023, 10:37 PM
Last Post: natezoom
  Reshaping a single column in to multiple column using Python sahar 7 2,041 Jun-20-2022, 12:35 PM
Last Post: deanhystad
  df column aggregate and group by multiple columns SriRajesh 0 1,034 May-06-2022, 02:26 PM
Last Post: SriRajesh
  Transform 3 Columns into Single Column DaveG 8 1,862 Apr-04-2022, 08:42 AM
Last Post: Pedroski55
  Split single column to multiple columns SriRajesh 1 1,319 Jan-07-2022, 06:43 PM
Last Post: jefsummers
  How to remove a column or two columns in a correlation heatmap? lulu43366 3 5,196 Sep-30-2021, 03:47 PM
Last Post: lulu43366
  Merging spreadsheets with the same columns and extracting rows with matching entries johnbernard 3 9,377 Aug-19-2021, 03:08 PM
Last Post: johnbernard
  Index error - columns vs non-column Vinny 3 4,910 Aug-09-2021, 04:46 PM
Last Post: snippsat

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020