Structuring a semi structured dataset

klllmmm · Apr-16-2017, 03:15 AM

I have a CSV file containing semi structured dataset. Sample dataset is attached in a CSV file.

df = pd.DataFrame([[np.nan,np.nan,np.nan,np.nan,np.nan], [np.nan,'JOB NO  : ','E1402CJ00001',np.nan,'PROJECT HOTEL BLUE',np.nan],[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],[np.nan,'MRN',np.nan,np.nan,np.nan,np.nan],[np.nan,'MREN1402/316',np.nan,'2014-02-28','EK00NEL047','PVC CASING 24 x 14'],[np.nan,'MREN1402/316',np.nan,'2014-02-28','EK00NEL048','PVC CASING 40 x 16'],[np.nan,'ISSUES',np.nan,np.nan,np.nan,np.nan],[np.nan,'ISEN1402/340',np.nan,'2014-02-28','EK00NEL047','PVC CASING 24 x 14'],[np.nan,'ISEN1402/340',np.nan,'2014-02-28','EK00NEL048','PVC CASING 40 x 16']], columns=['BlankColumn1','REFERENCE','BlankColumn2','DATE','ITEM CODE','ITEM NAME'])

I want to structure the dataset in order to use for further analysis.

Output:      JOB NO              JOB NAME                      TYPE      REFERENCE        DATE         ITEM CODE         ITEM NAME


                  E1402CJ00001  PROJECT HOTEL BLUE   MRN       MREN1402/316  28-02-14 EK00NEL047    PVC CASING 24 x 14
                  E1402CJ00001  PROJECT HOTEL BLUE   MRN      MREN1402/316  28-02-14 EK00NEL048     PVC CASING 40 x 16
                  E1402CJ00001  PROJECT HOTEL BLUE   ISSUES  ISEN1402/340    28-02-14 EK00NEL047    PVC CASING 24 x 14
                  E1402CJ00001  PROJECT HOTEL BLUE   ISSUES  ISEN1402/340    28-02-14 EK00NEL048    PVC CASING 40 x 16

1. How can i copy the job no & job name to each data row until the next job no apperas?
2. How can i copy the Type of the data row which appears just before the data row untill the next type of the data row appears?
3. Then i want to delete the rows that does not contain data in Reference, Date & Item Code columns

Thank you very much for making an effort to help on this.

***zivoni*** · Apr-16-2017, 11:53 AM

If it gets more complicated, there are some parsing libraries, but for files like yours it is probably easiest to iterate over lines, check if it is job/type line or casual line and process it accordingly. Example assuming that job or type is always preceded by "empty" line and that there are no other "empty" lines.

def feeder(fname):          # just gives splitted row one by one
    with open(fname) as f:
        for row in f:
            yield row.strip().split(',')

def parse_rows(rows):
    job_info, type_info = ["JOB NO", "JOB NAME"], "TYPE"  # header
    for row in rows:
        if not row[0]:                                # "empty" line
            row = next(rows)
            if row[0] == "JOB NO  : ":                # sets job
                job_info = [row[1], row[3]]
            else:                                     # sets type
                type_info = row[0]
        else:
            yield job_info + [type_info] + row[:1] + row[2:]

You can load it into pandas with:

rows = list(parse_rows(feeder('Sample.csv')))
df = pd.DataFrame(rows[1:], columns=rows[0])

Output:In [13]: df
Out[13]: 
         JOB NO            JOB NAME    TYPE     REFERENCE      DATE   ITEM CODE           ITEM NAME
0  E1402CJ00001  PROJECT HOTEL BLUE     MRN  MREN1402/316  28-02-14  EK00NEL047  PVC CASING 24 x 14
1  E1402CJ00001  PROJECT HOTEL BLUE     MRN  MREN1402/316  28-02-14  EK00NEL048  PVC CASING 40 x 16
2  E1402CJ00001  PROJECT HOTEL BLUE  ISSUES  ISEN1402/340  28-02-14  EK00NEL047  PVC CASING 24 x 14
3  E1402CJ00001  PROJECT HOTEL BLUE  ISSUES  ISEN1402/340  28-02-14  EK00NEL048  PVC CASING 40 x 16

Or save it like a .csv file:

import csv
with open('output.csv', 'w') as f:
    csv.writer(f).writerows(parse_rows(feeder('Sample.csv')))

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	(Semi)automatic model selection	betop316	1	953	Mar-25-2025, 04:57 PM Last Post: Larz60+
	How to fill datetime64 field in numpy structured array?	AlekseyPython	0	3,036	Oct-20-2020, 08:17 AM Last Post: AlekseyPython
	Numpy structured array	Nitram	3	3,740	Sep-17-2019, 10:09 PM Last Post: jefsummers

Structuring a semi structured dataset

User Panel Messages

Announcements