Python Forum
Structuring a semi structured dataset
Thread Rating:
  • 1 Vote(s) - 3 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Structuring a semi structured dataset
#1
I have a CSV file containing semi structured dataset. Sample dataset is attached in a CSV file.

df = pd.DataFrame([[np.nan,np.nan,np.nan,np.nan,np.nan], [np.nan,'JOB NO  : ','E1402CJ00001',np.nan,'PROJECT HOTEL BLUE',np.nan],[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],[np.nan,'MRN',np.nan,np.nan,np.nan,np.nan],[np.nan,'MREN1402/316',np.nan,'2014-02-28','EK00NEL047','PVC CASING 24 x 14'],[np.nan,'MREN1402/316',np.nan,'2014-02-28','EK00NEL048','PVC CASING 40 x 16'],[np.nan,'ISSUES',np.nan,np.nan,np.nan,np.nan],[np.nan,'ISEN1402/340',np.nan,'2014-02-28','EK00NEL047','PVC CASING 24 x 14'],[np.nan,'ISEN1402/340',np.nan,'2014-02-28','EK00NEL048','PVC CASING 40 x 16']], columns=['BlankColumn1','REFERENCE','BlankColumn2','DATE','ITEM CODE','ITEM NAME'])

 I want to structure the dataset in order to use for further analysis.
Output:
     JOB NO              JOB NAME                      TYPE      REFERENCE        DATE         ITEM CODE         ITEM NAME                   E1402CJ00001  PROJECT HOTEL BLUE   MRN       MREN1402/316  28-02-14 EK00NEL047    PVC CASING 24 x 14                   E1402CJ00001  PROJECT HOTEL BLUE   MRN      MREN1402/316  28-02-14 EK00NEL048     PVC CASING 40 x 16                   E1402CJ00001  PROJECT HOTEL BLUE   ISSUES  ISEN1402/340    28-02-14 EK00NEL047    PVC CASING 24 x 14                   E1402CJ00001  PROJECT HOTEL BLUE   ISSUES  ISEN1402/340    28-02-14 EK00NEL048    PVC CASING 40 x 16
1. How can i copy the job no & job name to each data row until the next job no apperas?
2. How can i copy the Type of the data row which appears just before the data row untill the next type of the data row appears?
3. Then i want to delete the rows that does not contain data in Reference, Date & Item Code columns

Thank you very much for making an effort to help on this.

Attached Files

.csv   Sample.csv (Size: 338 bytes / Downloads: 90)
Reply
#2
If it gets more complicated, there are some parsing libraries, but for files like yours it is probably easiest to iterate over lines, check if it is job/type line  or casual line and process it accordingly. Example assuming that job or type is always preceded by "empty" line and that there are no other "empty" lines.
def feeder(fname):          # just gives splitted row one by one
    with open(fname) as f:
        for row in f:
            yield row.strip().split(',')

def parse_rows(rows):
    job_info, type_info = ["JOB NO", "JOB NAME"], "TYPE"  # header
    for row in rows:
        if not row[0]:                                # "empty" line
            row = next(rows)
            if row[0] == "JOB NO  : ":                # sets job
                job_info = [row[1], row[3]]
            else:                                     # sets type
                type_info = row[0]
        else:
            yield job_info + [type_info] + row[:1] + row[2:]
You can load it into pandas with:
rows = list(parse_rows(feeder('Sample.csv')))
df = pd.DataFrame(rows[1:], columns=rows[0])
Output:
In [13]: df Out[13]:          JOB NO            JOB NAME    TYPE     REFERENCE      DATE   ITEM CODE           ITEM NAME 0  E1402CJ00001  PROJECT HOTEL BLUE     MRN  MREN1402/316  28-02-14  EK00NEL047  PVC CASING 24 x 14 1  E1402CJ00001  PROJECT HOTEL BLUE     MRN  MREN1402/316  28-02-14  EK00NEL048  PVC CASING 40 x 16 2  E1402CJ00001  PROJECT HOTEL BLUE  ISSUES  ISEN1402/340  28-02-14  EK00NEL047  PVC CASING 24 x 14 3  E1402CJ00001  PROJECT HOTEL BLUE  ISSUES  ISEN1402/340  28-02-14  EK00NEL048  PVC CASING 40 x 16
Or save it like a .csv file:
import csv
with open('output.csv', 'w') as f:
    csv.writer(f).writerows(parse_rows(feeder('Sample.csv')))
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  How to fill datetime64 field in numpy structured array? AlekseyPython 0 2,238 Oct-20-2020, 08:17 AM
Last Post: AlekseyPython
  Numpy structured array Nitram 3 2,791 Sep-17-2019, 10:09 PM
Last Post: jefsummers

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020