Data cleaning help

Hi,

I am new to Python and need help with data cleaning.

The objective is to scrape tables out of a PDF file. That has been done with the tabula package, and I now have a CSV file.
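For reference, the extraction step looked roughly like this (the file names and page range are placeholders for my actual ones):

import tabula

# Dump every table tabula finds in the PDF straight to a CSV file.
tabula.convert_into("report.pdf", "report.csv", output_format="csv", pages="all")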

In the original PDF file, the description can be long (up to 3-4 lines), as shown in the picture below.
[Image: PDF_table.jpg]

After scraping, this is what I get in my DataFrame.
[Image: Data_frame.jpg]

I need to combine rows that belong to the same description.
For example, I need to combine index 4 and 5 so that the row reads as follows:
Index S/N Code Description Table
4 5 Description Change Breast, Lumps, Imaging Guided Vacuum Assisted Biopsy, Single lesion 2B

After combining them, the index 5 row should be deleted. Finally, I need a way to apply this find-and-combine step to the whole DataFrame, not just one pair of rows.

Please help.
Thanks
What's actually in the CSV? Is "nan" literally in the file, or is that just how the DataFrame is representing missing values?

The easiest way to do it would probably be to read through the file line by line, appending to the previous line's description whenever a key column (e.g. Code) is empty, and writing each completed line out to a different file.

In rough form, using the csv module, something like:
import csv

with open("scraped.csv", newline="") as infile, \
     open("cleaned.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    trailing_line = None
    for line in reader:
        if line["Code"] != "":
            # New record: the previous one is now complete, so write it out.
            if trailing_line:
                writer.writerow(trailing_line)
            trailing_line = line
        else:
            # Continuation row: glue its text onto the previous record.
            trailing_line["Classification"] += " " + line["Classification"]
            trailing_line["Description"] += " " + line["Description"]
    if trailing_line:
        writer.writerow(trailing_line)
Adjust the column names to whatever your CSV header actually contains.
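If you'd rather do it directly in pandas instead of re-reading the CSV, the same idea can be sketched with cumsum() and groupby(). The column names here are guesses based on your screenshot, so adjust them to match your DataFrame:

import pandas as pd

df = pd.read_csv("scraped.csv")

# Rows that start a record have a value in "Code"; continuation rows have NaN there.
is_start = df["Code"].notna()

# Number the records so every continuation row shares an id with the row above it,
# then join the Description pieces and keep the first value of the other columns.
record_id = is_start.cumsum()
combined = (
    df.groupby(record_id)
      .agg({"S/N": "first", "Code": "first",
            "Description": lambda s: " ".join(s.dropna().astype(str)),
            "Table": "first"})
      .reset_index(drop=True)
)
combined.to_csv("cleaned.csv", index=False)

This drops the leftover continuation rows automatically, since each group collapses to a single row.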