Data cleaning help

Hi,

I am new to Python and need help with data cleaning.

The objective is to scrape tables out of a PDF file. That has been done with the tabula package, and I now have a CSV file.
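For reference, the extraction step looked roughly like this (the file names and page range are placeholders for my actual ones):

import tabula

# Dump every table tabula finds in the PDF straight to a CSV file.
tabula.convert_into("report.pdf", "report.csv", output_format="csv", pages="all")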

In the original PDF file, the description can be long (up to 3-4 lines), as shown in the picture below.
[Image: PDF_table.jpg]

After scraping, this is what I get in my DataFrame.
[Image: Data_frame.jpg]

I need to combine rows that belong to the same description.
For example, I need to combine index 4 and 5 so that the row reads as follows:
Index S/N Code Description Table
4 5 Description Change Breast, Lumps, Imaging Guided Vacuum Assisted Biopsy, Single lesion 2B

After combining them, the index 5 row should be deleted. Finally, I need a way to apply this find-and-combine step to the whole DataFrame, not just one pair of rows.

Please help.
Thanks
What's actually in the CSV? Is "nan" literally in the file, or is that just how the DataFrame is representing missing values?

The easiest way to do it would probably be to read through the file line by line, appending to the previous line's description whenever a key column (e.g. Code) is empty, and writing each completed line out to a different file.

In rough form, using the csv module, something like:
import csv

with open("scraped.csv", newline="") as infile, \
     open("cleaned.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    trailing_line = None
    for line in reader:
        if line["Code"] != "":
            # New record: the previous one is now complete, so write it out.
            if trailing_line:
                writer.writerow(trailing_line)
            trailing_line = line
        else:
            # Continuation row: glue its text onto the previous record.
            trailing_line["Classification"] += " " + line["Classification"]
            trailing_line["Description"] += " " + line["Description"]
    if trailing_line:
        writer.writerow(trailing_line)
Adjust the column names to whatever your CSV header actually contains.
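If you'd rather do it directly in pandas instead of re-reading the CSV, the same idea can be sketched with cumsum() and groupby(). The column names here are guesses based on your screenshot, so adjust them to match your DataFrame:

import pandas as pd

df = pd.read_csv("scraped.csv")

# Rows that start a record have a value in "Code"; continuation rows have NaN there.
is_start = df["Code"].notna()

# Number the records so every continuation row shares an id with the row above it,
# then join the Description pieces and keep the first value of the other columns.
record_id = is_start.cumsum()
combined = (
    df.groupby(record_id)
      .agg({"S/N": "first", "Code": "first",
            "Description": lambda s: " ".join(s.dropna().astype(str)),
            "Table": "first"})
      .reset_index(drop=True)
)
combined.to_csv("cleaned.csv", index=False)

This drops the leftover continuation rows automatically, since each group collapses to a single row.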