Python Forum
Data cleaning help
#1
Hi,

I am new to Python and need help with data cleaning.

The objective is to scrape tables from a PDF file. That has been done with the tabula package, and I now have a CSV file.

In the original PDF file, the description can be long (up to 3-4 lines), as shown in the picture below.
[Image: PDF_table.jpg]

After scraping, this is what I get in my DataFrame.
[Image: Data_frame.jpg]

I need to combine the rows that belong to the same description.
For example, I need to combine index 4 and 5 so that the result reads as follows:
Index S/N Code Description Table
4 5 Description Change Breast, Lumps, Imaging Guided Vacuum Assisted Biopsy, Single lesion 2B

The index 5 row should also be deleted after it has been combined. Finally, I need a function that applies this to the whole DataFrame.

Please help.
Thanks
#2
What's actually in the CSV? Is "nan" literally in the file, or is that just how the DataFrame represents empty cells?
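
One quick way to check, as a rough sketch: read the same text twice, once with the pandas defaults and once with NaN conversion switched off. The sample CSV text below, including the code SB123, is made up for illustration and is not taken from the actual file.

```python
import io
import pandas as pd

# Hypothetical sample: the second data row is a continuation line with blank cells
raw = "S/N,Code,Description\n5,SB123,Breast Lumps\n,,Single lesion\n"

# Default behaviour: empty fields are converted to NaN
default = pd.read_csv(io.StringIO(raw))

# Keep everything as text and leave empty fields as empty strings
as_text = pd.read_csv(io.StringIO(raw), dtype=str, keep_default_na=False)

print(default["Code"].isna().tolist())
print(as_text["Code"].tolist())
```

If the second read shows empty strings where the first shows NaN, the file itself just has blank cells and the NaN comes from pandas.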

The easiest way would probably be to read through the file line by line, appending to the previous line's description whenever a certain column is missing, and writing each completed line out to a different file.

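In rough Python, something like this (the file names are placeholders; adjust the column names to match your CSV header):
import csv

with open("input.csv", newline="") as infile, open("output.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    trailing_line = None  # the row currently being built up
    for line in reader:
        # A non-empty Code marks the start of a new record
        if line["Code"] != "" or trailing_line is None:
            if trailing_line:
                writer.writerow(trailing_line)
            trailing_line = line
        else:
            # Continuation row: append its text to the record above
            trailing_line["Classification"] += " " + line["Classification"]
            trailing_line["Description"] += " " + line["Description"]
    if trailing_line:
        writer.writerow(trailing_line)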
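
If you would rather stay in pandas, the same idea can be sketched with a groupby. This is only a sketch under some assumptions: blank cells show up as NaN, a non-missing Code marks the start of each record, and Description is the only column that wraps. The function name merge_continuation_rows and the sample values are made up for illustration.

```python
import pandas as pd

def merge_continuation_rows(df, key="Code", text_cols=("Description",)):
    """Fold rows whose `key` is missing into the nearest row above that has one."""
    # Each row with a key starts a new record; cumsum gives every record a number
    group_id = df[key].notna().cumsum()

    def join_text(series):
        # Join the non-missing pieces of a wrapped cell with single spaces
        return " ".join(series.dropna().astype(str))

    # Wrapped text columns are joined; all others keep their first non-null value
    how = {col: (join_text if col in text_cols else "first") for col in df.columns}
    return df.groupby(group_id.to_numpy()).agg(how).reset_index(drop=True)

# Hypothetical sample: the third row continues the description of the second
df = pd.DataFrame({
    "S/N": ["4", "5", None],
    "Code": ["A1", "B2", None],
    "Description": ["Foo", "Breast, Lumps, Imaging Guided Vacuum",
                    "Assisted Biopsy, Single lesion"],
    "Table": ["1A", "2B", None],
})
merged = merge_continuation_rows(df)
print(merged)
```

The continuation row disappears because it is absorbed into its group, so there is no separate delete step. If the blanks are empty strings rather than NaN, replace them with NaN first, or change the notna() test accordingly.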