![]() |
Extracting Data into Columns using pdfplumber - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: Extracting Data into Columns using pdfplumber (/thread-38947.html) Pages:
1
2
|
Extracting Data into Columns using pdfplumber - arvin - Dec-13-2022 I have a pdf which has data in tabular format and has 6 columns but the columns are not separated by boundaries so when I extract the data using pdfplumber, all the data comes in one cell only and I want in separate cells. How could I do that? For your reference: Quote: Thanks in advance RE: Extracting Data into Columns using pdfplumber - Pedroski55 - Dec-13-2022 Looks like each line of your data is separated by a space. If that is so, just use .split() on each line to get a list of each item of data in the line. The list can then be written to Excel or, I remember Menator mentioned a module called tabulate which formats text data nicely! RE: Extracting Data into Columns using pdfplumber - arvin - Dec-13-2022 I thought of the same thing, but some lines contain more data and some lines contain less. RE: Extracting Data into Columns using pdfplumber - DPaul - Dec-13-2022 Which method do you use to extract the data from a page? Paul RE: Extracting Data into Columns using pdfplumber - arvin - Dec-13-2022 Quote:table = pdf.pages[0].extract_table() RE: Extracting Data into Columns using pdfplumber - Pedroski55 - Dec-14-2022 What exactly is table? Actual text? Bytes? Or some kind of pdfplumber object? Are column headers included in table? RE: Extracting Data into Columns using pdfplumber - arvin - Dec-14-2022 ![]() RE: Extracting Data into Columns using pdfplumber - Pedroski55 - Dec-15-2022 I'm starting to understand your problem! Using reportlab I made a very simple pdf with 1 table of 4 rows and 5 columns. The table had borders. The data for the table looks like this: Quote:data= [['00', '', '02', '', '04'], As you can see, the columns 2 and 4 contain no data, just '' Using this, I got the data from the table: import pdfplumber # Parse pdf with pdfplumber.open(path2pdf + savename2) as pdf: # Get the first page of the object page = pdf.pages[0] # Get the text data of the page text = page.extract_text() # Get all the tabular data of this page tables = page.extract_tables() # Traversing table for t_index in range(len(tables)): table = tables[t_index] # Traversing each row of data for data in table: print(data)Which gives this as output, you can see the empty values, so they should be in your pandas df as NaN, I suppose. But when I make a table with no borders, there is no table, just text:Quote:>>> tables Now how to know which columns in which rows have nothing in them?? Think think! RE: Extracting Data into Columns using pdfplumber - arvin - Dec-15-2022 I also tried Tabula, but I had no luck. RE: Extracting Data into Columns using pdfplumber - Pedroski55 - Dec-15-2022 Try this then look at table! import pdfplumber import re path2pdf = '/home/pedro/myPython/reportlab_python/my_pdfs/' savename1 = "simple_table1.pdf" # no borders pdf = pdfplumber.open(path2pdf + savename1) for page in pdf.pages: print(page.extract_text()) for pdf_table in page.extract_tables(): table = [] cells = [] for row in pdf_table: if not any(row): # If a line is all empty, it is considered to be the end of a record if any(cells): table.append(cells) cells = [] elif all(row): # If the line is not empty, then this line is a new line, the last one ends if any(cells): table.append(cells) cells = [] table.append(row) else: if len(cells) == 0: cells = row else: for i in range(len(row)): if row[i] is not None: cells[i] = row[i] if cells[i] is None else cells[i] + row[i] for row in table: print([re.sub(' +', ' ', cell) if cell is not None else None for cell in row]) #print([re.sub(' +', ' ', cell) if cell is not None else ' ' for cell in row]) print('---------- Dividing line ----------') pdf.close() Then you can put table in pandas, I hope!
|