Extracting Data into Columns using pdfplumber

arvin · Dec-13-2022, 06:35 AM

I have a pdf which has data in tabular format and has 6 columns but the columns are not separated by boundaries so when I extract the data using pdfplumber, all the data comes in one cell only and I want in separate cells.

How could I do that?

For your reference:

Quote:
15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr
11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr
11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr

Thanks in advance

Pedroski55 · Dec-13-2022, 09:21 AM

Looks like each line of your data is separated by a space.

If that is so, just use .split() on each line to get a list of each item of data in the line.

The list can then be written to Excel or, I remember Menator mentioned a module called tabulate which formats text data nicely!

arvin · Dec-13-2022, 10:00 AM

I thought of the same thing, but some lines contain more data and some lines contain less.

DPaul · Dec-13-2022, 10:13 AM

Which method do you use to extract the data from a page?
Paul

arvin · Dec-13-2022, 10:23 AM

Quote:table = pdf.pages[0].extract_table()
df = pd.DataFrame(table[1:])

Pedroski55 · Dec-14-2022, 07:02 AM

What exactly is table? Actual text? Bytes? Or some kind of pdfplumber object?

Are column headers included in table?

arvin · Dec-14-2022, 09:33 AM

Pedroski55 · Dec-15-2022, 05:27 AM

I'm starting to understand your problem!

Using reportlab I made a very simple pdf with 1 table of 4 rows and 5 columns. The table had borders.

The data for the table looks like this:

Quote:data= [['00', '', '02', '', '04'],
['10', '', '12', '', '14'],
['20', '', '22', '', '24'],
['30', '', '32', '', '34']]

As you can see, the columns 2 and 4 contain no data, just ''

Using this, I got the data from the table:

import pdfplumber
# Parse pdf
with pdfplumber.open(path2pdf + savename2) as pdf:
         # Get the first page of the object
    page = pdf.pages[0]
         # Get the text data of the page
    text = page.extract_text()
         # Get all the tabular data of this page
    tables = page.extract_tables()
         # Traversing table
    for t_index in range(len(tables)):
        table = tables[t_index]
                 # Traversing each row of data
        for data in table:
            print(data)

Which gives this as output, you can see the empty values, so they should be in your pandas df as NaN, I suppose.

Output:['00', '', '02', '', '04']
['10', '', '12', '', '14']
['20', '', '22', '', '24']
['30', '', '32', '', '34']

But when I make a table with no borders, there is no table, just text:

Quote:>>> tables
[]
>>> text
'00 02 04\n11 13\n20 22 23 24\n30 31 32 34'
>>>

Now how to know which columns in which rows have nothing in them??

Think think!

arvin · Dec-15-2022, 07:19 AM

I also tried Tabula, but I had no luck.

Pedroski55 · Dec-15-2022, 08:20 AM

Try this then look at table!

import pdfplumber
import re

path2pdf = '/home/pedro/myPython/reportlab_python/my_pdfs/'
savename1 = "simple_table1.pdf" # no borders

pdf = pdfplumber.open(path2pdf + savename1)
 
for page in pdf.pages:
    print(page.extract_text())
    for pdf_table in page.extract_tables():
        table = []
        cells = []
        for row in pdf_table:
            if not any(row):
                # If a line is all empty, it is considered to be the end of a record
                if any(cells):
                    table.append(cells)
                    cells = []
            elif all(row):
                # If the line is not empty, then this line is a new line, the last one ends
                if any(cells):
                    table.append(cells)
                    cells = []
                table.append(row)
            else:
                if len(cells) == 0:
                    cells = row
                else:
                    for i in range(len(row)):
                        if row[i] is not None:
                            cells[i] = row[i] if cells[i] is None else cells[i] + row[i]
        for row in table:
            print([re.sub(' +', ' ', cell) if cell is not None else None for cell in row])
            #print([re.sub(' +', ' ', cell) if cell is not None else ' ' for cell in row])
        print('---------- Dividing line ----------')
 
pdf.close()

Output:>>> table
[['00', '', '02', '', '04'], ['10', '', '12', '', '14'], ['20', '', '22', '', '24'], ['30', '', '32', '', '34']]
>>>

Then you can put table in pandas, I hope!

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	How to properly format rows and columns in excel data from parsed .txt blocks	jh67	7	1,866	Dec-12-2022, 08:22 PM Last Post: jh67
	How to keep columns header on excel without change after export data to excel file?	ahmedbarbary	0	1,150	May-03-2022, 05:46 PM Last Post: ahmedbarbary
	Extracting Data from tables	DataExtrator	0	1,134	Nov-02-2021, 12:24 PM Last Post: DataExtrator
	Merging spreadsheets with the same columns and extracting rows with matching entries	johnbernard	3	9,276	Aug-19-2021, 03:08 PM Last Post: johnbernard
	extracting data	ajitnayak1987	1	1,529	Jul-29-2021, 06:13 AM Last Post: bowlofred
	Extracting and printing data	ajitnayak1987	0	1,406	Jul-28-2021, 09:30 AM Last Post: ajitnayak1987
	Python Pandas: How do I average ONLY the data >1000 from several columns?	JaneTan	0	1,465	Jul-17-2021, 01:34 PM Last Post: JaneTan
	SaltStack: MySQL returner save less data into Database table columns	xtc14	2	2,146	Jul-02-2021, 02:19 PM Last Post: xtc14
	[Solved] Using readlines to read data file and sum columns	Laplace12	4	3,517	Jun-16-2021, 12:46 PM Last Post: Laplace12
	Pandas: how to split one row of data to multiple rows and columns in Python	GerardMoussendo	4	6,784	Feb-22-2021, 06:51 PM Last Post: eddywinch82

Extracting Data into Columns using pdfplumber

User Panel Messages

Announcements