Posts: 22
Threads: 5
Joined: Dec 2022
I have a pdf which has data in tabular format and has 6 columns but the columns are not separated by boundaries so when I extract the data using pdfplumber, all the data comes in one cell only and I want in separate cells.
How could I do that?
For your reference:
Quote:
15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr
11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr
11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr
Thanks in advance
Posts: 828
Threads: 130
Joined: Jul 2017
Looks like each line of your data is separated by a space.
If that is so, just use .split() on each line to get a list of each item of data in the line.
The list can then be written to Excel or, I remember Menator mentioned a module called tabulate which formats text data nicely!
Posts: 22
Threads: 5
Joined: Dec 2022
I thought of the same thing, but some lines contain more data and some lines contain less.
Posts: 724
Threads: 114
Joined: Dec 2017
Which method do you use to extract the data from a page?
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Posts: 22
Threads: 5
Joined: Dec 2022
Quote:table = pdf.pages[0].extract_table()
df = pd.DataFrame(table[1:])
Posts: 828
Threads: 130
Joined: Jul 2017
What exactly is table? Actual text? Bytes? Or some kind of pdfplumber object?
Are column headers included in table?
Posts: 22
Threads: 5
Joined: Dec 2022
Posts: 828
Threads: 130
Joined: Jul 2017
I'm starting to understand your problem!
Using reportlab I made a very simple pdf with 1 table of 4 rows and 5 columns. The table had borders.
The data for the table looks like this:
Quote:data= [['00', '', '02', '', '04'],
['10', '', '12', '', '14'],
['20', '', '22', '', '24'],
['30', '', '32', '', '34']]
As you can see, the columns 2 and 4 contain no data, just ''
Using this, I got the data from the table:
import pdfplumber
# Parse pdf
with pdfplumber.open(path2pdf + savename2) as pdf:
# Get the first page of the object
page = pdf.pages[0]
# Get the text data of the page
text = page.extract_text()
# Get all the tabular data of this page
tables = page.extract_tables()
# Traversing table
for t_index in range(len(tables)):
table = tables[t_index]
# Traversing each row of data
for data in table:
print(data) Which gives this as output, you can see the empty values, so they should be in your pandas df as NaN, I suppose.
Output: ['00', '', '02', '', '04']
['10', '', '12', '', '14']
['20', '', '22', '', '24']
['30', '', '32', '', '34']
But when I make a table with no borders, there is no table, just text:
Quote:>>> tables
[]
>>> text
'00 02 04\n11 13\n20 22 23 24\n30 31 32 34'
>>>
Now how to know which columns in which rows have nothing in them??
Think think!
Posts: 22
Threads: 5
Joined: Dec 2022
I also tried Tabula, but I had no luck.
Posts: 828
Threads: 130
Joined: Jul 2017
Try this then look at table!
import pdfplumber
import re
path2pdf = '/home/pedro/myPython/reportlab_python/my_pdfs/'
savename1 = "simple_table1.pdf" # no borders
pdf = pdfplumber.open(path2pdf + savename1)
for page in pdf.pages:
print(page.extract_text())
for pdf_table in page.extract_tables():
table = []
cells = []
for row in pdf_table:
if not any(row):
# If a line is all empty, it is considered to be the end of a record
if any(cells):
table.append(cells)
cells = []
elif all(row):
# If the line is not empty, then this line is a new line, the last one ends
if any(cells):
table.append(cells)
cells = []
table.append(row)
else:
if len(cells) == 0:
cells = row
else:
for i in range(len(row)):
if row[i] is not None:
cells[i] = row[i] if cells[i] is None else cells[i] + row[i]
for row in table:
print([re.sub(' +', ' ', cell) if cell is not None else None for cell in row])
#print([re.sub(' +', ' ', cell) if cell is not None else ' ' for cell in row])
print('---------- Dividing line ----------')
pdf.close() Output: >>> table
[['00', '', '02', '', '04'], ['10', '', '12', '', '14'], ['20', '', '22', '', '24'], ['30', '', '32', '', '34']]
>>>
Then you can put table in pandas, I hope!
|