Extracting Data into Columns using pdfplumber

arvin · Dec-15-2022, 09:41 AM

I appreciate your efforts, but I'm still unable to turn it into a dataframe. Additionally, I've noticed that some values in the narration column appear in multiple lines.

Pedroski55 · Dec-15-2022, 12:15 PM

pdfs are complicated things!

Maybe open it with reportlab and save with reportlab??

I can use table to make a dataframe.

Quote:>>> import pandas as pd
>>> df = pd.DataFrame(table)
>>> print(df)
0 1 2 3 4
0 00 02 04
1 10 12 14
2 20 22 24
3 30 32 34
>>>

5 columns, columns 1 and 3 empty.

Pedroski55 · Dec-15-2022, 11:03 PM

Can you post 1 page of your pdf with a table? Or the whole thing if it's not too big?

Pedroski55 · (This post was last modified: Dec-16-2022, 07:16 AM by Pedroski55.)

Look here.

But it is still not quite right, some other formatting need to be set.

pdf_table = page.extract_tables(table_settings={
    "vertical_strategy": "text", 
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": True,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
})

Definitely needs tweaking, but set  "min_words_vertical": 3 to 1 gets very close.

Pedroski55 · Dec-16-2022, 07:27 AM

Quote:>>> pdf_table
[[['00', '', '02', '', '04'], ['', '', '', '', ''], ['', '11', '', '13', ''], ['', '', '', '', ''], ['20', '', '22', '23', '24'], ['', '', '', '', ''], ['30', '31', '32', '', '34']]]
>>> cells
[['00', '', '02', '', '04'], ['', '', '', '', ''], ['', '11', '', '13', ''], ['', '', '', '', ''], ['20', '', '22', '23', '24'], ['', '', '', '', ''], ['30', '31', '32', '', '34']]
>>> for p in range(0, len(cells), 2):
print(cells[p])

['00', '', '02', '', '04']
['', '11', '', '13', '']
['20', '', '22', '23', '24']
['30', '31', '32', '', '34']

The above represents the table without borders, including the empty cells.
>>>

arvin · Dec-16-2022, 10:22 AM

Here it is - https://we.tl/t-RuSjV2pXyh

Pedroski55 · Dec-17-2022, 08:10 AM

The old admonition "read the docs" was never truer!! This was interesting for me though!

This should do it. I used your pdf. I only cropped out 6 lines for testing, you should change the vertical y1.

Explicitly state the vertical edges of the cells, like an imaginary line running through the page.

I would load and crop the page in a loop, changing the coords until you have what you want. You know that when you call im.show().

Make a while loop function to return bounding_box.

Exit the loop when you are happy with the resulting image.

import pdfplumber

path2pdf = '/home/pedro/myPython/pdfplumber/pdfs/'
my_pdf = 'sample.pdf'
bounding_box = (10, 350, 800, 460)
pdf1 = pdfplumber.open(path2pdf + my_pdf)
page = pdf1.pages[0]
page.width
page.height
cropped_page = page.crop(bounding_box)
im = pdf1.pages[0].to_image(resolution=150)
im = cropped_page.to_image(resolution=150)
im.show()
im.save(path2pdf + "test1.png", format="PNG")
# for a table without borders vertical_strategy": "text" "horizontal_strategy": "text"
pdf_table = cropped_page.extract_tables(table_settings={
    "vertical_strategy": "lines_strict", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [5, 65, 350, 450, 550, 700],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 3,
    "keep_blank_chars": True,
    "text_tolerance": 5,
    "text_x_tolerance": 5,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
})
type(pdf_table) # list

for l in pdf_table[0]:
    print('length of row is', len(l))
    print(l)

The docs recommend cropping the image. That will save you trouble later.

I do a similar thing with my multichoice questions answer forms, because I need to go from top to bottom in columns, so, in a loop, I crop, check the image, crop again until it is how I need it, then save the bounding box coordinates for each column. If you only wanted 1 column, you could do that easily!

Doesn't take long, and once saved, you can use the same coords each time for the same answer form format. Or in your case, the same pdf format, once you know the coordinates.

You should have 6 columns each row.

Output:>>> for l in pdf_table[0]:
    print('length of row is', len(l))
    print(l)

    
length of row is 6
['05/04/2021', 'IMPS/P2A/109407241841/XXXXXXXXXX2155/sector103', '', '32,820.00', '', '39,65,685.65Cr']
length of row is 6
['03/04/2021', 'BY INST 420011 : MICR CLG (CTS)', '', '', '1,50,000.00', '39,98,505.65Cr']
length of row is 6
['31/03/2021', 'BY INST 153187 : MICR CLG (CTS)', '', '', '4,00,000.00', '38,48,505.65Cr']
length of row is 6
['29/03/2021', 'BY INST 14543 : MICR CLG (CTS)', '', '', '4,50,000.00', '34,48,505.65Cr']
length of row is 6
['29/03/2021', 'BY INST 817608 : MICR CLG (CTS)', '', '', '9,25,000.00', '29,98,505.65Cr']
length of row is 6
['29/03/2021', 'BY INST 751569 : MICR CLG (CTS)', '', '', '1,50,000.00', '20,73,505.65Cr']
>>>

arvin · Dec-17-2022, 11:59 AM

Oh yes! ...many thanks for your assistance. But the multiple line problem is still there. Sad

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	How to properly format rows and columns in excel data from parsed .txt blocks	jh67	7	1,872	Dec-12-2022, 08:22 PM Last Post: jh67
	How to keep columns header on excel without change after export data to excel file?	ahmedbarbary	0	1,159	May-03-2022, 05:46 PM Last Post: ahmedbarbary
	Extracting Data from tables	DataExtrator	0	1,137	Nov-02-2021, 12:24 PM Last Post: DataExtrator
	Merging spreadsheets with the same columns and extracting rows with matching entries	johnbernard	3	9,411	Aug-19-2021, 03:08 PM Last Post: johnbernard
	extracting data	ajitnayak1987	1	1,532	Jul-29-2021, 06:13 AM Last Post: bowlofred
	Extracting and printing data	ajitnayak1987	0	1,408	Jul-28-2021, 09:30 AM Last Post: ajitnayak1987
	Python Pandas: How do I average ONLY the data >1000 from several columns?	JaneTan	0	1,467	Jul-17-2021, 01:34 PM Last Post: JaneTan
	SaltStack: MySQL returner save less data into Database table columns	xtc14	2	2,152	Jul-02-2021, 02:19 PM Last Post: xtc14
	[Solved] Using readlines to read data file and sum columns	Laplace12	4	3,529	Jun-16-2021, 12:46 PM Last Post: Laplace12
	Pandas: how to split one row of data to multiple rows and columns in Python	GerardMoussendo	4	6,801	Feb-22-2021, 06:51 PM Last Post: eddywinch82

Extracting Data into Columns using pdfplumber

User Panel Messages

Announcements