Dec-17-2022, 08:10 AM
The old admonition "read the docs" was never truer!! This was interesting for me though!
This should do it. I used your pdf. I only cropped out 6 lines for testing, so you should change the vertical y1 in bounding_box to cover more of the table.
Explicitly state the vertical edges of the cells in explicit_vertical_lines; think of each one as an imaginary line running down the page.
I would load and crop the page in a loop, changing the coords until you have what you want. You will know you have it when you call im.show() and look at the image.
Make a function with a while loop that returns bounding_box.
Exit the loop when you are happy with the resulting image.
import pdfplumber

path2pdf = '/home/pedro/myPython/pdfplumber/pdfs/'
my_pdf = 'sample.pdf'
bounding_box = (10, 350, 800, 460)
pdf1 = pdfplumber.open(path2pdf + my_pdf)
page = pdf1.pages[0]
page.width
page.height
cropped_page = page.crop(bounding_box)
im = pdf1.pages[0].to_image(resolution=150)
im = cropped_page.to_image(resolution=150)
im.show()
im.save(path2pdf + "test1.png", format="PNG")
# for a table without borders: "vertical_strategy": "text", "horizontal_strategy": "text"
pdf_table = cropped_page.extract_tables(table_settings={
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [5, 65, 350, 450, 550, 700],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 3,
    "keep_blank_chars": True,
    "text_tolerance": 5,
    "text_x_tolerance": 5,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
})
type(pdf_table)  # list
for l in pdf_table[0]:
    print('length of row is', len(l))
    print(l)

The docs recommend cropping the image. That will save you trouble later.
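The crop-and-check loop described above could be sketched roughly like this. It is only a sketch: choose_bounding_box, preview and ask are names I made up for illustration, not part of pdfplumber. The function takes the "show me the crop" step as a callable so it can be tried without a pdf.

```python
def choose_bounding_box(preview, start_box, ask=input):
    """Show a crop, ask for new coordinates, repeat until accepted.

    preview: callable taking an (x0, top, x1, bottom) tuple and displaying the crop.
    ask: callable taking a prompt and returning the typed answer (input by default).
    Both names are hypothetical helpers, not pdfplumber API.
    """
    box = tuple(start_box)
    while True:
        preview(box)
        answer = ask(f"Current box {box}. Press Enter to keep it, "
                     "or type x0,top,x1,bottom: ")
        if not answer.strip():
            return box  # happy with the image: exit the loop
        try:
            candidate = tuple(float(v) for v in answer.split(","))
        except ValueError:
            print("Please type four numbers separated by commas.")
            continue
        if (len(candidate) != 4
                or candidate[0] >= candidate[2]
                or candidate[1] >= candidate[3]):
            print("Need four numbers with x0 < x1 and top < bottom.")
            continue
        box = candidate


# Possible use with the page object from the code above:
# bounding_box = choose_bounding_box(
#     lambda box: page.crop(box).to_image(resolution=150).show(),
#     (10, 350, 800, 460),
# )
```

Pressing Enter on an empty line keeps the current box, so the loop ends as soon as the image looks right.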
I do a similar thing with my multiple-choice answer forms, because I need to go from top to bottom in columns. In a loop, I crop, check the image, and crop again until it is how I need it, then save the bounding box coordinates for each column. If you only wanted 1 column, you could do that easily!
It doesn't take long, and once saved, you can use the same coords each time for the same answer form format or, in your case, the same pdf format.
You should have 6 columns in each row.
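One way to keep the coordinates between runs is a small JSON file. This is a minimal sketch using only the standard library; save_boxes, load_boxes, the file name and the "statement_table" key are all made up for illustration.

```python
import json
from pathlib import Path


def save_boxes(boxes, path):
    """Write a {name: (x0, top, x1, bottom)} mapping to a JSON file."""
    Path(path).write_text(json.dumps(boxes, indent=2))


def load_boxes(path):
    """Read the mapping back, turning the JSON lists into tuples again."""
    raw = json.loads(Path(path).read_text())
    return {name: tuple(coords) for name, coords in raw.items()}


# Possible use (file name and key are hypothetical):
# save_boxes({"statement_table": (10, 350, 800, 460)}, "boxes.json")
# bounding_box = load_boxes("boxes.json")["statement_table"]
```

JSON has no tuple type, so load_boxes converts the stored lists back to tuples before they are passed on to crop.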
Output:
>>> for l in pdf_table[0]:
...     print('length of row is', len(l))
...     print(l)
...
length of row is 6
['05/04/2021', 'IMPS/P2A/109407241841/XXXXXXXXXX2155/sector103', '', '32,820.00', '', '39,65,685.65Cr']
length of row is 6
['03/04/2021', 'BY INST 420011 : MICR CLG (CTS)', '', '', '1,50,000.00', '39,98,505.65Cr']
length of row is 6
['31/03/2021', 'BY INST 153187 : MICR CLG (CTS)', '', '', '4,00,000.00', '38,48,505.65Cr']
length of row is 6
['29/03/2021', 'BY INST 14543 : MICR CLG (CTS)', '', '', '4,50,000.00', '34,48,505.65Cr']
length of row is 6
['29/03/2021', 'BY INST 817608 : MICR CLG (CTS)', '', '', '9,25,000.00', '29,98,505.65Cr']
length of row is 6
['29/03/2021', 'BY INST 751569 : MICR CLG (CTS)', '', '', '1,50,000.00', '20,73,505.65Cr']
>>>
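If you run this over a whole statement, a quick check that every row really has 6 cells might save some eyeballing. A small sketch: rows_with_wrong_length is a made-up helper, the first two rows are copied from the output above, and the third is deliberately cut short to show what a failure looks like.

```python
def rows_with_wrong_length(rows, expected=6):
    """Return (index, row) pairs whose number of cells differs from expected."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


rows = [
    ['05/04/2021', 'IMPS/P2A/109407241841/XXXXXXXXXX2155/sector103', '',
     '32,820.00', '', '39,65,685.65Cr'],
    ['03/04/2021', 'BY INST 420011 : MICR CLG (CTS)', '', '',
     '1,50,000.00', '39,98,505.65Cr'],
    # Deliberately truncated row, not real extraction output.
    ['31/03/2021', 'BY INST 153187 : MICR CLG (CTS)'],
]

for index, row in rows_with_wrong_length(rows):
    print('row', index, 'has', len(row), 'cells:', row)
```

An empty result means the vertical edges lined up for every row; anything else points at the rows to look at when adjusting explicit_vertical_lines.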