Python Forum
Extract data from PDF page to Excel




Extract data from PDF page to Excel - nathan_nz - Oct-29-2020

Hi everyone, I am very new to coding and want to write some Python code to extract data from a PDF file and transfer it into an Excel sheet. This would allow easier filtering and analysis, as the reports can be up to 100 pages long and are received monthly. Every page except the first follows the same format (the first page can be ignored). The image below highlights the sections of the page I’d like to extract into individual columns. In some cases, Recommendation is left blank or no image is provided. I understand I could use while loops in some places here, but I have no idea how to structure them or which other functions to use.



In terms of functionality, I was thinking it’d open a macro-enabled template and run the macro, which would let me select the appropriate PDF file and extract the data from it.



Also, it’d be awesome to have the image appear in a cell comment on mouseover, if anyone has a suggestion on how to do that.



Survey Date:

Type:

Area:

Priority: (Coloured number at top right corner) Can be N, 0, 1, 2 or 3

Machine:

Assembly:

Detail:

Recommendation:

Image:

I wonder if I can send a sample of the image through PM, as I can't currently attach it to this thread.

Thanks!


RE: Extract data from PDF page to Excel - Larz60+ - Oct-29-2020

There are many modules that aid in PDF data extraction.
Because PDF is something of a chameleon when it comes to internal contents, it is often a bear to extract intelligible data from one. Sometimes you luck out (usually when the data is presented in table format), and sometimes conversion is just impossible (if the data is a very poor image of a text document, for example).
At any rate, I've had some success with:

camelot-py (which wraps around pdfminer): https://pypi.org/project/camelot-py/

pdfminer.six: https://github.com/pdfminer/pdfminer.six

There are a ton of others; if you don't have success with the above, look here: https://pypi.org/search/?q=PDF&o=
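
To give a rough idea of how pdfminer.six could fit the layout described in the first post, here is a minimal sketch rather than a finished solution. It assumes each field shows up in the extracted text as "Label: value" at the start of a line, which may not hold for the real reports. The helper names (parse_page, pdf_to_rows, write_xlsx) are made up for illustration, and openpyxl is my own choice for writing the Excel file; it was not mentioned above. Priority (a coloured number, not labelled text) and the image are not handled here and would likely need separate treatment.

```python
import re

# Labels taken from the first post. Whether they appear exactly like this
# in the extracted text is an assumption to check against a real report.
FIELDS = ["Survey Date", "Type", "Area", "Machine",
          "Assembly", "Detail", "Recommendation"]

# Matches any known label at the start of a line, followed by a colon.
LABEL_RE = re.compile(
    r"^(%s):[ \t]*" % "|".join(re.escape(f) for f in FIELDS),
    re.MULTILINE,
)


def parse_page(text):
    """Return a dict of field -> value for one page of extracted text.

    A value runs from the end of its label to the start of the next label
    (or the end of the page). Missing or blank fields stay as "".
    """
    record = {field: "" for field in FIELDS}
    matches = list(LABEL_RE.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = text[match.end():end]
        # Collapse line breaks and repeated spaces into single spaces.
        record[match.group(1)] = " ".join(value.split())
    return record


def pdf_to_rows(pdf_path):
    """Extract text with pdfminer.six and parse every page except the first."""
    from pdfminer.high_level import extract_text  # third-party: pdfminer.six

    # extract_text() returns one string with pages separated by form feeds.
    pages = extract_text(pdf_path).split("\f")
    return [parse_page(page) for page in pages[1:] if page.strip()]


def write_xlsx(rows, out_path):
    """Write one header row plus one row per page using openpyxl."""
    from openpyxl import Workbook  # third-party: openpyxl

    workbook = Workbook()
    sheet = workbook.active
    sheet.append(FIELDS)
    for row in rows:
        sheet.append([row[field] for field in FIELDS])
    workbook.save(out_path)
```

Usage would be something like write_xlsx(pdf_to_rows("report.pdf"), "report.xlsx"), where both file names are placeholders. It is worth printing the raw output of extract_text() for one page first, because the order and line breaks of the extracted text decide whether a simple label match like this is enough.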