Extracting text from PDFs

pprod · Aug-18-2020, 08:34 AM

I'm a complete beginner trying to use Python to extract specific information from a multi-page PDF and organize it into a table that can be exported as CSV. After some testing, I believe pdfplumber may be the best option, but I can't find any documentation explaining what it can do. Does anyone have a better suggestion, or know where to find the documentation for this module?

Thanks,
Paulo

**Larz60+** · Aug-18-2020, 04:03 PM

Actually, the documentation for pdfplumber is better than average: https://pypi.org/project/pdfplumber/

There are lots of other packages too, see: https://pypi.org/project/pdfplumber/

Success depends on how your PDF was constructed. Since a PDF can contain images of text, or even handwriting, it can be extremely difficult to get any reasonable output without expensive OCR software, and even then the results can be sketchy, to say the least.

On the other hand, if the text is organized into well-laid-out tables, you can get very good results.
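A quick way to tell which case you are in is to check whether each page has an extractable text layer at all. This is only a rough sketch: the function names are made up for illustration, and it assumes pdfplumber's `page.extract_text()` gives back an empty value (`None` or an empty string, depending on the version) for pages that are just scanned images.

```python
def find_pages_without_text(page_texts):
    """Return 1-based page numbers whose extracted text is empty or missing."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def read_page_texts(pdf_path):
    """Extract the raw text of every page (hypothetical helper)."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

Calling `find_pages_without_text(read_page_texts("report.pdf"))`, where `report.pdf` is a placeholder name, lists the pages that would probably need OCR.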

Some popular packages for this:
Camelot: https://camelot-py.readthedocs.io/en/master/
Excalibur (web wrapper for Camelot): https://github.com/camelot-dev/excalibur
pdfminer.six: https://pypi.org/project/pdfminer.six/
Tabula: https://tabula.technology/
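For the original goal (tables from a multi-page PDF into a CSV file), something along these lines could serve as a starting point with pdfplumber. Treat it as a minimal sketch, not a finished solution: the helper names and file names are invented for illustration, and it assumes the PDF has real text laid out in tables that `page.extract_tables()` can detect with its default settings.

```python
import csv
import io


def clean_rows(tables):
    """Flatten a list of tables into one list of rows, turning None cells into ''."""
    rows = []
    for table in tables:
        for row in table:
            rows.append(["" if cell is None else str(cell).strip() for cell in row])
    return rows


def rows_to_csv(rows):
    """Render rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def extract_tables(pdf_path):
    """Collect every table pdfplumber finds, page by page."""
    import pdfplumber  # third-party: pip install pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Each table is a list of rows; each row is a list of cell strings or None.
            tables.extend(page.extract_tables())
    return tables


def pdf_to_csv(pdf_path, csv_path):
    """Write all tables from pdf_path into a single CSV file at csv_path."""
    rows = clean_rows(extract_tables(pdf_path))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        f.write(rows_to_csv(rows))
```

You would call it as `pdf_to_csv("report.pdf", "report.csv")`, with both names being placeholders. If the tables span pages and repeat their header row, you would still need to filter the duplicates out yourself.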

pprod · Aug-18-2020, 04:33 PM

Thanks Larz60+!
