Python Forum
Extracting text from PDFs
#1
I'm a complete beginner trying to use Python to extract specific information from a multi-page PDF and organize it into a table that can be exported in CSV format. After some testing, I believe pdfplumber may be the best option, but I can't find any documentation explaining what it can do. Does anyone have a better suggestion, or know where to find the documentation for this module?

Thanks,
Paulo
#2
Actually, the documentation for pdfplumber is better than average: https://pypi.org/project/pdfplumber/

There are lots of other packages, see: https://pypi.org/project/pdfplumber/

Success depends on how your PDF was constructed. Since a PDF can contain images of text, or even handwriting, it can be extremely difficult to get any reasonable output without expensive OCR software, and even then the results can be sketchy, to say the least.
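
A quick way to find out which case you are in is to check whether each page yields any text at all. Here is a minimal sketch, assuming pdfplumber is installed; `has_text_layer` and `pages_without_text` are hypothetical helper names made up for this example, and "report.pdf" is just a placeholder file name:

```python
def has_text_layer(text):
    """Return True if the extracted page text has any non-whitespace characters."""
    return bool(text and text.strip())


def pages_without_text(pdf_path):
    """Return the 1-based numbers of pages that yield no extractable text.

    Such pages are probably scanned images and would need OCR.
    """
    import pdfplumber  # third-party: pip install pdfplumber

    missing = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # extract_text() gives back nothing useful for image-only pages
            if not has_text_layer(page.extract_text()):
                missing.append(number)
    return missing


# Example call (placeholder file name):
# print(pages_without_text("report.pdf"))
```

If the list comes back empty, every page has a text layer and ordinary extraction should at least return something to work with.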

On the other hand, if text is organized into well laid out tables you can get very good results.
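
For that well-laid-out case, something along these lines may be enough to get from PDF to CSV. It is only a sketch, assuming pdfplumber is installed and its default table detection suits your file; `iter_table_rows` and `rows_to_csv` are hypothetical helper names, and the file names are placeholders:

```python
import csv
import io


def iter_table_rows(pdf_path):
    """Yield every row of every table found on every page of the PDF."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables; each table is a list of rows
            for table in page.extract_tables():
                yield from table


def rows_to_csv(rows):
    """Serialize rows to CSV text, writing empty (None) cells as empty strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Example call (placeholder file names):
# with open("output.csv", "w", newline="") as f:
#     f.write(rows_to_csv(iter_table_rows("report.pdf")))
```

Keeping the CSV step separate from the PDF step means you can filter or rearrange the rows in between, which is usually where the "specific information" part ends up happening.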

Some popular packages for this:
Camelot: https://camelot-py.readthedocs.io/en/master/
Excalibur (web wrapper for Camelot): https://github.com/camelot-dev/excalibur
pdfminer.six: https://pypi.org/project/pdfminer.six/
Tabula: https://tabula.technology/
#3
Thanks Larz60+!