Python Forum
Extracting text from PDFs
#2
Actually, the documentation for pdfplumber is better than average: https://pypi.org/project/pdfplumber/

There are lots of other packages as well; see: https://pypi.org/project/pdfplumber/

Success depends on how your PDF was constructed. Since a PDF can contain images of text, or even handwriting, it can be extremely difficult to get any reasonable output without expensive OCR software, and even then the results can be sketchy, to say the least.
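A quick way to find out which kind of PDF you have is to try plain text extraction and see which pages come back empty. Here is a rough sketch using pdfplumber. The function names and the `min_chars` threshold are my own choices for illustration, not part of the library, and a page that comes back nearly empty is only a hint that it may be a scanned image.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of each page as a list of strings."""
    # Third-party package (pip install pdfplumber); imported here so the
    # helper below can be used without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may give None or "" for a page with no text layer
        return [page.extract_text() or "" for page in pdf.pages]


def pages_without_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose text is shorter than min_chars.

    min_chars is an arbitrary cutoff (an assumption); tune it for your files.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


# Example use (the file name is hypothetical):
# texts = extract_page_texts("report.pdf")
# print("Pages that probably need OCR:", pages_without_text(texts))
```

If most pages end up in that list, the PDF is probably made of images and you are in OCR territory.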

On the other hand, if the text is organized into well-laid-out tables, you can get very good results.

Some popular packages for this:
Camelot: https://camelot-py.readthedocs.io/en/master/
Excalibur (web wrapper for Camelot): https://github.com/camelot-dev/excalibur
pdfminer.six: https://pypi.org/project/pdfminer.six/
Tabula: https://tabula.technology/
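pdfplumber can also pull tables directly, so it may be worth trying before installing anything else. A minimal sketch follows, assuming the first row of each table is a header row (that is an assumption about your data, not something pdfplumber knows). The helper names are hypothetical.

```python
def extract_all_tables(pdf_path):
    """Collect every table pdfplumber finds, across all pages."""
    # Third-party package (pip install pdfplumber); imported here so the
    # helper below can be used without it.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Each table is a list of rows; each row is a list of cells
            # (strings, or None for empty cells)
            tables.extend(page.extract_tables())
    return tables


def table_to_records(table):
    """Turn one table into a list of dicts keyed by the first row.

    Assumes the first row holds column names; empty cells become "".
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [
        {name: (cell or "").strip() for name, cell in zip(header, row)}
        for row in table[1:]
    ]


# Example use (the file name is hypothetical):
# for table in extract_all_tables("invoice.pdf"):
#     print(table_to_records(table))
```

How well this works depends on the layout: tables with ruled lines tend to be detected more reliably than ones laid out only with whitespace.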


Messages In This Thread
Extracting text from PDFs - by pprod - Aug-18-2020, 08:34 AM
RE: Extracting text from PDFs - by Larz60+ - Aug-18-2020, 04:03 PM
RE: Extracting text from PDFs - by pprod - Aug-18-2020, 04:33 PM
