Extracting text from PDFs

**Larz60+** · Aug-18-2020, 04:03 PM

Actually, the documentation for pdfplumber is better than average: https://pypi.org/project/pdfplumber/

There are lots of other packages too (see the list below).

Success depends on how your PDF was constructed. Since a PDF can contain images of text, or even handwriting, it can be extremely difficult to get any reasonable output without expensive OCR software, and even then the results can be sketchy, to say the least.
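
A quick way to tell which case you are in is to check whether each page has a text layer at all. Here is a minimal sketch, assuming pdfplumber is installed. The helper names `page_texts` and `pages_without_text` are made up for illustration, and a blank result is only a hint that the page is a scan, not proof.

```python
def page_texts(path):
    """Return the extracted text of every page (empty string if none)."""
    # Third-party package (pip install pdfplumber); imported here so the
    # helper below can be used without it.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() can give None or "" when a page has no text layer
        return [page.extract_text() or "" for page in pdf.pages]


def pages_without_text(texts):
    """Return the 1-based numbers of pages whose extracted text is blank."""
    return [number for number, text in enumerate(texts, start=1)
            if not text.strip()]
```

Calling `pages_without_text(page_texts("example.pdf"))` (the file name is a placeholder) lists the pages that probably need OCR rather than plain text extraction.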

On the other hand, if the text is organized into well laid out tables, you can get very good results.
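
For the table case, something along these lines might be a starting point. It is only a sketch, assuming pdfplumber is installed and that the first row of each table is a header row (which will not always be true). `all_tables` and `table_to_records` are hypothetical helper names.

```python
def all_tables(path):
    """Collect every table pdfplumber finds, page by page."""
    import pdfplumber  # third-party: pip install pdfplumber

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Each table is a list of rows; each row is a list of cell
            # strings (None for empty cells).
            tables.extend(page.extract_tables())
    return tables


def table_to_records(table):
    """Turn one table into a list of dicts, assuming row 0 is the header."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [dict(zip(header, row)) for row in table[1:]]
```

With a real file you would loop over `all_tables("example.pdf")` (placeholder name) and pass each table to `table_to_records`. How clean the rows come out depends heavily on how the tables were drawn in the PDF.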

Some popular packages for this:
Camelot: https://camelot-py.readthedocs.io/en/master/
Excalibur (web wrapper for Camelot): https://github.com/camelot-dev/excalibur
pdfminer.six: https://pypi.org/project/pdfminer.six/
Tabula: https://tabula.technology/