Yes, I've also found some confirmation of this statement on the web.
The workaround I'm exploring is as follows:
- extract a page from the .pdf
- convert it to text using pdftotext
- finally, read the page text for processing
I've already tried pdftotext on a "difficult" .pdf file in the Linux terminal.
It works fine.
The problem is using pdftotext from Python.
I tried to install the pdftotext package with this command:
sudo pip3 install pdftotext
But the installation failed.
Any suggestions?
Here is the output:
Collecting pdftotext
Downloading https://files.pythonhosted.org/packages/a6/a7/c202adb0bcd3adc3030b0c5f7f0e21f62a721913e93296e6c4ddc305cbd3/pdftotext-2.1.2.tar.gz (113kB)
100% |████████████████████████████████| 122kB 1.6MB/s
Installing collected packages: pdftotext
Running setup.py install for pdftotext ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_ext
building 'pdftotext' extension
creating build
creating build/temp.linux-x86_64-3.6
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPOPPLER_CPP_AT_LEAST_0_30_0=0 -I/usr/include/python3.6m -c pdftotext.cpp -o build/temp.linux-x86_64-3.6/pdftotext.o -Wall
pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
#include <poppler/cpp/poppler-document.h>
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-89f6lnjo/pdftotext/
The build failed because the header poppler/cpp/poppler-document.h was missing. It ships with the libpoppler-cpp-dev package, and installing that package resolved the problem:
sudo apt-get install libpoppler-cpp-dev
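As an aside, when compiling the Python binding is not an option, the command-line pdftotext that already worked in the terminal can be called from Python through subprocess. The following is only a minimal sketch, assuming the pdftotext binary (from poppler-utils) is on the PATH; the function names build_pdftotext_command and page_text_via_cli are hypothetical and not part of any library.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, page_number):
    """Build the argument list that extracts one page (1-based) to stdout."""
    page = str(page_number)
    # -f / -l select the first and last page; "-" sends the text to stdout.
    return ["pdftotext", "-f", page, "-l", page, pdf_path, "-"]


def page_text_via_cli(pdf_path, page_number):
    """Run the pdftotext binary and return the text of one page as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext binary not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, page_number),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Note that the command-line tool counts pages from 1, so page index 2 in PyPDF2 corresponds to page_number=3 here. This route also skips the separate page-extraction step, since the binary selects the page itself.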
With the Python package installed, the workaround described above works fine:
import os
import pdftotext
from PyPDF2 import PdfFileReader, PdfFileWriter

path = '/media/aaa/WD/pdf/'
file_name_in = os.path.join(path, 'name_of_book.pdf')
file_name_out = os.path.join(path, 'extracted_page.pdf')

# Extract a single page and save it as its own .pdf file
pdfFileObj_in = open(file_name_in, 'rb')
pdfReader = PdfFileReader(pdfFileObj_in)
page_number = 2  # zero-based index, i.e. the third page
extracted_page = pdfReader.getPage(page_number)
pdfWriter = PdfFileWriter()
pdfWriter.addPage(extracted_page)
pdfFileObj_out = open(file_name_out, 'wb')
pdfWriter.write(pdfFileObj_out)
pdfFileObj_in.close()
pdfFileObj_out.close()

# Read the saved page back and convert it to text
with open(file_name_out, 'rb') as f:
    pdf_text = pdftotext.PDF(f)
for page in pdf_text:
    print(page)

Nevertheless, it would be interesting to know how to avoid saving the extracted page to the hard disk and then reading it back, and instead pass the extracted page object directly to pdftotext. The direct approach, that is, pdftotext(extracted_page), unfortunately does not work.
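One possible way to skip the temporary file is to write the extracted page into an in-memory buffer and hand that buffer to pdftotext.PDF. The sketch below rests on an assumption: that pdftotext.PDF only needs an object with a binary read() method, so an io.BytesIO can stand in for the opened file. It uses the same PyPDF2 names as the code above (newer PyPDF2 releases rename them), and writer_to_buffer and extract_page_text are hypothetical helper names.

```python
import io


def writer_to_buffer(writer):
    """Serialize a PDF writer into an in-memory buffer, rewound to the start."""
    buffer = io.BytesIO()
    writer.write(buffer)  # the writer only needs a writable binary stream
    buffer.seek(0)        # rewind so the next reader starts at byte 0
    return buffer


def extract_page_text(pdf_path, page_number):
    """Return the text of one page (zero-based index) without a temporary file."""
    # Third-party imports are kept local so writer_to_buffer works on its own.
    import pdftotext
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(pdf_path, "rb") as f_in:
        reader = PdfFileReader(f_in)
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(page_number))
        # Write while the input file is still open: PyPDF2 reads pages lazily.
        buffer = writer_to_buffer(writer)

    # Assumption: pdftotext.PDF accepts any object with a binary read() method.
    pdf = pdftotext.PDF(buffer)
    return pdf[0]  # the buffer holds exactly one page
```

Under those assumptions, print(extract_page_text(file_name_in, 2)) would replace both the write to extracted_page.pdf and the second open() call.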