Yes, I've also found some confirmation of this statement on the web.
The workaround I'm trying is as follows:
- extract a page from the .pdf
- convert it to text using pdftotext
- finally, read the text of the page for processing
I've already tried pdftotext with a "difficult" .pdf file in the Linux terminal. It works fine.
The problem is how to use pdftotext in Python.
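Since the terminal tool already works, one option that needs no Python package is to call it through subprocess. This is only a sketch: it assumes the pdftotext binary is on the PATH, the function names are made up for illustration, and it relies on the command-line options -f and -l (first and last page, counted from 1) with "-" as the output file to write to stdout.

```python
import subprocess


def build_pdftotext_command(pdf_path, page):
    """Build the argument list for extracting one page (1-based) to stdout."""
    return ['pdftotext', '-f', str(page), '-l', str(page), pdf_path, '-']


def page_text_via_cli(pdf_path, page):
    """Run the pdftotext binary and return the text of one page as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, page),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    # Assumption: the tool emits UTF-8, which is its default output encoding.
    return result.stdout.decode('utf-8')
```

For example, page_text_via_cli('/media/aaa/WD/pdf/name_of_book.pdf', 3) would ask the tool for the third page only.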
I've tried to install pdftotext with this command:
sudo pip3 install pdftotext
But the installation failed.
Any suggestions?
Here is the output:
Collecting pdftotext
Downloading https://files.pythonhosted.org/packages/a6/a7/c202adb0bcd3adc3030b0c5f7f0e21f62a721913e93296e6c4ddc305cbd3/pdftotext-2.1.2.tar.gz (113kB)
100% |████████████████████████████████| 122kB 1.6MB/s
Installing collected packages: pdftotext
Running setup.py install for pdftotext ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_ext
building 'pdftotext' extension
creating build
creating build/temp.linux-x86_64-3.6
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPOPPLER_CPP_AT_LEAST_0_30_0=0 -I/usr/include/python3.6m -c pdftotext.cpp -o build/temp.linux-x86_64-3.6/pdftotext.o -Wall
pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
#include <poppler/cpp/poppler-document.h>
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-89f6lnjo/pdftotext/
The problem with installing pdftotext is resolved by installing the libpoppler-cpp-dev package, which provides the missing poppler header:
sudo apt-get install libpoppler-cpp-dev
The method works fine:
import pdftotext
from PyPDF2 import PdfFileReader, PdfFileWriter

path = '/media/aaa/WD/pdf/'
file_name_in = path + 'name_of_book.pdf'
file_name_out = path + 'extracted_page.pdf'

# Open the book and pick one page (getPage uses a zero-based index,
# so 2 is the third page).
pdfFileObj_in = open(file_name_in, 'rb')
pdfReader = PdfFileReader(pdfFileObj_in)
page_number = 2
extracted_page = pdfReader.getPage(page_number)

# Write the single page to its own .pdf file.
pdfWriter = PdfFileWriter()
pdfWriter.addPage(extracted_page)
pdfFileObj_out = open(file_name_out, 'wb')
pdfWriter.write(pdfFileObj_out)
pdfFileObj_in.close()
pdfFileObj_out.close()

# Read the one-page file back and convert it to text.
with open(file_name_out, 'rb') as f:
    pdf_text = pdftotext.PDF(f)

for page in pdf_text:
    print(page)
Nevertheless, it would be interesting to know how to avoid saving the extracted page to the hard disk and then reading it back, and instead pass the extracted page object directly to pdftotext. The direct approach, that is, pdftotext(extracted_page), unfortunately does not work.