PyPDF2 processing problem

Pavel_47 · (This post was last modified: Nov-11-2019, 07:11 PM by Pavel_47.)

Yes, I've also found some confirmation of this statement on the web.
The workaround I've come up with is as follows:

extract a page from the .pdf
convert it to text using pdftotext
finally, read the text page for processing
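
As a rough sketch of the same idea, assuming the pdftotext command-line tool (from poppler-utils) is on the PATH: its -f and -l options select the first and last page to convert, and "-" as the output name sends the text to stdout, so a single page can be converted from Python with one subprocess call. The names build_pdftotext_cmd and page_text are made up for illustration, and page numbers here are 1-based, as in the command-line tool.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page_number):
    """Build a pdftotext command that prints one page (1-based) to stdout."""
    page = str(page_number)
    # -f / -l select the first and last page; "-" sends the text to stdout.
    return ["pdftotext", "-f", page, "-l", page, pdf_path, "-"]


def page_text(pdf_path, page_number):
    """Run pdftotext and return the text of the requested page."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, page_number),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8")
```

Usage would look like page_text('/media/aaa/WD/pdf/name_of_book.pdf', 3). This avoids the Python pdftotext package entirely, at the cost of starting one external process per call.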

I've already tried pdftotext on a "difficult" .pdf file in the Linux terminal.
It works fine.
The problem is using pdftotext from Python.
I tried to install the pdftotext package with this command:
sudo pip3 install pdftotext
But the installation failed.
Any suggestions?

Here is the output:

Output:
Collecting pdftotext
  Downloading https://files.pythonhosted.org/packages/a6/a7/c202adb0bcd3adc3030b0c5f7f0e21f62a721913e93296e6c4ddc305cbd3/pdftotext-2.1.2.tar.gz (113kB)
    100% |████████████████████████████████| 122kB 1.6MB/s 
Installing collected packages: pdftotext
  Running setup.py install for pdftotext ... error
    Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    building 'pdftotext' extension
    creating build
    creating build/temp.linux-x86_64-3.6
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPOPPLER_CPP_AT_LEAST_0_30_0=0 -I/usr/include/python3.6m -c pdftotext.cpp -o build/temp.linux-x86_64-3.6/pdftotext.o -Wall
    pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
     #include <poppler/cpp/poppler-document.h>
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    
    ----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-89f6lnjo/pdftotext/

The problem with installing pdftotext was resolved by installing the libpoppler-cpp-dev package:

Output:
sudo apt-get install libpoppler-cpp-dev
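
The compiler error in the log comes from the missing header poppler/cpp/poppler-document.h, which the -dev package provides. A small check for it before running pip could look like the sketch below. The include directories /usr/include and /usr/local/include are an assumption (other systems may use different locations), and find_poppler_header is a hypothetical helper name.

```python
import os


def find_poppler_header(include_dirs=("/usr/include", "/usr/local/include")):
    """Return the first path where the poppler C++ header exists, else None."""
    relative = os.path.join("poppler", "cpp", "poppler-document.h")
    for directory in include_dirs:
        candidate = os.path.join(directory, relative)
        if os.path.isfile(candidate):
            return candidate
    return None


print(find_poppler_header() or "poppler C++ headers not found")
```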

The method works fine:

import os

import pdftotext
from PyPDF2 import PdfFileReader, PdfFileWriter

path = '/media/aaa/WD/pdf/'
file_name_in = os.path.join(path, 'name_of_book.pdf')
file_name_out = os.path.join(path, 'extracted_page.pdf')

page_number = 2  # zero-based index, i.e. the third page of the book

# Copy the selected page into a new single-page PDF on disk.
with open(file_name_in, 'rb') as pdfFileObj_in:
    pdfReader = PdfFileReader(pdfFileObj_in)
    extracted_page = pdfReader.getPage(page_number)

    pdfWriter = PdfFileWriter()
    pdfWriter.addPage(extracted_page)

    # Write while the input file is still open: PyPDF2 reads page data lazily.
    with open(file_name_out, 'wb') as pdfFileObj_out:
        pdfWriter.write(pdfFileObj_out)

# Read the single-page PDF back and convert it to text with pdftotext.
with open(file_name_out, 'rb') as f:
    pdf_text = pdftotext.PDF(f)

for page in pdf_text:
    print(page)

Nevertheless, it would be interesting to know how to avoid saving the extracted page to the hard disk and then reading it back, and instead pass the extracted page object directly to pdftotext. The direct approach, that is, pdftotext(extracted_page), unfortunately does not work.
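
One possible way around the temporary file, offered as a sketch rather than a verified answer: the extracted page is a PyPDF2 page object, not a PDF byte stream, which is presumably why it cannot be handed to pdftotext directly. pdftotext.PDF takes a file object opened in binary mode, so an in-memory io.BytesIO buffer may serve in place of the file on disk. The helper name page_to_text is hypothetical, the code assumes the same legacy PyPDF2 API (before 3.0) used in this post, and it has not been checked against every pdftotext version.

```python
import io


def page_to_text(pdf_path, page_index):
    """Return the text of one page without writing a temporary PDF to disk."""
    # Third-party imports are kept local so the helper can be defined
    # even where the packages are not installed yet.
    import pdftotext
    from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy API, PyPDF2 < 3.0

    buffer = io.BytesIO()
    with open(pdf_path, 'rb') as pdf_file:
        reader = PdfFileReader(pdf_file)
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(page_index))  # zero-based page index
        # Write into memory while the input file is still open.
        writer.write(buffer)

    buffer.seek(0)  # rewind so pdftotext reads from the start
    pdf = pdftotext.PDF(buffer)
    return pdf[0]  # the buffer holds exactly one page
```

Usage would be something like print(page_to_text('/media/aaa/WD/pdf/name_of_book.pdf', 2)).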
