Python Forum
PyPDF2 processing problem
#3
Yes, I've also found some confirmation of this statement on the web.
The workaround I'm exploring is as follows:
  1. Extract a page from the .pdf file.
  2. Convert it to text using pdftotext.
  3. Read the resulting text for processing.

I've already tried pdftotext on a "difficult" .pdf file in a Linux terminal, and it works fine.
The problem is how to use pdftotext from Python.
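
Since the command-line tool already works in the terminal, one option that needs no extra Python package is to call it through subprocess. The sketch below assumes the poppler pdftotext binary is on the PATH; the helper names build_pdftotext_cmd and page_text_via_cli are made up for illustration. It relies on the -f and -l options (first and last page) and on "-" as the output name, which sends the text to stdout.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page_number):
    """Build the argument list that converts one 1-based page to stdout."""
    page = str(page_number)
    # -f / -l select the first and last page; "-" sends the text to stdout.
    return ['pdftotext', '-f', page, '-l', page, pdf_path, '-']


def page_text_via_cli(pdf_path, page_number):
    """Run the pdftotext binary and return the text of a single page."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, page_number),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumption: the tool is left at its default UTF-8 output encoding.
    return result.stdout.decode('utf-8')
```

Note that the command-line tool counts pages from 1, so page_text_via_cli('name_of_book.pdf', 3) would target the third page.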
I've tried to install the pdftotext Python package with this command:
sudo pip3 install pdftotext
But the installation failed.
Any suggestions?

Here is the output:

Output:
Collecting pdftotext
  Downloading https://files.pythonhosted.org/packages/a6/a7/c202adb0bcd3adc3030b0c5f7f0e21f62a721913e93296e6c4ddc305cbd3/pdftotext-2.1.2.tar.gz (113kB)
    100% |████████████████████████████████| 122kB 1.6MB/s
Installing collected packages: pdftotext
  Running setup.py install for pdftotext ... error
    Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    building 'pdftotext' extension
    creating build
    creating build/temp.linux-x86_64-3.6
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPOPPLER_CPP_AT_LEAST_0_30_0=0 -I/usr/include/python3.6m -c pdftotext.cpp -o build/temp.linux-x86_64-3.6/pdftotext.o -Wall
    pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
     #include <poppler/cpp/poppler-document.h>
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

    ----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-89f6lnjo/pdftotext/

The installation problem was resolved by installing the libpoppler-cpp-dev package, which provides the missing poppler-document.h header:
sudo apt-get install libpoppler-cpp-dev

With that in place, the method works fine:

import pdftotext
from PyPDF2 import PdfFileReader, PdfFileWriter

path = '/media/aaa/WD/pdf/'
file_name_in = path + 'name_of_book.pdf'
file_name_out = path + 'extracted_page.pdf'

page_number = 2  # zero-based index, i.e. the third page of the book

# Copy the selected page into a new single-page PDF on disk.
with open(file_name_in, 'rb') as pdf_file_in:
    pdf_reader = PdfFileReader(pdf_file_in)
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf_reader.getPage(page_number))
    # Write while the input file is still open: PyPDF2 reads page data lazily.
    with open(file_name_out, 'wb') as pdf_file_out:
        pdf_writer.write(pdf_file_out)

# Convert the single-page PDF to text with pdftotext.
with open(file_name_out, 'rb') as f:
    pdf_text = pdftotext.PDF(f)

for page in pdf_text:
    print(page)

Nevertheless, it would be interesting to know how to avoid saving the extracted page to the hard disk and then reading it back, and instead pass the extracted page object directly to pdftotext. Unfortunately, the direct approach, that is, pdftotext(extracted_page), does not work.
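
One possible way around the temporary file is to let PyPDF2 write the single-page PDF into an in-memory buffer and hand that buffer to pdftotext. This is only a sketch: the helper names render_to_buffer and extract_page_text are hypothetical, it uses the same pre-3.0 PyPDF2 calls as the code above, and it assumes pdftotext.PDF is satisfied with any binary object that has a read() method rather than a real file.

```python
import io


def render_to_buffer(writer):
    """Serialize `writer` (anything with a write(stream) method) into memory."""
    buffer = io.BytesIO()
    writer.write(buffer)  # same call as writing to a file opened with 'wb'
    buffer.seek(0)        # rewind so the next reader starts at the first byte
    return buffer


def extract_page_text(pdf_path, page_number):
    """Return the text of one zero-based page without a temporary file."""
    # Imported here so render_to_buffer stays usable without these packages.
    import pdftotext
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(pdf_path, 'rb') as pdf_in:
        reader = PdfFileReader(pdf_in)
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(page_number))
        # Serialize while the input file is still open.
        buffer = render_to_buffer(writer)

    # Assumption: pdftotext.PDF only calls read() on the object it is given.
    pdf = pdftotext.PDF(buffer)
    return pdf[0]  # the buffer holds exactly one page
```

With the paths from the code above, the call would look like print(extract_page_text(file_name_in, 2)). The seek(0) matters: after write() the buffer position is at the end, so a read() without rewinding would return nothing.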
Messages In This Thread
PyPDF2 processing problem - by Pavel_47 - Nov-11-2019, 04:52 PM
RE: PyPDF2 processing problem - by Larz60+ - Nov-11-2019, 06:07 PM
RE: PyPDF2 processing problem - by Pavel_47 - Nov-11-2019, 06:28 PM
RE: PyPDF2 processing problem - by Larz60+ - Nov-11-2019, 10:21 PM
RE: PyPDF2 processing problem - by Pavel_47 - Nov-12-2019, 08:33 AM
RE: PyPDF2 processing problem - by Larz60+ - Nov-12-2019, 12:35 PM
RE: PyPDF2 processing problem - by chaitanya - May-04-2021, 06:58 AM