PyPDF2 processing problem - Printable Version

PyPDF2 processing problem - Pavel_47 - Nov-11-2019

Hello,

I've run into a problem using the PyPDF2 module.
For some books the text extraction works; for others it does not (the extracted text is empty).

page_number = 11
pageObj = pdfReader.getPage(page_number)
text = pageObj.extractText()

Any ideas?

Thanks.

From what I've observed, it doesn't work for books from "Packt Publishing".
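To see how widespread the problem is within one book, the pages that come back empty can be listed. This is only a minimal sketch: `pages_without_text` is a hypothetical helper name, and it assumes the same PyPDF2 1.x reader object (`pdfReader`) used in the snippet above.

```python
def pages_without_text(pdf_reader):
    """Return the zero-based indices of pages whose extracted text is empty."""
    empty = []
    for index in range(pdf_reader.getNumPages()):
        text = pdf_reader.getPage(index).extractText()
        if not text.strip():  # nothing but whitespace counts as empty
            empty.append(index)
    return empty

# Usage (assuming pdfReader was created as in the post above):
# print(pages_without_text(pdfReader))
```

If every page of a book is listed, the problem probably lies in how that PDF was produced rather than in one particular page.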

RE: PyPDF2 processing problem - Larz60+ - Nov-11-2019

Unfortunately, this is the nature of PDF, a most data-unfriendly format.

RE: PyPDF2 processing problem - Pavel_47 - Nov-11-2019

Yes, I've also found some confirmation of this on the web.
The workaround I'm exploring is as follows:

extract a page from .pdf
convert it to text using pdftotext
finally read text page for processing
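One way to sketch these steps without any extra Python package is to call the poppler command-line tool through `subprocess`. The function names below are hypothetical, and the sketch assumes the `pdftotext` binary is on the PATH. Its `-f` and `-l` options select the first and last page, and `-` as the output file sends the text to stdout, so the separate page-extraction step is not needed.

```python
import subprocess


def build_pdftotext_command(pdf_path, page_index):
    """Return the argument list that extracts a single page to stdout."""
    # The pdftotext CLI counts pages from 1, while PyPDF2 counts from 0.
    page = page_index + 1
    return ['pdftotext', '-f', str(page), '-l', str(page), pdf_path, '-']


def page_text_via_cli(pdf_path, page_index):
    """Run pdftotext and return the text of one page as a string."""
    result = subprocess.run(build_pdftotext_command(pdf_path, page_index),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode('utf-8', errors='replace')

# Usage (hypothetical file name):
# print(page_text_via_cli('name_of_book.pdf', 11))
```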

I've already tried pdftotext on a "difficult" .pdf file in a Linux terminal.
It works fine.
The problem is using pdftotext from Python.
I tried to install the pdftotext package with this command:
sudo pip3 install pdftotext
But the installation failed.
Any suggestions?

Here is output:

Output:
Collecting pdftotext
  Downloading https://files.pythonhosted.org/packages/a6/a7/c202adb0bcd3adc3030b0c5f7f0e21f62a721913e93296e6c4ddc305cbd3/pdftotext-2.1.2.tar.gz (113kB)
    100% |████████████████████████████████| 122kB 1.6MB/s 
Installing collected packages: pdftotext
  Running setup.py install for pdftotext ... error
    Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    building 'pdftotext' extension
    creating build
    creating build/temp.linux-x86_64-3.6
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPOPPLER_CPP_AT_LEAST_0_30_0=0 -I/usr/include/python3.6m -c pdftotext.cpp -o build/temp.linux-x86_64-3.6/pdftotext.o -Wall
    pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
     #include <poppler/cpp/poppler-document.h>
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    
    ----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-89f6lnjo/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-jkqwtndx-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-89f6lnjo/pdftotext/

The problem with installing pdftotext is resolved by installing the libpoppler-cpp-dev package, which provides the missing poppler-document.h header:

sudo apt-get install libpoppler-cpp-dev

The method works fine:

import os

import pdftotext
from PyPDF2 import PdfFileReader, PdfFileWriter

path = '/media/aaa/WD/pdf/'
file_name_in = os.path.join(path, 'name_of_book.pdf')
file_name_out = os.path.join(path, 'extracted_page.pdf')

page_number = 2  # zero-based index, i.e. the third page of the book

# Copy the single page into a new one-page PDF on disk.
# The input file must stay open until the writer has finished,
# because PyPDF2 reads page data lazily.
with open(file_name_in, 'rb') as pdfFileObj_in:
    pdfReader = PdfFileReader(pdfFileObj_in)
    pdfWriter = PdfFileWriter()
    pdfWriter.addPage(pdfReader.getPage(page_number))
    with open(file_name_out, 'wb') as pdfFileObj_out:
        pdfWriter.write(pdfFileObj_out)

# Let pdftotext (poppler) parse the one-page PDF and print its text.
with open(file_name_out, 'rb') as f:
    pdf_text = pdftotext.PDF(f)
for page in pdf_text:
    print(page)

Nevertheless, it would be interesting to know how to avoid saving the extracted page to the hard disk and then reading it back, and instead pass the extracted page object directly to pdftotext. The direct approach, that is, pdftotext.PDF(extracted_page), unfortunately does not work.
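A possible way around the temporary file is to write the one-page PDF into an in-memory buffer. The sketch below assumes that `PdfFileWriter.write` accepts any writable binary stream and that `pdftotext.PDF` only needs an object with a `read()` method. Both function names are hypothetical.

```python
import io


def render_in_memory(pdf_writer, parse_pdf):
    """Write the writer's pages into RAM and hand the buffer to parse_pdf."""
    buffer = io.BytesIO()
    pdf_writer.write(buffer)  # serialise the PDF into memory instead of a file
    buffer.seek(0)            # rewind so the parser reads from the start
    return list(parse_pdf(buffer))


def extract_page_text(pdf_path, page_number):
    """Return the text of one page (zero-based index) without a temp file."""
    # Third-party imports are kept local so render_in_memory has no dependencies.
    import pdftotext
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(pdf_path, 'rb') as f_in:
        reader = PdfFileReader(f_in)
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(page_number))
        # The input file must still be open while the writer serialises the page.
        pages = render_in_memory(writer, pdftotext.PDF)
    return pages[0]

# Usage (file name taken from the script above):
# print(extract_page_text('/media/aaa/WD/pdf/name_of_book.pdf', 2))
```

If PyPDF2 is not needed for anything else, it may be simpler to open the whole book with pdftotext.PDF and index the resulting object by page number.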

RE: PyPDF2 processing problem - Larz60+ - Nov-11-2019

There is a blog post on this subject here: http://www.verypdf.com/wordpress/201701/how-to-convert-from-pdf-to-text-in-memory-completely-43204.html

RE: PyPDF2 processing problem - Pavel_47 - Nov-12-2019

Thanks.
It would be nice to have something similar in Python, and under Linux.

RE: PyPDF2 processing problem - Larz60+ - Nov-12-2019

PyPDF2 is a pure-Python library descended from pyPdf (which dates back to 2005), and it was last released in 2016.
FYI there's also https://pypi.org/project/PyPDF3/, a fork of PyPDF2 that is also pure Python (which I haven't tried).

RE: PyPDF2 processing problem - chaitanya - May-04-2021

Hi All

Using PyPDF2, I'm trying to resize a PDF page from its existing size (549, 749) to a new size (2308, 3500). I'm able to resize the page, but the text is not resized accordingly. I need the text to be resized along with the page.
Below is the code I used:

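import PyPDF2

# 'path' must point to the input PDF (it is not defined in this snippet).
with open(path, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    print(pageObj.mediaBox)  # page size before scaling
    pageObj.scaleTo(2308, 3500)  # target width and height

    writer = PyPDF2.PdfFileWriter()
    writer.addPage(pageObj)
    # Write while the input file is still open; PyPDF2 reads page data lazily.
    with open(r"intput", "wb+") as f:
        writer.write(f)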
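One thing worth checking is the arithmetic behind the resize. As far as I understand it, scaleTo derives separate horizontal and vertical factors from the current media box, and for the sizes quoted above those two factors differ, so the content would be stretched rather than scaled evenly. The sketch below only does that arithmetic; `scale_factors` is a hypothetical helper, not part of PyPDF2.

```python
def scale_factors(old_width, old_height, new_width, new_height):
    """Return (sx, sy, uniform): per-axis factors and the largest factor
    that keeps the aspect ratio while fitting inside the new size."""
    sx = new_width / float(old_width)
    sy = new_height / float(old_height)
    return sx, sy, min(sx, sy)


sx, sy, uniform = scale_factors(549, 749, 2308, 3500)
print(sx, sy, uniform)

# A uniform resize could then be tried with PyPDF2's scaleBy, e.g.:
# pageObj.scaleBy(uniform)
```

With a uniform factor the page will not reach exactly 2308 x 3500, because the original and target sizes have different aspect ratios.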