PyPDF2 processing problem
Python Forum (https://python-forum.io), Forum: General Coding Help (https://python-forum.io/forum-8.html), Thread: /thread-22407.html
PyPDF2 processing problem - Pavel_47 - Nov-11-2019

Hello,
I've run into a problem using the PyPDF2 module. For some books text extraction works; for others it doesn't (the extracted text is empty).

    page_number = 11
    pageObj = pdfReader.getPage(page_number)
    text = pageObj.extractText()

Any ideas? Thanks.

From what I've observed, it doesn't work for books from "Packt Publishing".

RE: PyPDF2 processing problem - Larz60+ - Nov-11-2019

Unfortunately, this is the nature of PDF, a most data-unfriendly format.

RE: PyPDF2 processing problem - Pavel_47 - Nov-11-2019

Yes, I've also found some confirmation of this on the web. The workaround I'm exploring is as follows:
I've already tried pdftotext on a "difficult" .pdf file in the Linux terminal, and it works fine. The problem is using pdftotext from Python. I tried to install it with this command:

    sudo pip3 install pdftotext

but the installation failed. Any suggestions? Here is the output:
The problem with installing pdftotext was resolved by installing the libpoppler-cpp-dev package:
The method works fine:

    import pdftotext
    from PyPDF2 import PdfFileReader, PdfFileWriter

    path = '/media/aaa/WD/pdf/'
    file_name_in = path + 'name_of_book.pdf'
    file_name_out = path + 'extracted_page.pdf'

    # Copy a single page of the book into a new one-page PDF on disk
    pdfFileObj_in = open(file_name_in, 'rb')
    pdfReader = PdfFileReader(pdfFileObj_in)
    page_number = 2
    extracted_page = pdfReader.getPage(page_number)
    pdfWriter = PdfFileWriter()
    pdfWriter.addPage(extracted_page)
    pdfFileObj_out = open(file_name_out, 'wb')
    pdfWriter.write(pdfFileObj_out)
    pdfFileObj_in.close()
    pdfFileObj_out.close()

    # Read the one-page PDF back and extract its text with pdftotext
    with open(file_name_out, 'rb') as f:
        pdf_text = pdftotext.PDF(f)
    for page in pdf_text:
        print(page)

Nevertheless, it would be interesting to know how to avoid saving the extracted page to disk and then reading it back, and instead pass the extracted page object directly to pdftotext. The direct approach, that is, pdftotext(extracted_page), unfortunately does not work.

RE: PyPDF2 processing problem - Larz60+ - Nov-11-2019

There is a blog post on this subject here: http://www.verypdf.com/wordpress/201701/how-to-convert-from-pdf-to-text-in-memory-completely-43204.html

RE: PyPDF2 processing problem - Pavel_47 - Nov-12-2019

Thanks. It would be nice to have something similar in Python ... and under Linux.

RE: PyPDF2 processing problem - Larz60+ - Nov-12-2019

PyPDF2 dates back to 2005 (as pyPdf) and was last released in 2016. It is written in pure Python, not C.
FYI, there's also https://pypi.org/project/PyPDF3/, which is likewise pure Python (and which I haven't tried).

RE: PyPDF2 processing problem - chaitanya - May-04-2021

Hi all,
Using PyPDF2, I'm trying to resize a PDF page from its existing size (549, 749) to a new size (2308, 3500). I'm able to resize the page, but the text is not resized accordingly.
I need the text to be resized along with the page. Below is the code I used:

    import PyPDF2

    # path: location of the source PDF (not shown in the post)
    pdfFileObj = open(path, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    pageObj.mediaBox  # shows the current page size when run interactively
    pageObj.scaleTo(2308, 3500)

    writer = PyPDF2.PdfFileWriter()
    writer.addPage(pageObj)
    with open(r"intput", "wb+") as f:
        writer.write(f)
    pdfFileObj.close()
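
One thing that may be worth checking in the resize question above: the target size (2308, 3500) does not have the same aspect ratio as the original (549, 749), so scaling to it stretches the page by different factors horizontally and vertically. The following is only an arithmetic sketch, not a confirmed diagnosis of the problem. `scale_factors` and `uniform_size` are hypothetical helper names, and it assumes both sizes quoted in the post are in the same units.

```python
def scale_factors(old_size, new_size):
    """Return the (horizontal, vertical) factors needed to go from old_size to new_size."""
    old_w, old_h = old_size
    new_w, new_h = new_size
    return new_w / old_w, new_h / old_h


def uniform_size(old_size, new_size):
    """Largest size that fits inside new_size while keeping old_size's aspect ratio."""
    factor = min(scale_factors(old_size, new_size))
    old_w, old_h = old_size
    return old_w * factor, old_h * factor


# Sizes taken from the post above
sx, sy = scale_factors((549, 749), (2308, 3500))
print(f"horizontal factor: {sx:.3f}, vertical factor: {sy:.3f}")

width, height = uniform_size((549, 749), (2308, 3500))
print(f"aspect-preserving size: {width:.0f} x {height:.0f}")
```

The two factors differ (roughly 4.20 horizontally versus 4.67 vertically), so any content that is scaled with the page will look stretched. PyPDF2 1.x also has `scaleBy(factor)` for applying a single factor to both axes; whether that resolves the text issue for this particular file is not established by the thread.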
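
Regarding the earlier question in this thread about avoiding the temporary file on disk: one possible approach is to write the single-page PDF into an in-memory buffer with `io.BytesIO` and hand that buffer to `pdftotext.PDF`, which expects a binary file object. This is a minimal sketch, not something posted in the thread. It assumes the PyPDF2 1.x names used above (`PdfFileReader`, `PdfFileWriter`, `getPage`, `addPage`); `write_to_memory` and `extract_page_text` are hypothetical helper names.

```python
import io


def write_to_memory(pdf_writer):
    """Serialize a writer object into an in-memory buffer, rewound to the start."""
    buffer = io.BytesIO()
    pdf_writer.write(buffer)
    buffer.seek(0)  # rewind so the next reader starts at the first byte
    return buffer


def extract_page_text(pdf_path, page_number):
    """Extract the text of one page without writing an intermediate file."""
    # Third-party imports are kept local so write_to_memory stays usable
    # even where PyPDF2 or pdftotext is not installed.
    import pdftotext
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(pdf_path, "rb") as source:
        reader = PdfFileReader(source)
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(page_number))
        # Write while the source file is still open: PyPDF2 reads page data lazily
        buffer = write_to_memory(writer)

    # The buffer holds a complete one-page PDF, so the text is page 0
    pdf = pdftotext.PDF(buffer)
    return pdf[0]


# Example call, using the placeholder path from the post above:
# print(extract_page_text('/media/aaa/WD/pdf/name_of_book.pdf', 2))
```

The buffer replaces `extracted_page.pdf` from the earlier code. A PyPDF2 page object cannot be passed to pdftotext directly because pdftotext needs the bytes of a complete PDF file, which is what the writer produces.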