Python Forum

Hi,
I have been testing pdfplumber and pdfminer, and at this stage I am not sure which one I prefer. Pdfminer does a better job of extracting text from an unstructured PDF, but it doesn't seem easy to use. It looks like it takes a lot more code to open a PDF and process it page by page with pdfminer than with pdfplumber.

Does anyone know of a more concise way to do that in pdfminer than shown below:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

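# Open the PDF in binary mode; the with-block closes the file when done.
with open('file', 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # PDFDevice itself discards the output; a converter device is needed to capture text.
    device = PDFDevice(rsrcmgr)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process the document one page at a time.
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)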

Thanks!

Most use pdfminer.six now.

I'm not familiar with pdfplumber, but it looks interesting. Let us know your experience with it.

Please keep in mind that a PDF file is a very complicated object and can take many forms.
For example, its contents can be any combination of:

images
pure text
tables
text as images (which can only be extracted using some form of OCR)

And I probably missed some.

The documentation for pdfminer.six shows some rather simple methods: https://pdfminersix.readthedocs.io/en/la...level.html
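As a rough illustration of those simpler methods, here is a minimal sketch built on extract_text from pdfminer.high_level, whose page_numbers argument takes zero-based page indices. The helper names zero_based and text_of_pages and the file name in the usage comment are made up for this example, not part of pdfminer.six.

```python
def zero_based(pages):
    """Convert 1-based page numbers (as a reader counts them) to 0-based indices."""
    return [page - 1 for page in pages]


def text_of_pages(path, pages):
    """Return the text of the given 1-based pages as one string."""
    # Imported here so the helper above can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    # page_numbers expects zero-based indices.
    return extract_text(path, page_numbers=zero_based(pages))


# Hypothetical usage, assuming a file named "file.pdf" exists:
# print(text_of_pages("file.pdf", [1, 2]))
```

This returns the selected pages as a single string, so it suits the case where only a few known pages are needed.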

(Jan-30-2021, 12:17 PM) Larz60+ wrote: Most use pdfminer.six now.

I'll try to get my head around pdfminer.six, but I'm struggling to understand how to make it extract text page by page instead of the whole document at once. For that purpose I recommend pdfplumber:

import pdfplumber

with pdfplumber.open(r'...\file.pdf') as pdf:
    # Extract text from the first two pages only.
    for page in pdf.pages[:2]:
        text = page.extract_text()
        print(text)
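For comparison, page-by-page extraction also looks possible in pdfminer.six through extract_pages from pdfminer.high_level, which yields one layout object per page. The following is only a sketch under that assumption: the names page_text and iter_page_texts and the file name in the usage comment are invented for illustration, and the get_text check is a simplification of the isinstance test against LTTextContainer that the pdfminer.six documentation uses.

```python
def page_text(layout_elements):
    """Join the text of every element that exposes get_text()."""
    return "".join(
        element.get_text()
        for element in layout_elements
        if hasattr(element, "get_text")
    )


def iter_page_texts(path, max_pages=0):
    """Yield the text of each page in turn; max_pages=0 means all pages."""
    # Imported here so page_text can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(path, maxpages=max_pages):
        # Each page layout is iterable over its top-level layout elements.
        yield page_text(page_layout)


# Hypothetical usage, assuming a file named "file.pdf" exists:
# for number, text in enumerate(iter_page_texts("file.pdf", max_pages=2), start=1):
#     print(number, text)
```

Because iter_page_texts is a generator, each page is parsed only when the loop asks for it, which keeps the per-page behaviour close to the pdfplumber loop above.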
