Python Forum

Full Version: pdfminer vs pdfplumber
Hi,
I have been testing pdfplumber and pdfminer, and at this stage I am not sure which one I prefer. pdfminer does a better job of extracting text from an unstructured PDF, but it doesn't seem easy to use. It looks like it takes a lot more code to open a PDF on a per-page basis with pdfminer than with pdfplumber.

Does anyone know of a more concise way to do that in pdfminer than the one shown below?

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

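# Open the PDF in binary mode; the with block closes the file afterwards.
with open('file', 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # PDFDevice is a no-op device: pages are interpreted but nothing is output.
    device = PDFDevice(rsrcmgr)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process the document one page at a time.
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)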
Thanks!
Most use pdfminer.six now.

I'm not familiar with pdfplumber, but it looks interesting. Let us know your experience with it.

Please keep in mind that a PDF file is a very complicated object and can take many forms.
For example, its contents can be any combination of:
  • images
  • pure text
  • tables
  • text as images (which can only be extracted using some form of OCR)
And I probably missed some.

The documentation for pdfminer.six shows some rather simple methods: https://pdfminersix.readthedocs.io/en/la...level.html
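
As a rough illustration of what those simple methods look like, here is a minimal sketch, assuming the linked page refers to the `pdfminer.high_level` module and that pdfminer.six is installed. `pdf_to_text` is a hypothetical helper name, not part of the library; `extract_text` and its `page_numbers` argument are the library's own.

```python
def pdf_to_text(path, page_numbers=None):
    """Return the text of a PDF as one string.

    page_numbers is an optional list of zero-based page indexes;
    None (the default) means every page.
    """
    # Imported inside the function so this module still loads on a
    # machine where pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    # pdf_to_text is a hypothetical wrapper; extract_text does the work.
    return extract_text(path, page_numbers=page_numbers)
```

Calling `pdf_to_text('file.pdf')` should return the whole document as a single string, while `pdf_to_text('file.pdf', page_numbers=[0])` should restrict the output to the first page.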
(Jan-30-2021, 12:17 PM) Larz60+ wrote: Most use pdfminer.six now.

I'll try to get my head around pdfminer.six, but I'm struggling to understand how to make it extract text page by page instead of the whole document at once. For that purpose I recommend pdfplumber:

import pdfplumber

with pdfplumber.open(r'...\file.pdf') as pdf:
    # Slicing avoids an IndexError when the PDF has fewer than two pages.
    for page in pdf.pages[:2]:
        text = page.extract_text()
        print(text)
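
For comparison, page-by-page extraction also seems possible in pdfminer.six without the low-level boilerplate. The following is only a sketch, assuming pdfminer.six is installed: `iter_page_texts` is a hypothetical helper name, while `extract_pages` and `LTTextContainer` come from the library.

```python
def iter_page_texts(path, max_pages=0):
    """Yield (page_number, text) pairs, one page at a time.

    max_pages=0 means no limit, matching pdfminer.six's maxpages argument.
    """
    # Imported inside the function so this module still loads on a
    # machine where pdfminer.six is not installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    # extract_pages yields one layout object per page, lazily.
    pages = extract_pages(path, maxpages=max_pages)
    for number, layout in enumerate(pages, start=1):
        # Keep only the elements that carry text (skips images, lines, etc.).
        parts = [
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
        ]
        yield number, "".join(parts)
```

Looping with `for number, text in iter_page_texts('file.pdf', max_pages=2): print(text)` would mirror the pdfplumber snippet above. The text layout may differ between the two libraries, so it is worth comparing the output on a sample file.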