Python Forum
pdfminer vs pdfplumber
#1
Hi,
I have been testing pdfplumber and pdfminer, and at this stage I am not sure which one I prefer. pdfminer does a better job of extracting text from an unstructured PDF, but it doesn't seem easy to use. It looks like it takes a lot more code to open a PDF and process it page by page with pdfminer than with pdfplumber.

Does anyone know of a more concise way to do that in pdfminer than the code shown below?

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

# Open the PDF in binary mode; the with block closes the file afterwards
with open('file', 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # Note: the base PDFDevice only walks each page and does not collect text
    device = PDFDevice(rsrcmgr)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process the document one page at a time
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
Thanks!
#2
Most people use pdfminer.six now, the community-maintained fork of pdfminer.

I'm not familiar with pdfplumber, but it looks interesting. Let us know your experience with it.

Please keep in mind that a PDF file is a very complicated object and can take many forms.
For example, its contents can be any combination of:
  • images
  • pure text
  • tables
  • text as images (which can only be extracted using some form of OCR)
And I probably missed some.

The documentation for pdfminer.six shows some rather simple methods: https://pdfminersix.readthedocs.io/en/la...level.html
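
For the page-by-page case, here is a minimal sketch built on extract_pages from pdfminer.high_level, which yields one layout object per page. The helper names join_page_text and iter_page_texts are made up for this example, and the get_text() duck-typing check is an assumption of the sketch (the docs use an isinstance test against LTTextContainer instead), so treat it as a starting point, not the library's recommended recipe.

```python
def join_page_text(elements):
    """Concatenate the text of every layout element that carries text."""
    # Assumption of this sketch: text boxes expose get_text(), while
    # figures, lines and images do not, so duck typing is enough here.
    return "".join(el.get_text() for el in elements if hasattr(el, "get_text"))


def iter_page_texts(pdf_path):
    """Yield (page_number, text) pairs, one PDF page at a time."""
    # Imported lazily so join_page_text() works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        yield page_number, join_page_text(page_layout)


# Example use (the path is a placeholder):
# for number, text in iter_page_texts("file.pdf"):
#     print(f"--- page {number} ---")
#     print(text)
```

If you only need a few pages as plain strings, extract_text from the same module also accepts a page_numbers argument (zero-based).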
#3
(Jan-30-2021, 12:17 PM)Larz60+ Wrote: Most use pdfminer.six now.

I'll try to get my head around pdfminer.six, but I'm struggling to understand how to make it extract text page by page instead of the whole document at once. For that purpose I recommend pdfplumber:

import pdfplumber

with pdfplumber.open(r'...\file.pdf') as pdf:
    # Extract and print the text of the first two pages only
    for page in pdf.pages[:2]:
        text = page.extract_text()
        print(text)
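
To go through every page instead of just the first two, something like the sketch below should work. collect_page_texts and print_pdf_text are hypothetical helper names, and the `or ""` guard assumes extract_text() can come back empty (None or an empty string) for a page with no text layer, such as a scanned image.

```python
def collect_page_texts(pages):
    """Return a list of (page_number, text) tuples, numbered from 1."""
    results = []
    for page_number, page in enumerate(pages, start=1):
        # Normalise a missing text layer (None or "") to an empty string.
        text = page.extract_text() or ""
        results.append((page_number, text))
    return results


def print_pdf_text(pdf_path):
    """Print the text of every page in the PDF at pdf_path."""
    # Imported lazily so collect_page_texts() works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for number, text in collect_page_texts(pdf.pages):
            print(f"--- page {number} ---")
            print(text)
```

Keeping the loop in a function that accepts any sequence of page objects makes it easy to pass a slice such as pdf.pages[:2] when only part of the document is needed.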