Python Forum

How to remove footer from PDF when extracting to text
Hi,
I'm trying to strip a footer from a 550-page PDF and then extract everything that's left to a .txt file. The extraction works, but the footer is still there, and I don't understand why.

Footer:

JOHN DOE | List Collection 11 of 550
Proprietary & Confidential

with pdfplumber.open(pdfFilePath) as pdf:
    k = len(pdf.pages)
    for i in range(1, k):
        page = pdf.pages[i]
        text = (page.extract_text())
        footer_pattern = re.search("^JOHN.*Confidential$", text)
        if footer_pattern:
            new_text = text.replace(footer_pattern, '')
            with open(txtFilePath, 'a') as txtFile:
                txtFile.write(new_text)
With this code, the new text file is created but empty. If I change it to
        if footer_pattern:
            text = text.replace(footer_pattern, '')
            with open(txtFilePath, 'a') as txtFile:
                txtFile.write(text)
the footers are still there. I tried the same pattern in a simpler block like this and it seems to work, so I think it may have to do with a line break or something, but I honestly don't know:

txt = "JOHN DOE | List Collection Proprietary & Confidential"
x = re.search("^JOHN.*Confidential$", txt)

if x:
  newtext = txt.replace(txt, 'success')
  print(newtext)
else:
  print("No match")
Also, I believe the page numbers are separate from the text in the footer and I can't seem to remove those either.
Try with the re.DOTALL flag
footer_pattern = re.search("(?s)^JOHN.*Confidential$", text)
(Dec-12-2022, 05:59 PM) Gribouillis wrote: Try with the re.DOTALL flag
footer_pattern = re.search("(?s)^JOHN.*Confidential$", text)

Just added the (?s) and re.DOTALL and got the same result. Footer is still in the text file. Does this look correct for that flag?

with pdfplumber.open(pdfFilePath) as pdf:
    k = len(pdf.pages)
    for i in range(1, k):
        page = pdf.pages[i]
        text = (page.extract_text())
        footer_pattern = re.search("(?s)^JOHN.*Confidential$", text, re.DOTALL)

        if footer_pattern:
            text = text.replace(footer_pattern, '')

        with open(txtFilePath, 'a') as txtFile:
            txtFile.write(text)
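Two things in the attempts above look likely to explain the result. First, without re.MULTILINE, ^ and $ only match at the very start and end of the whole string, so a footer at the bottom of a page never matches, even with re.DOTALL. Second, re.search returns a Match object, not a string, and str.replace raises a TypeError if it is given one. Below is a minimal sketch that uses re.sub instead. The sample text, the FOOTER_RE name and the strip_footer helper are made up for illustration, and the real output of extract_text() may be laid out differently.

```python
import re

# Hypothetical pattern: from a line that starts with JOHN up to a line that
# ends with Confidential. MULTILINE lets ^ and $ match at line boundaries,
# DOTALL lets . cross the line break inside the footer. The lazy .*? stops
# at the first "Confidential"; a body line starting with JOHN would also
# trigger it, so the pattern may need tightening for real data.
FOOTER_RE = re.compile(r"^JOHN.*?Confidential[ \t]*$\n?", re.MULTILINE | re.DOTALL)


def strip_footer(text):
    """Return text with every footer match removed."""
    return FOOTER_RE.sub("", text)


# Made-up page text standing in for page.extract_text()
sample = (
    "Item one\n"
    "Item two\n"
    "JOHN DOE | List Collection 11 of 550\n"
    "Proprietary & Confidential"
)

# The pattern from the thread finds nothing, because ^ is tied to position 0
print(re.search("(?s)^JOHN.*Confidential$", sample))

print(repr(strip_footer(sample)))
```

In the loop from the thread, this would mean calling strip_footer(text) on each page and writing the result unconditionally. Note also that range(1, k) skips pdf.pages[0]; if the first page should be included, iterate over pdf.pages directly.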
Hi, I come across this in zillions of documents,
but I use another OCR module.
Still, once you have
text = (page.extract_text())
nothing prevents you from splitting "text" into a list of strings (.split()).
You can then use slicing to keep only the text that has value.
Paul
P.S.
Using "extract_words()" may be even quicker.
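The split-and-slice idea can be sketched roughly as follows. This assumes the footer always occupies the last few lines of each page's extracted text, which is not confirmed in the thread. The drop_footer_lines helper and its footer_line_count parameter are hypothetical names, and the sample text is made up. The sketch splits on line breaks with splitlines() rather than a bare split(), so that whole lines are dropped and not individual words.

```python
def drop_footer_lines(page_text, footer_line_count=2):
    """Drop the last footer_line_count lines of a page's text.

    Assumes the footer is always the final lines of the page. If the page
    number comes out as its own line, footer_line_count would need to be 3.
    """
    if footer_line_count <= 0:
        return page_text
    lines = page_text.splitlines()
    # Slicing keeps everything except the trailing footer lines
    return "\n".join(lines[:-footer_line_count])


# Made-up page text standing in for page.extract_text()
sample = (
    "Item one\n"
    "Item two\n"
    "JOHN DOE | List Collection 11 of 550\n"
    "Proprietary & Confidential"
)

print(drop_footer_lines(sample))
```

This avoids regular expressions entirely, but it removes real content on any page where the footer is missing or shorter than expected, so checking the dropped lines against the known footer text before discarding them may be safer.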