Python Forum
How to remove footer from PDF when extracting to text
#1
Hi,
I'm trying to strip the footer from a 550-page PDF and then extract the remaining text to a .txt file. The extraction works, but the footer is still there, and I don't understand why.

Footer:

Quote: JOHN DOE | List Collection 11 of 550
Proprietary & Confidential

with pdfplumber.open(pdfFilePath) as pdf:
    k = len(pdf.pages)
    for i in range(1, k):
        page = pdf.pages[i]
        text = (page.extract_text())
        footer_pattern = re.search("^JOHN.*Confidential$", text)
        if footer_pattern:
            new_text = text.replace(footer_pattern, '')
            with open(txtFilePath, 'a') as txtFile:
                txtFile.write(new_text)
With this code, the new text file is created, but it is empty. If I change it to:
        if footer_pattern:
            text = text.replace(footer_pattern, '')
            with open(txtFilePath, 'a') as txtFile:
                txtFile.write(text)
the footers are still there. I've tried the same pattern in a simpler block like the one below, and it seems to work. I think it may have to do with a line break or something, but I honestly don't know:

txt = "JOHN DOE | List Collection Proprietary & Confidential"
x = re.search("^JOHN.*Confidential$", txt)

if x:
  newtext = txt.replace(txt, 'success')
  print(newtext)
else:
  print("No match")
Also, I believe the page numbers are separate from the text in the footer and I can't seem to remove those either.
#2
Try with the re.DOTALL flag
footer_pattern = re.search("(?s)^JOHN.*Confidential$", text)
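A likely reason the flag alone does not help: without re.MULTILINE, ^ and $ only match at the very start and very end of the whole string, and the footer sits at the end of the page text rather than at the start. The sketch below compares the flag combinations on a made-up page_text, which is only an assumption about what extract_text() returns for one page.

```python
import re

# Hypothetical page text: two body lines followed by the two-line footer.
page_text = (
    "First line of the page body\n"
    "Second line of the page body\n"
    "JOHN DOE | List Collection 11 of 550\n"
    "Proprietary & Confidential"
)

pattern = r"^JOHN.*Confidential$"

# No flags: ^ only matches at position 0, and the text starts with "First".
no_flags = re.search(pattern, page_text)

# DOTALL only: "." now crosses newlines, but ^ is still tied to position 0.
dotall_only = re.search(pattern, page_text, re.DOTALL)

# MULTILINE + DOTALL: ^ can match at the start of any line.
both_flags = re.search(pattern, page_text, re.MULTILINE | re.DOTALL)

print(no_flags)
print(dotall_only)
print(repr(both_flags.group(0)))
```

Note also that re.search returns a match object, not a string, so it cannot be passed to str.replace directly. For the removal itself, use match.group(0) or re.sub.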
#3
(Dec-12-2022, 05:59 PM)Gribouillis Wrote: Try with the re.DOTALL flag
footer_pattern = re.search("(?s)^JOHN.*Confidential$", text)

I just added both (?s) and re.DOTALL and got the same result: the footer is still in the text file. Does this look correct for that flag?

with pdfplumber.open(pdfFilePath) as pdf:
    k = len(pdf.pages)
    for i in range(1, k):
        page = pdf.pages[i]
        text = (page.extract_text())
        footer_pattern = re.search("(?s)^JOHN.*Confidential$", text, re.DOTALL)

        if footer_pattern:
            text = text.replace(footer_pattern, '')

        with open(txtFilePath, 'a') as txtFile:
            txtFile.write(text)
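One possible way to get this working is to let re.sub do the removal with both re.MULTILINE and re.DOTALL set. This is a sketch, assuming the footer always starts with "JOHN DOE | List Collection" and ends with "Proprietary & Confidential", as in the footer quoted in the first post. The names FOOTER_RE, strip_footer and pdf_to_text are hypothetical helpers, not part of pdfplumber. The non-greedy .*? is meant to swallow the page number whether it sits on the footer line or on a line of its own in between.

```python
import re

# Assumed footer layout, taken from the footer quoted in the first post.
FOOTER_RE = re.compile(
    r"^JOHN DOE \| List Collection.*?Proprietary & Confidential[ \t]*\n?",
    re.MULTILINE | re.DOTALL,
)


def strip_footer(text):
    """Return text with every footer block removed."""
    return FOOTER_RE.sub("", text)


def pdf_to_text(pdf_path, txt_path):
    """Write the footer-free text of every page to txt_path."""
    import pdfplumber  # imported here so strip_footer can be used without it

    with pdfplumber.open(pdf_path) as pdf, \
            open(txt_path, "w", encoding="utf-8") as out:
        for page in pdf.pages:
            # Guard against pages with no extractable text.
            text = page.extract_text() or ""
            out.write(strip_footer(text).rstrip() + "\n")


sample = (
    "Body line\n"
    "JOHN DOE | List Collection 11 of 550\n"
    "Proprietary & Confidential"
)
print(repr(strip_footer(sample)))
```

Iterating over pdf.pages directly includes the first page, whereas range(1, k) in the code above skips it. If skipping the first page is intended, pdf.pages[1:] would keep that behaviour.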
#4
Hi, I come across this in zillions of documents,
although I use a different OCR module.
Still, once you have
text = page.extract_text()
nothing prevents you from splitting "text" into strings with .split().
You can then use slicing to keep only the text that has value.
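A minimal sketch of the split-and-slice idea, assuming the footer is always the last two lines of the extracted page text and that its final line reads "Proprietary & Confidential". It uses splitlines() rather than a bare split() so that the words within each line stay together. drop_footer_lines is a hypothetical helper name.

```python
def drop_footer_lines(text, footer_lines=2, marker="Proprietary & Confidential"):
    """Drop the last `footer_lines` lines if the final line is the footer marker."""
    lines = text.splitlines()
    if lines and lines[-1].strip() == marker:
        keep = max(0, len(lines) - footer_lines)
        return "\n".join(lines[:keep])
    # Last line is not the footer: leave the page text untouched.
    return text


page_text = (
    "First line\n"
    "Second line\n"
    "JOHN DOE | List Collection 11 of 550\n"
    "Proprietary & Confidential"
)
print(drop_footer_lines(page_text))
```

The marker check means a page without the footer loses nothing. If the page number comes out as a line of its own, footer_lines would need to be 3 instead.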
Paul
P.S.
Using "extract_words()" may be even quicker.
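With pdfplumber, page.extract_words() returns a list of dicts that include "text", "top" and "bottom" (measured from the top of the page), so the footer can be filtered out by position instead of by content. The sketch below runs on hand-made word dicts. The 50-point footer band is purely an assumption to tune against the real PDF, and words_above_footer and words_to_text are hypothetical helper names.

```python
def words_above_footer(words, page_height, footer_height=50):
    """Keep words whose bottom edge lies above the assumed footer band."""
    cutoff = page_height - footer_height
    return [w for w in words if w["bottom"] <= cutoff]


def words_to_text(words):
    """Join word dicts into one space-separated string (line breaks are lost)."""
    return " ".join(w["text"] for w in words)


# Hand-made stand-ins for the output of page.extract_words().
fake_words = [
    {"text": "Body", "top": 100.0, "bottom": 112.0},
    {"text": "text", "top": 100.0, "bottom": 112.0},
    {"text": "JOHN", "top": 760.0, "bottom": 772.0},
    {"text": "Confidential", "top": 775.0, "bottom": 787.0},
]

kept = words_above_footer(fake_words, page_height=792.0)
print(words_to_text(kept))
```

On a real page this would be called as words_above_footer(page.extract_words(), page.height). Since page numbers sit in the same band, they would be dropped along with the rest of the footer.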
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.