How to remove footer from PDF when extracting to text

jh67 · (This post was last modified: Dec-12-2022, 05:20 PM by jh67.)

Hi,
I'm trying to remove a footer from a 550-page PDF and then extract the remaining text to a .txt file. The extraction works, but the footer is still there, and I don't understand why.

Footer:

JOHN DOE | List Collection 11 of 550
Proprietary & Confidential

with pdfplumber.open(pdfFilePath) as pdf:
    k = len(pdf.pages)
    for i in range(1, k):
        page = pdf.pages[i]
        text = (page.extract_text())
        footer_pattern = re.search("^JOHN.*Confidential$", text)
        if footer_pattern:
            new_text = text.replace(footer_pattern, '')
            with open(txtFilePath, 'a') as txtFile:
                txtFile.write(new_text)

With this code, the new text file is created but empty. If I change it to

        if footer_pattern:
            text = text.replace(footer_pattern, '')
            with open(txtFilePath, 'a') as txtFile:
                txtFile.write(text)

the footers are still there. I've tried it in a simpler block like this and it seems to work. I think it may have to do with a line break, but I honestly don't know:

txt = "JOHN DOE | List Collection Proprietary & Confidential"
x = re.search("^JOHN.*Confidential$", txt)

if x:
  newtext = txt.replace(txt, 'success')
  print(newtext)
else:
  print("No match")

Also, I believe the page numbers are separate from the text in the footer and I can't seem to remove those either.

**Gribouillis** · Dec-12-2022, 05:59 PM

Try with the re.DOTALL flag

footer_pattern = re.search("(?s)^JOHN.*Confidential$", text)

jh67 · (This post was last modified: Dec-12-2022, 06:27 PM by jh67.)

(Dec-12-2022, 05:59 PM)Gribouillis Wrote: Try with the re.DOTALL flag
footer_pattern = re.search("(?s)^JOHN.*Confidential$", text)

Just added the (?s) and re.DOTALL and got the same result. Footer is still in the text file. Does this look correct for that flag?

with pdfplumber.open(pdfFilePath) as pdf:
    k = len(pdf.pages)
    for i in range(1, k):
        page = pdf.pages[i]
        text = (page.extract_text())
        footer_pattern = re.search("(?s)^JOHN.*Confidential$", text, re.DOTALL)

        if footer_pattern:
            text = text.replace(footer_pattern, '')

        with open(txtFilePath, 'a') as txtFile:
            txtFile.write(text)
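
A possible explanation for why neither variant removes the footer: without re.MULTILINE, ^ and $ anchor only to the start and end of the whole page text, so a footer at the bottom of a page is unlikely to match; and str.replace() expects a string, not the match object that re.search() returns. Below is a minimal sketch using re.sub() instead. It assumes the footer is extracted as the two lines shown in the first post, with the "11 of 550" counter on the first footer line; strip_footer is a hypothetical helper name, not part of pdfplumber.

```python
import re

# Assumed footer layout (from the footer quoted in the first post):
#   JOHN DOE | List Collection 11 of 550
#   Proprietary & Confidential
# re.MULTILINE lets ^ match at the start of every line in the page text.
FOOTER_RE = re.compile(
    r"^JOHN DOE \| List Collection.*\n?"            # first footer line, incl. page counter
    r"(?:^Proprietary & Confidential[ \t]*\n?)?",   # optional second footer line
    re.MULTILINE,
)


def strip_footer(text):
    """Return the page text with the footer lines removed (hypothetical helper)."""
    return FOOTER_RE.sub("", text)


sample = (
    "Some body text\n"
    "More body text\n"
    "JOHN DOE | List Collection 11 of 550\n"
    "Proprietary & Confidential"
)
print(strip_footer(sample))
```

Inside the loop this would be called as text = strip_footer(page.extract_text() or ""), which also guards against a page that yields no text. Note too that range(1, k) starts at the second page, since pdf.pages[0] is the first; if the first page should be included, range(k) or iterating over pdf.pages directly would cover it.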

DPaul · Dec-13-2022, 06:52 AM

Hi, I come across this in zillions of documents,
but I use another OCR module.
Still, if you do
text = page.extract_text()
nothing prevents you from splitting "text" into a list of strings (.split()).
Then you can use slicing to keep only the text that has value.
Paul
P.S.
Using "extract_words()" may be even quicker.
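
The split-and-slice suggestion above might look like the sketch below. It assumes the footer always occupies the last two lines of each page's extracted text, which should be checked against the real output first. drop_trailing_lines and its footer_lines parameter are hypothetical names. splitlines() is used here rather than a bare split(), because split() with no argument breaks on every run of whitespace, not only on line breaks.

```python
def drop_trailing_lines(text, footer_lines=2):
    """Remove the last `footer_lines` lines from a page's text.

    Assumes the footer is always at the very end of the page text.
    """
    lines = text.splitlines()
    if footer_lines <= 0:
        # lines[:-0] would return an empty list, so handle this case explicitly
        return "\n".join(lines)
    return "\n".join(lines[:-footer_lines])


page_text = (
    "Some body text\n"
    "More body text\n"
    "JOHN DOE | List Collection 11 of 550\n"
    "Proprietary & Confidential"
)
print(drop_trailing_lines(page_text))
```

This avoids regular expressions entirely, at the cost of silently cutting real content on any page where the footer is missing or shorter than expected.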
