Dec-12-2022, 05:20 PM
Hi,
I'm trying to take a footer out of a 550 page pdf and then extract everything left to a .txt file. The extraction is working but the footer is still there. I'm not understanding why this isn't working.
Footer:
JOHN DOE | List Collection 11 of 550
Proprietary & Confidential

    with pdfplumber.open(pdfFilePath) as pdf:
        k = len(pdf.pages)
        for i in range(1, k):
            page = pdf.pages[i]
            text = (page.extract_text())
            footer_pattern = re.search("^JOHN.*Confidential$", text)
            if footer_pattern:
                new_text = text.replace(footer_pattern, '')
            with open(txtFilePath, 'a') as txtFile:
                txtFile.write(new_text)

With this code, the new text file is created but empty. If I change it to

    if footer_pattern:
        text = text.replace(footer_pattern, '')
    with open(txtFilePath, 'a') as txtFile:
        txtFile.write(text)

the footers are still there. I've tried it in a simpler block like this and it seems to work. I think it may have to do with a line break or something, but I honestly don't know:

    txt = "JOHN DOE | List Collection Proprietary & Confidential"
    x = re.search("^JOHN.*Confidential$", txt)
    if x:
        newtext = txt.replace(txt, 'success')
        print(newtext)
    else:
        print("No match")

Also, I believe the page numbers are separate from the text in the footer and I can't seem to remove those either.
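
A few details of Python's `re` module probably explain what is described above. Without `re.MULTILINE`, `^` and `$` only anchor at the start and end of the whole page text, and `.` does not match the line break between the two footer lines, so the pattern is unlikely to match a full page. `re.search` also returns a match object rather than a string, so passing it to `text.replace` would raise a `TypeError` if it ever did match. The simpler test works because the whole string is a single line containing only the footer. Separately, `range(1, k)` skips the first page (index 0). Below is a minimal sketch of one way to handle this, not a definitive fix. It assumes the footer is extracted as two consecutive lines and that a detached page number looks like "11 of 550" on its own line. The names `strip_footer` and `pdf_to_text` are hypothetical helpers introduced for illustration.

```python
import re

# Assumption: the footer comes out as two lines, the first starting with
# "JOHN" and the second reading "Proprietary & Confidential".
FOOTER_RE = re.compile(
    r"^JOHN.*\nProprietary & Confidential[ \t]*\n?",
    re.MULTILINE,  # let ^ match at the start of every line
)

# Assumption: a page number extracted separately sits alone on a line,
# in the form "11 of 550".
PAGE_NUMBER_RE = re.compile(r"^[ \t]*\d+ of \d+[ \t]*$\n?", re.MULTILINE)


def strip_footer(text):
    """Remove the footer lines and any stand-alone page-number line."""
    text = FOOTER_RE.sub("", text)
    return PAGE_NUMBER_RE.sub("", text)


def pdf_to_text(pdf_path, txt_path):
    """Write the footer-free text of every page to txt_path."""
    import pdfplumber  # imported here so strip_footer can be tested without it

    with pdfplumber.open(pdf_path) as pdf, \
            open(txt_path, "w", encoding="utf-8") as out:
        for page in pdf.pages:  # iterates from the first page (index 0)
            text = page.extract_text() or ""  # guard against empty pages
            out.write(strip_footer(text).rstrip("\n") + "\n")


sample = (
    "First line of body text\n"
    "Second line of body text\n"
    "JOHN DOE | List Collection 11 of 550\n"
    "Proprietary & Confidential"
)
print(repr(strip_footer(sample)))
```

Printing `repr(page.extract_text())` for one page is a quick way to check where the line breaks and page numbers actually fall before settling on the patterns. `re.sub` is used instead of `str.replace` because it accepts the pattern directly and removes every occurrence.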