Python Forum
Split pdf in pypdf based upon file regex



Split pdf in pypdf based upon file regex - standenman - Feb-01-2023

I am trying to split a PDF of medical records by date of treatment. In this PDF, a line of the form "Visit Date: ##/##/####" marks the beginning of one or more pages of notes for that date. I want to split the PDF into separate PDFs, one for each treatment date. The code below runs and gives me terminal output consisting of a series of lines that either say "You Failed" or look like this:

[0, IndirectObject(612, 0, 2464980264080)]
unknown widths :

I cannot find any PDF files that were created. What am I doing wrong?
import re
import pypdf

# Open the PDF file
pdf_file = pypdf.PdfReader(open("Documents/VisitDate.pdf", "rb"))

# Define the regex pattern
pattern = re.compile("Visit Date: ^[0-9]{1,2}\\/[0-9]{1,2}\\/[0-9]{4}$")

# Loop through each page of the PDF
for i in range(len(pdf_file.pages)):
  page = pdf_file.pages[i]
  text = page.extract_text()

  # Check if the regex value is in the page text
  if pattern.search(text):
    # If the regex value is found, create a new PDF file
    output_pdf = pypdf.PdfFileWriter()
    output_pdf.addPage(page)
    with open("output_{}.pdf".format(i), "wb") as output_file:
      output_pdf.write(output_file)
  else:
    print("You Failed")
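
For comparison, here is a minimal sketch of one way the loop above could group pages by visit date instead of writing one file per matching page. Several things in it are assumptions rather than anything confirmed in this thread: the pattern drops the "^" and "$" anchors, a page without a "Visit Date:" line is treated as a continuation of the previous dated page, and the names group_pages_by_visit_date, split_pdf_by_visit_date and the "visit_<date>.pdf" file names are made up for illustration. It also uses PdfWriter and add_page, since as far as I know pypdf 3.0 and later no longer accept the older PdfFileWriter and addPage names.

```python
import re

# Assumed pattern: no "^"/"$" anchors, date captured in group 1.
VISIT_DATE = re.compile(r"Visit Date:\s*([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})")


def group_pages_by_visit_date(page_texts):
    """Map each visit date to the indices of the pages that belong to it.

    A page without a "Visit Date:" line is treated as a continuation of
    the most recent dated page (an assumption about the records layout).
    Pages that come before the first dated page are skipped.
    """
    groups = {}
    current = None
    for index, text in enumerate(page_texts):
        match = VISIT_DATE.search(text or "")
        if match:
            current = match.group(1)
        if current is not None:
            groups.setdefault(current, []).append(index)
    return groups


def split_pdf_by_visit_date(source_path):
    """Write one PDF per visit date into the current working directory."""
    import pypdf  # imported here so the grouping helper works without it

    reader = pypdf.PdfReader(source_path)
    texts = [page.extract_text() for page in reader.pages]
    for date, indices in group_pages_by_visit_date(texts).items():
        writer = pypdf.PdfWriter()
        for index in indices:
            writer.add_page(reader.pages[index])
        # "/" cannot appear in a file name, so swap it for "-".
        with open("visit_{}.pdf".format(date.replace("/", "-")), "wb") as out:
            writer.write(out)
```

Calling split_pdf_by_visit_date("Documents/VisitDate.pdf") would then be the equivalent of the original script. Keeping the grouping separate from the pypdf calls makes it possible to check the date logic on plain strings before touching the real records.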



RE: Split pdf in pypdf based upon file regex - SpongeB0B - Feb-03-2023

Are you sure that your regex is correct?

When I tried it, it did not match a date like 12/12/1984.

Maybe you want to try with one less "\":
^[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4}$
Cheers
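
A quick way to check the patterns discussed in this thread is to run them against a made-up string standing in for page.extract_text(). The sketch below is only an illustration: the sample text is invented, and the "relaxed" variant is an assumption about what the original poster wants, not something confirmed above. It suggests that the "^" and "$" anchors, rather than the number of backslashes, are what keep the posted pattern from matching, because "^" can only match at the start of the string and here it comes after "Visit Date: ".

```python
import re

# Pattern as posted in the question: "^" and "$" sit mid-expression.
posted = re.compile("Visit Date: ^[0-9]{1,2}\\/[0-9]{1,2}\\/[0-9]{4}$")

# Pattern from the reply: matches only when the whole string is a date.
bare = re.compile(r"^[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4}$")

# Assumed variant without anchors; group 1 keeps just the date.
relaxed = re.compile(r"Visit Date:\s*([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})")

# Made-up page text, standing in for page.extract_text().
sample = "Patient: Jane Doe\nVisit Date: 12/12/1984\nNotes: follow-up"

posted_match = posted.search(sample)
bare_on_date = bare.search("12/12/1984")
bare_on_page = bare.search(sample)
relaxed_match = relaxed.search(sample)

print(posted_match)            # None
print(bare_on_date.group(0))   # 12/12/1984
print(bare_on_page)            # None
print(relaxed_match.group(1))  # 12/12/1984
```

If the real records put the label and the date on separate lines, or use extra spacing, the relaxed pattern would need adjusting, so it is worth printing a page of extracted text first.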