Jun-13-2023, 12:39 PM
I am trying to write code that splits a PDF into multiple files based on a regex match in the PDF text. Specifically, I want to split this PDF into discrete PDFs that each represent one patient visit. In my test PDF, the office visit date is styled "Visit: ##/##/####". As the code iterates through the pages, I only want a split where that office visit date changes, and I want each newly created PDF file to be named with the date of the visit. Here is my code and my errors:
import re
from PyPDF2 import PdfReader, PdfWriter

def split_pdf_by_date(pdf_path, regex_pattern):
    # Open the PDF file
    pdf = PdfReader(pdf_path)

    # Initialize variables
    current_date = None
    output = None

    # Iterate through each page in the PDF
    for page_num in range(len(pdf.pages)):
        # Extract the text from the current page
        page = pdf.pages[page_num]
        text = page.extract_text()

        # Find the date in the text using regex
        date_match = re.search(regex_pattern, text)
        if date_match:
            # Get the date value
            date = date_match.group()
            if current_date is None or date != current_date:
                # Start a new output PDF if the date has changed
                if output:
                    output_path = f"output_{current_date}.pdf"
                    with open(output_path, "wb") as output_file:
                        output.write(output_file)
                # Update the current date and create a new PDF writer
                current_date = date
                output = PdfWriter()

        if output:
            # Add the current page to the output PDF
            output.add_page(page)

    # Save the last output PDF
    if output:
        output_path = f"output_{current_date}.pdf"
        with open(output_path, "wb") as output_file:
            output.write(output_file)
        print("PDF split completed successfully.")
        print(output_path)  # Print the output path

# Example usage
pdf_path = "Test.pdf"
date_regex = r"Visit: \d{2}/\d{2}/\d{4}"
split_pdf_by_date(pdf_path, date_regex)
Error:
unknown widths :
[0, IndirectObject(3121, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3115, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3110, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3104, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3099, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3051, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3034, 0, 2157813271952)]
Traceback (most recent call last):
  File "c:\Users\stand\venv\import PyPDF2.py", line 54, in <module>
    split_pdf_by_date(pdf_path, date_regex)
  File "c:\Users\stand\venv\import PyPDF2.py", line 30, in split_pdf_by_date
    with open(output_path, "wb") as output_file:
         ^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument: 'output_Visit: 03/23/2023.pdf'
I can see that the first office visit in the target PDF, 3/23/2023, gets found, but that is about it!
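For what it is worth, the OSError points at the file name rather than the regex. 'output_Visit: 03/23/2023.pdf' contains ":" and "/", which Windows does not allow in file names, so the first write fails and no later visit is ever reached. The "unknown widths" lines look like warnings from text extraction rather than the cause of the crash, though that is an assumption. Below is a minimal sketch of one way to build a safe name from the match. The helper name visit_filename, the "visit_" prefix, and the month/day/year ordering are assumptions for illustration, not part of the original script.

```python
import re

# Assumed pattern: capture groups isolate month, day, and year so the
# "Visit: " label and the slashes never end up in the file name.
VISIT_PATTERN = re.compile(r"Visit:\s*(\d{2})/(\d{2})/(\d{4})")


def visit_filename(page_text, prefix="visit_"):
    """Return a file name such as 'visit_2023-03-23.pdf', or None if no date is found.

    Hypothetical helper: assumes the date is written MM/DD/YYYY.
    """
    # extract_text() can return an empty string for image-only pages,
    # so fall back to "" before searching.
    match = VISIT_PATTERN.search(page_text or "")
    if match is None:
        return None
    month, day, year = match.groups()
    # Year-first ordering also makes the files sort chronologically.
    return f"{prefix}{year}-{month}-{day}.pdf"


if __name__ == "__main__":
    print(visit_filename("Patient summary\nVisit: 03/23/2023\nNotes"))
```

In the script above, the two places that build output_path could call a helper like this on the page text (or on the stored date) instead of interpolating the raw match into the f-string.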