Split PDF at regex value

standenman · Jun-13-2023, 06:00 PM

Interesting! Thanks so much for your help and feedback. I just found that the first snippet you gave me, the one that only checks whether I am getting dates, fails on one PDF but works on another. That makes me question whether PyPDF2 is the right approach here. Maybe it is not a "scalable" solution?

(Jun-13-2023, 04:58 PM)deanhystad Wrote: Divide and conquer. I would first work on the logic that finds all the dates in the PDF and just print them to the screen. Something like this:

import re
from PyPDF2 import PdfReader

date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def split_pdf_by_date(pdf_path):
    pdf = PdfReader(pdf_path)
    current_date = None

    # Iterate through each page in the PDF.  Print all the
    # dates included in the PDF.
    for pagenum, page in enumerate(pdf.pages):
        text = page.extract_text()
        date_match = re.search(date_regex, text)
        if date_match:
            new_date = date_match.group(1).replace("/", "_")
            if new_date != current_date:
                print(pagenum, new_date)
                current_date = new_date


split_pdf_by_date("Test.pdf")

Step through the PDF and verify that all different dates are printed. If this doesn't work, learn why. For example, if there are multiple dates on one page it will only print one.
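To check the multiple-dates-on-one-page case, a minimal sketch along these lines might help. It runs on plain strings rather than a real PDF, so the regex can be tested separately from the text extraction. The function name dates_per_page and the sample strings are made up for illustration; the pattern is the same "Visit:" pattern used above.

```python
import re

# Same pattern as above: captures the date that follows "Visit: ".
date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def dates_per_page(page_texts):
    """Return a list of (page number, [dates]) for every page.

    page_texts is a list of strings, one per page, for example
    the result of calling page.extract_text() on each page.
    """
    results = []
    for pagenum, text in enumerate(page_texts):
        # findall returns every captured date, not just the first match.
        # "text or ''" guards against a page that yields no text at all.
        results.append((pagenum, date_regex.findall(text or "")))
    return results


# Hypothetical sample text standing in for extracted pages.
sample_pages = [
    "Cover sheet, no visit information",
    "Visit: 01/02/2023 notes... Visit: 01/05/2023 more notes",
    "Visit: 01/05/2023 continued",
]
for pagenum, dates in dates_per_page(sample_pages):
    print(pagenum, dates)
```

If a page that visibly contains a date shows an empty list here, the problem is probably in the extracted text (extra spaces, line breaks, a different label) rather than in the splitting logic, so printing the raw text of that page is a sensible next step.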

Once that is working, you can work on writing the new PDF files. I think your logic for that looks good. Two things to keep in mind: pages at the start of the document are missed until a page with a date is found, and the logic assumes a date applies to the entire page. I tested this on my company's code of conduct PDF.

import re
from pathlib import Path
from PyPDF2 import PdfReader, PdfWriter

date_regex = re.compile(r"Code of Conduct  /  \d+")


def split_pdf_by_date(input_file, output_path):
    reader = PdfReader(input_file)
    writer = PdfWriter()
    current_date = None
    # Make sure the output folder exists before writing to it.
    output_path.mkdir(parents=True, exist_ok=True)

    def write_pages():
        """Write the pages cached in writer to a file."""
        if current_date is None:
            filename = "introduction.pdf"
        else:
            filename = current_date.replace("/", "_").replace(":", "") + ".pdf"
        with open(output_path / filename, "wb") as output_file:
            writer.write(output_file)

    # Iterate through each page in the PDF.  Collect pages
    # in writer.  When date changes, write cached pages to
    # a file named after the current date.
    for page in reader.pages:
        text = page.extract_text()
        if date_match := re.search(date_regex, text):
            new_date = date_match.group()
            if new_date != current_date:
                if len(writer.pages) > 0:
                    write_pages()
                    writer = PdfWriter()  # No way to flush pages from writer
                current_date = new_date
        writer.add_page(page)

    # Write last date
    if len(writer.pages) > 0:
        write_pages()


split_pdf_by_date("Test.pdf", Path(__file__).parent / "output files")

I had to change the regex pattern and I modified how the files are named a little, but it worked great.
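The page-grouping logic can also be tried without touching any PDF. The sketch below is only an illustration under a few assumptions: it works on a list of already extracted page texts, it uses the "Visit:" pattern from the first snippet, and group_pages_by_date and the sample strings are hypothetical names and data, not part of PyPDF2.

```python
import re

# Pattern from the first snippet: captures the date after "Visit: ".
date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def group_pages_by_date(page_texts):
    """Return a list of (date or None, [page numbers]) groups.

    A new group starts whenever a page carries a date that differs
    from the current one.  Pages before the first date are grouped
    under None, and pages without a date stay in the current group.
    """
    groups = []
    current_date = None
    current_pages = []
    for pagenum, text in enumerate(page_texts):
        match = date_regex.search(text or "")
        if match and match.group(1) != current_date:
            # Date changed: close the previous group if it has pages.
            if current_pages:
                groups.append((current_date, current_pages))
            current_date = match.group(1)
            current_pages = []
        current_pages.append(pagenum)
    # Close the final group.
    if current_pages:
        groups.append((current_date, current_pages))
    return groups


# Hypothetical sample text standing in for extracted pages.
sample_pages = [
    "Intro page",
    "Visit: 01/02/2023 first visit",
    "continuation page with no date",
    "Visit: 01/05/2023 second visit",
    "Visit: 01/05/2023 more notes",
]
for date, pages in group_pages_by_date(sample_pages):
    print(date, pages)
```

Each group corresponds to one output file in the splitting code above. One thing this makes easy to spot: if the same date shows up again later in the document, it produces a second group with the same date, and since the file name is built from the date, the later file would overwrite the earlier one.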
