Split PDF at regex value

standenman · Jun-13-2023, 12:39 PM

I am trying to write code that splits a PDF into multiple files based on a regex match in the PDF text. Specifically, I want to split this PDF into discrete PDFs that each represent one patient visit. In my test PDF, the office visit date is styled "Visit: ##/##/####". As the code iterates through the pages, I only want a split where that office visit date changes, and I want each newly created PDF file to be named with the date of the visit. Here is my code and my errors:

import re
from PyPDF2 import PdfReader, PdfWriter

def split_pdf_by_date(pdf_path, regex_pattern):
    # Open the PDF file
    pdf = PdfReader(pdf_path)

    # Initialize variables
    current_date = None
    output = None

    # Iterate through each page in the PDF
    for page_num in range(len(pdf.pages)):
        # Extract the text from the current page
        page = pdf.pages[page_num]
        text = page.extract_text()

        # Find the date in the text using regex
        date_match = re.search(regex_pattern, text)

        if date_match:
            # Get the date value
            date = date_match.group()

            if current_date is None or date != current_date:
                # Start a new output PDF if the date has changed
                if output:
                    output_path = f"output_{current_date}.pdf"
                    with open(output_path, "wb") as output_file:
                        output.write(output_file)

                # Update the current date and create a new PDF writer
                current_date = date
                output = PdfWriter()

        if output:
            # Add the current page to the output PDF
            output.add_page(page)

    # Save the last output PDF
    if output:
        output_path = f"output_{current_date}.pdf"
        with open(output_path, "wb") as output_file:
            output.write(output_file)

        print("PDF split completed successfully.")
        print(output_path)  # Print the output path

# Example usage
pdf_path = "Test.pdf"
date_regex = r"Visit: \d{2}/\d{2}/\d{4}"

split_pdf_by_date(pdf_path, date_regex)

Error:
unknown widths :
[0, IndirectObject(3121, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3115, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3110, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3104, 0, 2157813271952)]
unknown widths :
[0, IndirectObject(3099, 0, 2157813271952)]
unknown widths : 
[0, IndirectObject(3051, 0, 2157813271952)]
unknown widths : 
[0, IndirectObject(3034, 0, 2157813271952)]
Traceback (most recent call last):
  File "c:\Users\stand\venv\import PyPDF2.py", line 54, in <module>
    split_pdf_by_date(pdf_path, date_regex)
  File "c:\Users\stand\venv\import PyPDF2.py", line 30, in split_pdf_by_date
    with open(output_path, "wb") as output_file:
         ^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument: 'output_Visit: 03/23/2023.pdf'

I can see that the first office visit in the target PDF, 3/23/2023, gets found, but that is about it!

**deanhystad** · (This post was last modified: Jun-13-2023, 01:42 PM by deanhystad.)

Your problem is that the filename for your new file is invalid. Your regex splitting probably works fine.

'output_Visit: 03/23/2023.pdf' is not a valid filename. You cannot have "/" in a filename. The colon is also a bad choice. On Windows, creating a file named "output_Visit: some date.pdf" results in a file named "output_Visit".

You need to process the date, maybe changing "/" to "_", and removing the colon and any spaces.
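
As a minimal sketch of that cleanup, the helper below collapses every run of characters other than letters, digits, "-" and "_" into a single underscore. The name `safe_filename` and the exact replacement rule are assumptions made for illustration, not part of the original code.

```python
import re


def safe_filename(match_text: str) -> str:
    """Turn text such as 'Visit: 03/23/2023' into a filename-safe stem.

    Hypothetical helper: the name and the replacement rule are assumptions.
    """
    # Replace each run of characters that are not letters, digits,
    # '-' or '_' with a single underscore.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", match_text)
    # Drop underscores left over at either end.
    return stem.strip("_")


print(safe_filename("Visit: 03/23/2023"))  # Visit_03_23_2023
```

The result can then be used as `f"output_{safe_filename(date)}.pdf"` in place of the raw match text.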

standenman · Jun-13-2023, 02:41 PM

OK. Thanks very much for your help. After eliminating the "/" from the file name, the code runs, but it makes only one new file. This target PDF has 4 or 5 office visits, that is, changes in the value of "Visit:". And the split did not occur at the first change in the regex match. It just seemed kind of random.

I am missing something here. It is as if the code is not iterating through to make X new PDFs based on X changes in the regex match.

**deanhystad** · (This post was last modified: Jun-13-2023, 04:58 PM by deanhystad.)

Divide and conquer. I would first work on the logic that finds all the dates in the PDF and just print them to the screen. Something like this:

import re
from PyPDF2 import PdfReader

date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def split_pdf_by_date(pdf_path):
    pdf = PdfReader(pdf_path)
    current_date = None

    # Iterate through each page in the PDF.  Print all the
    # dates included in the PDF.
    for pagenum, page in enumerate(pdf.pages):
        text = page.extract_text()
        date_match = re.search(date_regex, text)
        if date_match:
            new_date = date_match.group(1).replace("/", "_")
            if new_date != current_date:
                print(pagenum, new_date)
                current_date = new_date


split_pdf_by_date("Test.pdf")

Step through the PDF and verify that all different dates are printed. If this doesn't work, learn why. For example, if there are multiple dates on one page it will only print one.
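
To illustrate the multiple-dates case, here is a small sketch that compares `search` (first match only) with `findall` (every match) on one page's text. The sample string is made up for the example; real text would come from `page.extract_text()`.

```python
import re

date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")

# Made-up page text for illustration only.
sample_text = "Visit: 03/23/2023 ... follow-up notes ... Visit: 12/29/2022"

# search() stops at the first match on the page.
first_only = date_regex.search(sample_text).group(1)
# findall() returns the captured group of every match on the page.
all_dates = date_regex.findall(sample_text)

print(first_only)  # 03/23/2023
print(all_dates)   # ['03/23/2023', '12/29/2022']
```

If `findall` returns more than one distinct date for a page, that page straddles two visits and a page-level split cannot separate them cleanly.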

Once that is working, you can work on writing the new PDF files. I think your logic for that looks good. You will miss pages at the start of the document until you find a page that has a date. The logic also assumes the date applies to the entire page. I tested this on my company's code of conduct PDF.

import re
from pathlib import Path
from PyPDF2 import PdfReader, PdfWriter

date_regex = re.compile(r"Code of Conduct  /  \d+")


def split_pdf_by_date(input_file, output_path):
    reader = PdfReader(input_file)
    writer = PdfWriter()
    current_date = None
    # Create the output folder if it does not exist yet.
    output_path.mkdir(parents=True, exist_ok=True)

    def write_pages():
        """Write PDFWriter pages to a file."""
        if current_date is None:
            filename = "introduction.pdf"
        else:
            filename = current_date.replace("/", "_").replace(":", "") + ".pdf"
        with open(output_path / filename, "wb") as output_file:
            writer.write(output_file)

    # Iterate through each page in the PDF.  Collect pages
    # in writer.  When date changes, write cached pages to
    # a file named after the current date.
    for page in reader.pages:
        text = page.extract_text()
        if date_match := re.search(date_regex, text):
            new_date = date_match.group()
            if new_date != current_date:
                # len(writer.pages) replaces the deprecated getNumPages().
                if len(writer.pages) > 0:
                    write_pages()
                    writer = PdfWriter()  # No way to flush pages from writer
                current_date = new_date
        writer.add_page(page)

    # Write last date
    if len(writer.pages) > 0:
        write_pages()


split_pdf_by_date("Test.pdf", Path(__file__).parent / "output files")

I had to change the regex pattern and I modified how the files are named a little, but it worked great.
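
The grouping logic can also be checked without any PDF at all. The sketch below applies the same "start a new group when the date changes" rule to a plain list of page texts. The function name `group_pages_by_date` and the sample strings are assumptions for illustration, and it uses the "Visit:" pattern rather than my code-of-conduct pattern.

```python
import re

date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def group_pages_by_date(page_texts):
    """Return a list of (date_or_None, [page_indexes]) in document order.

    Hypothetical helper: pages before the first date go into a None group,
    and a page with no date stays with the group before it.
    """
    groups = []
    current_date = None
    for index, text in enumerate(page_texts):
        match = date_regex.search(text or "")
        if match and match.group(1) != current_date:
            # The date changed, so start a new group.
            current_date = match.group(1)
            groups.append((current_date, []))
        if not groups:
            # No date seen yet: collect leading pages separately.
            groups.append((None, []))
        groups[-1][1].append(index)
    return groups


# Made-up page texts for illustration only.
pages = ["cover sheet", "Visit: 03/23/2023 a", "no date here", "Visit: 12/29/2022 b"]
print(group_pages_by_date(pages))
# [(None, [0]), ('03/23/2023', [1, 2]), ('12/29/2022', [3])]
```

Once the groups look right, each one maps to a single writer and a single output file.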

standenman · Jun-13-2023, 06:00 PM

Interesting! Thanks so much for your help and feedback. I just found that the first set of code you gave me, the one that only checks whether I am getting dates, fails on one PDF but works on another, which leads me to question this approach using PyPDF2. That is, maybe it is not a "scalable" solution?

**deanhystad** · Jun-13-2023, 06:14 PM

standenman Wrote: I just found that the first set of code you gave me just to see if I am getting dates fails on one pdf, but works on another, leading me to question this approach using pypdf2.

Why did it fail?

standenman · Jun-13-2023, 07:03 PM

Gives me this stuff:

Error:
[0, IndirectObject(3121, 0, 2465755860368)]
unknown widths :
[0, IndirectObject(3115, 0, 2465755860368)]
unknown widths :
[0, IndirectObject(3110, 0, 2465755860368)]
unknown widths :
[0, IndirectObject(3104, 0, 2465755860368)]
unknown widths :
[0, IndirectObject(3099, 0, 2465755860368)]
unknown widths : 
[0, IndirectObject(3051, 0, 2465755860368)]
unknown widths : 
[0, IndirectObject(3034, 0, 2465755860368)]

**deanhystad** · Jun-13-2023, 07:14 PM

What "gives you this stuff"?

Those are not error messages. Is your program printing something, or is this a message from PyPDF2? When are they printed? Are they output when you try to extract text from a page?

I would try something like this to diagnose.

import re
from PyPDF2 import PdfReader

date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def split_pdf_by_date(pdf_path):
    pdf = PdfReader(pdf_path)
    for pagenum, page in enumerate(pdf.pages, start=1):
        print("Page", pagenum)
        text = page.extract_text()
        print("Search")
        print(re.search(date_regex, text), "\n")


split_pdf_by_date("Test.pdf")
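
If the messages turn out to come from PyPDF2 during text extraction and the matches are still found, they could be quieted rather than treated as failures. The sketch below assumes the library emits them through Python's standard `logging` module under a logger named "PyPDF2". That is an assumption to verify against the installed version, not something confirmed in this thread.

```python
import logging

# Assumption: PyPDF2 reports these messages through loggers whose names
# start with "PyPDF2". Raising the level hides warnings, keeps errors.
logging.getLogger("PyPDF2").setLevel(logging.ERROR)
```

Put this before the `PdfReader` call. If the messages still appear, they are not going through that logger and this has no effect.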

standenman · Jun-13-2023, 07:16 PM

OK. I will try it.

standenman · Jun-13-2023, 09:37 PM

Here's my output:

Error:
Page 1
unknown widths :
[0, IndirectObject(3121, 0, 2905784995472)]
unknown widths :
[0, IndirectObject(3115, 0, 2905784995472)]
unknown widths :
[0, IndirectObject(3110, 0, 2905784995472)]
unknown widths :
[0, IndirectObject(3104, 0, 2905784995472)]
unknown widths :
[0, IndirectObject(3099, 0, 2905784995472)]
Search
None

Page 2
Search
<re.Match object; span=(193, 210), match='Visit: 03/23/2023'>

Page 3
Search
<re.Match object; span=(221, 238), match='Visit: 03/23/2023'>

Page 4
Search
<re.Match object; span=(228, 245), match='Visit: 03/23/2023'>

Page 5
unknown widths :
[0, IndirectObject(3051, 0, 2905784995472)]
Search
<re.Match object; span=(452, 469), match='Visit: 03/23/2023'>

Page 6
unknown widths : 
[0, IndirectObject(3034, 0, 2905784995472)]
Search
<re.Match object; span=(193, 210), match='Visit: 03/23/2023'>

Page 7
Search
<re.Match object; span=(193, 210), match='Visit: 03/23/2023'>

Page 8
Search
<re.Match object; span=(193, 210), match='Visit: 03/23/2023'> 

Page 9
Search
<re.Match object; span=(193, 210), match='Visit: 03/23/2023'>

Page 10
Search
<re.Match object; span=(194, 211), match='Visit: 12/29/2022'>

Page 11
Search
<re.Match object; span=(222, 239), match='Visit: 12/29/2022'>

Page 12
Search
<re.Match object; span=(229, 246), match='Visit: 12/29/2022'>

Page 13
Search
<re.Match object; span=(453, 470), match='Visit: 12/29/2022'>

Page 14
Search
<re.Match object; span=(194, 211), match='Visit: 12/29/2022'>

Page 15
Search
<re.Match object; span=(194, 211), match='Visit: 12/29/2022'>

Page 16
Search
<re.Match object; span=(194, 211), match='Visit: 12/29/2022'>

Page 17
unknown widths : 
[0, IndirectObject(2858, 0, 2905784995472)]
Search
<re.Match object; span=(224, 241), match='Visit: 12/29/2022'>

Page 18
Search
<re.Match object; span=(194, 211), match='Visit: 11/15/2022'>

Page 19
unknown widths :
[0, IndirectObject(2826, 0, 2905784995472)]
Search
<re.Match object; span=(222, 239), match='Visit: 11/15/2022'>

Page 20
Search
<re.Match object; span=(229, 246), match='Visit: 11/15/2022'>

Page 21
Search
<re.Match object; span=(453, 470), match='Visit: 11/15/2022'>

Page 22
Search
<re.Match object; span=(194, 211), match='Visit: 11/15/2022'>

Page 23
Search
<re.Match object; span=(194, 211), match='Visit: 11/15/2022'>

Page 24
unknown widths :
[0, IndirectObject(2765, 0, 2905784995472)]
unknown widths : 
[0, IndirectObject(2756, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 11/15/2022'>

Page 25
Search
<re.Match object; span=(223, 240), match='Visit: 11/15/2022'>

Page 26
Search
<re.Match object; span=(194, 211), match='Visit: 09/20/2022'>

Page 27
Search
<re.Match object; span=(222, 239), match='Visit: 09/20/2022'>

Page 28
Search
<re.Match object; span=(229, 246), match='Visit: 09/20/2022'>

Page 29
Search
<re.Match object; span=(453, 470), match='Visit: 09/20/2022'>

Page 30
Search
<re.Match object; span=(194, 211), match='Visit: 09/20/2022'>

Page 31
Search
<re.Match object; span=(194, 211), match='Visit: 09/20/2022'>

Page 32
Search
<re.Match object; span=(194, 211), match='Visit: 09/20/2022'>

Page 33
Search
<re.Match object; span=(224, 241), match='Visit: 09/20/2022'>

Page 34
Search
<re.Match object; span=(297, 314), match='Visit: 08/17/2022'>

Page 35
Search
<re.Match object; span=(291, 308), match='Visit: 08/17/2022'>

Page 36
Search
<re.Match object; span=(194, 211), match='Visit: 08/17/2022'>

Page 37
Search
<re.Match object; span=(572, 589), match='Visit: 08/17/2022'>

Page 38
unknown widths :
[0, IndirectObject(2576, 0, 2905784995472)]
unknown widths :
[0, IndirectObject(2565, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 08/17/2022'>

Page 39
Search
<re.Match object; span=(194, 211), match='Visit: 08/17/2022'>

Page 40
Search
<re.Match object; span=(194, 211), match='Visit: 08/17/2022'>

Page 41
unknown widths : 
[0, IndirectObject(2514, 0, 2905784995472)]
unknown widths :
[0, IndirectObject(2509, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 08/17/2022'>

Page 42
Search
<re.Match object; span=(289, 306), match='Visit: 07/20/2022'>

Page 43
Search
<re.Match object; span=(250, 267), match='Visit: 07/20/2022'>

Page 44
Search
<re.Match object; span=(194, 211), match='Visit: 07/20/2022'>

Page 45
Search
<re.Match object; span=(572, 589), match='Visit: 07/20/2022'>

Page 46
Search
<re.Match object; span=(194, 211), match='Visit: 07/20/2022'>

Page 47
unknown widths :
[0, IndirectObject(2428, 0, 2905784995472)]
unknown widths :
[0, IndirectObject(2420, 0, 2905784995472)]
unknown widths :
[0, IndirectObject(2415, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 07/20/2022'>

Page 48
Search
<re.Match object; span=(194, 211), match='Visit: 07/20/2022'>

Page 49
Search
<re.Match object; span=(194, 211), match='Visit: 07/20/2022'>

Page 50
Search
<re.Match object; span=(289, 306), match='Visit: 06/22/2022'>

Page 51
Search
<re.Match object; span=(250, 267), match='Visit: 06/22/2022'>

Page 52
Search
<re.Match object; span=(194, 211), match='Visit: 06/22/2022'>

Page 53
Search
<re.Match object; span=(560, 577), match='Visit: 06/22/2022'>

Page 54
Search
<re.Match object; span=(194, 211), match='Visit: 06/22/2022'>

Page 55
unknown widths :
[0, IndirectObject(2292, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 06/22/2022'>

Page 56
Search
<re.Match object; span=(194, 211), match='Visit: 06/22/2022'>

Page 57
Search
<re.Match object; span=(194, 211), match='Visit: 06/22/2022'> 

Page 58
Search
<re.Match object; span=(251, 268), match='Visit: 05/18/2022'>

Page 59
Search
<re.Match object; span=(235, 252), match='Visit: 05/18/2022'>

Page 60
Search
<re.Match object; span=(194, 211), match='Visit: 05/18/2022'>

Page 61
Search
<re.Match object; span=(584, 601), match='Visit: 05/18/2022'>

Page 62
Search
<re.Match object; span=(251, 268), match='Visit: 05/18/2022'>

Page 63
unknown widths :
[0, IndirectObject(2173, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 05/18/2022'>

Page 64
Search
<re.Match object; span=(194, 211), match='Visit: 05/18/2022'>

Page 65
unknown widths : 
[0, IndirectObject(2124, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 05/18/2022'>

Page 66
unknown widths :
[0, IndirectObject(2103, 0, 2905784995472)]
Search
<re.Match object; span=(290, 307), match='Visit: 04/19/2022'>

Page 67
unknown widths : 
[0, IndirectObject(2086, 0, 2905784995472)]
Search
<re.Match object; span=(313, 330), match='Visit: 04/19/2022'>

Page 68
Search
<re.Match object; span=(233, 250), match='Visit: 04/19/2022'>

Page 69
Search
<re.Match object; span=(657, 674), match='Visit: 04/19/2022'>

Page 70
unknown widths :
[0, IndirectObject(2051, 0, 2905784995472)]
unknown widths : 
[0, IndirectObject(2042, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 04/19/2022'> 

Page 71
Search
<re.Match object; span=(267, 284), match='Visit: 04/19/2022'>

Page 72
Search
<re.Match object; span=(194, 211), match='Visit: 04/19/2022'>

Page 73
unknown widths : 
[0, IndirectObject(1996, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 04/19/2022'>

Page 74
Search
<re.Match object; span=(193, 210), match='Visit: 04/19/2022'>

Page 75
Search
<re.Match object; span=(287, 304), match='Visit: 03/22/2022'>

Page 76
Search
<re.Match object; span=(235, 252), match='Visit: 03/22/2022'>

Page 77
Search
<re.Match object; span=(194, 211), match='Visit: 03/22/2022'>

Page 78
Search
<re.Match object; span=(584, 601), match='Visit: 03/22/2022'>

Page 79
Search
<re.Match object; span=(251, 268), match='Visit: 03/22/2022'>

Page 80
Search
<re.Match object; span=(194, 211), match='Visit: 03/22/2022'>

Page 81
Search
<re.Match object; span=(194, 211), match='Visit: 03/22/2022'>

Page 82
unknown widths : 
[0, IndirectObject(1845, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 03/22/2022'>

Page 83
unknown widths : 
[0, IndirectObject(1820, 0, 2905784995472)]
Search
<re.Match object; span=(507, 524), match='Visit: 03/22/2022'>

Page 84
Search
<re.Match object; span=(255, 272), match='Visit: 02/03/2022'>

Page 85
Search
<re.Match object; span=(194, 211), match='Visit: 02/03/2022'>

Page 86
Search
<re.Match object; span=(194, 211), match='Visit: 02/03/2022'>

Page 87
Search
<re.Match object; span=(254, 271), match='Visit: 02/03/2022'>

Page 88
Search
<re.Match object; span=(194, 211), match='Visit: 02/03/2022'> 

Page 89
Search
<re.Match object; span=(560, 577), match='Visit: 02/03/2022'>

Page 90
unknown widths :
[0, IndirectObject(1705, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 02/03/2022'>

Page 91
Search
<re.Match object; span=(194, 211), match='Visit: 02/03/2022'>

Page 92
Search
<re.Match object; span=(193, 210), match='Visit: 02/03/2022'>

Page 93
unknown widths : 
[0, IndirectObject(1605, 0, 2905784995472)]
Search
<re.Match object; span=(194, 211), match='Visit: 02/03/2022'>

(venv) PS C:\Users\stand\venv>
