Python Forum
Split PDF at regex value
#1
I am trying to write code that splits a PDF into multiple files based on a regex match in the PDF text. Specifically, I want to split this PDF into discrete PDFs that each represent one patient visit. In my test PDF, the office visit date is styled "Visit: ##/##/####". As the code iterates through the pages, I only want a split where that office visit date changes, and I want each newly created PDF file to be named with the date of the visit. Here is my code and my errors:

import re
from PyPDF2 import PdfReader, PdfWriter

def split_pdf_by_date(pdf_path, regex_pattern):
    # Open the PDF file
    pdf = PdfReader(pdf_path)

    # Initialize variables
    current_date = None
    output = None

    # Iterate through each page in the PDF
    for page_num in range(len(pdf.pages)):
        # Extract the text from the current page
        page = pdf.pages[page_num]
        text = page.extract_text()

        # Find the date in the text using regex
        date_match = re.search(regex_pattern, text)

        if date_match:
            # Get the date value
            date = date_match.group()

            if current_date is None or date != current_date:
                # Start a new output PDF if the date has changed
                if output:
                    output_path = f"output_{current_date}.pdf"
                    with open(output_path, "wb") as output_file:
                        output.write(output_file)

                # Update the current date and create a new PDF writer
                current_date = date
                output = PdfWriter()

        if output:
            # Add the current page to the output PDF
            output.add_page(page)

    # Save the last output PDF
    if output:
        output_path = f"output_{current_date}.pdf"
        with open(output_path, "wb") as output_file:
            output.write(output_file)

        print("PDF split completed successfully.")
        print(output_path)  # Print the output path

# Example usage
pdf_path = "Test.pdf"
date_regex = r"Visit: \d{2}/\d{2}/\d{4}"

split_pdf_by_date(pdf_path, date_regex)
Error:
unknown widths : [0, IndirectObject(3121, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3115, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3110, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3104, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3099, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3051, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3034, 0, 2157813271952)]
Traceback (most recent call last):
  File "c:\Users\stand\venv\import PyPDF2.py", line 54, in <module>
    split_pdf_by_date(pdf_path, date_regex)
  File "c:\Users\stand\venv\import PyPDF2.py", line 30, in split_pdf_by_date
    with open(output_path, "wb") as output_file:
         ^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument: 'output_Visit: 03/23/2023.pdf'
I can see that the first office visit in the target PDF, 3/23/2023, gets found, but that is about it!
#2
Your problem is that the filename for your new file is invalid. Your regex splitting probably works fine.

'output_Visit: 03/23/2023.pdf' is not a valid filename. You cannot have "/" in a filename. The colon is also a bad choice. On Windows, creating a file named "output_Visit: some date.pdf" results in a file named "output_Visit".

You need to process the date, maybe changing "/" to "_", and removing the colon and any spaces.
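
A minimal sketch of that clean-up, assuming the matched text looks like "Visit: 03/23/2023". The helper name `safe_filename` is hypothetical and not part of PyPDF2; it uses only the standard `re` module.

```python
import re


def safe_filename(match_text: str, suffix: str = ".pdf") -> str:
    """Turn text such as 'Visit: 03/23/2023' into 'Visit_03_23_2023.pdf'."""
    # Replace every run of characters that is not a letter, digit, dot,
    # or hyphen with a single underscore, then trim stray underscores.
    cleaned = re.sub(r"[^A-Za-z0-9.-]+", "_", match_text).strip("_")
    return cleaned + suffix


print(safe_filename("Visit: 03/23/2023"))
```

In the original code this would be applied where the output path is built, for example `output_path = "output_" + safe_filename(current_date)`.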
#3
OK. Thanks very much for your help. After eliminating the "/" from the file name, the code runs, but it makes only one new file. The target PDF has 4 or 5 office visits, that is, changes in the value of "Visit:". The split also did not occur at the first change in the regex match. It just seemed kind of random.

I am missing something here. It is like the code is not iterating through to make X number of new pdfs based upon X changes in the regex.

#4
Divide and conquer. I would first work on the logic that finds all the dates in the PDF and just print them to the screen. Something like this:
import re
from PyPDF2 import PdfReader

date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def split_pdf_by_date(pdf_path):
    pdf = PdfReader(pdf_path)
    current_date = None

    # Iterate through each page in the PDF.  Print all the
    # dates included in the PDF.
    for pagenum, page in enumerate(pdf.pages):
        text = page.extract_text()
        date_match = re.search(date_regex, text)
        if date_match:
            new_date = date_match.group(1).replace("/", "_")
            if new_date != current_date:
                print(pagenum, new_date)
                current_date = new_date


split_pdf_by_date("Test.pdf")
Step through the PDF and verify that all different dates are printed. If this doesn't work, learn why. For example, if there are multiple dates on one page it will only print one.
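
To check the multiple-dates-per-page case, a small sketch along these lines may help. It works on already extracted page text, so it can be tried without a PDF. The function name `dates_on_page` and the sample string are made up for illustration.

```python
import re

DATE_REGEX = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def dates_on_page(text: str) -> list[str]:
    """Return each distinct visit date in the text, in order of first appearance."""
    seen = []
    # findall returns the captured group (the date) for every match,
    # not just the first one the way re.search does.
    for date in DATE_REGEX.findall(text):
        if date not in seen:
            seen.append(date)
    return seen


sample = "Visit: 03/23/2023 ... Visit: 12/29/2022 ... Visit: 03/23/2023"
print(dates_on_page(sample))
```

If any page returns more than one date, a page-level split cannot separate those visits cleanly.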

Once that is working, you can work on writing the new PDF files. I think your logic for that looks good, with two caveats: you will miss pages at the start of the document until you find a page that has a date, and the logic assumes the date applies to the entire page. I tested this on my company's code of conduct PDF.
import re
from pathlib import Path
from PyPDF2 import PdfReader, PdfWriter

date_regex = re.compile(r"Code of Conduct  /  \d+")


def split_pdf_by_date(input_file, output_path):
    reader = PdfReader(input_file)
    writer = PdfWriter()
    current_date = None

    def write_pages():
        """Write PDFWriter pages to a file."""
        if current_date is None:
            filename = "introduction.pdf"
        else:
            filename = current_date.replace("/", "_").replace(":", "") + ".pdf"
        with open(output_path / filename, "wb") as output_file:
            writer.write(output_file)

    # Iterate through each page in the PDF.  Collect pages
    # in writer.  When date changes, write cached pages to
    # a file named after the current date.
    for page in reader.pages:
        text = page.extract_text()
        if date_match := re.search(date_regex, text):
            new_date = date_match.group()
            if new_date != current_date:
                if len(writer.pages) > 0:
                    write_pages()
                    writer = PdfWriter()  # No way to flush pages from writer
                current_date = new_date
        writer.add_page(page)

    # Write last date
    if len(writer.pages) > 0:
        write_pages()


split_pdf_by_date("Test.pdf", Path(__file__).parent / "output files")
I had to change the regex pattern and I modified how the files are named a little, but it worked great.
#5
Interesting! Thanks so much for your help and feedback. I just found that the first set of code you gave me just to see if I am getting dates fails on one PDF, but works on another, leading me to question this approach using PyPDF2. That is, maybe it is not a "scalable" solution?

#6
"I just found that the first set of code you gave me just to see if I am getting dates fails on one PDF, but works on another, leading me to question this approach using PyPDF2."

Why did it fail?
#7
Gives me this stuff:
Error:
[0, IndirectObject(3121, 0, 2465755860368)]
unknown widths : [0, IndirectObject(3115, 0, 2465755860368)]
unknown widths : [0, IndirectObject(3110, 0, 2465755860368)]
unknown widths : [0, IndirectObject(3104, 0, 2465755860368)]
unknown widths : [0, IndirectObject(3099, 0, 2465755860368)]
unknown widths : [0, IndirectObject(3051, 0, 2465755860368)]
unknown widths : [0, IndirectObject(3034, 0, 2465755860368)]
#8
What "gives you this stuff"?

Those are not error messages. Is your program printing something, or is this a message from PyPDF2? When are they printed? Are they output when you try to extract text from a page?

I would try something like this to diagnose.
import re
from PyPDF2 import PdfReader

date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def split_pdf_by_date(pdf_path):
    pdf = PdfReader(pdf_path)
    for pagenum, page in enumerate(pdf.pages, start=1):
        print("Page", pagenum)
        text = page.extract_text()
        print("Search")
        print(re.search(date_regex, text), "\n")


split_pdf_by_date("Test.pdf")
#9
OK. I will try it.

#10
Here's my output:

Error:
Page 1
unknown widths : [0, IndirectObject(3121, 0, 2905784995472)]
unknown widths : [0, IndirectObject(3115, 0, 2905784995472)]
unknown widths : [0, IndirectObject(3110, 0, 2905784995472)]
unknown widths : [0, IndirectObject(3104, 0, 2905784995472)]
unknown widths : [0, IndirectObject(3099, 0, 2905784995472)]
Search None
Page 2
Search <re.Match object; span=(193, 210), match='Visit: 03/23/2023'>
Page 3
Search <re.Match object; span=(221, 238), match='Visit: 03/23/2023'>
Page 4
Search <re.Match object; span=(228, 245), match='Visit: 03/23/2023'>
Page 5
unknown widths : [0, IndirectObject(3051, 0, 2905784995472)]
Search <re.Match object; span=(452, 469), match='Visit: 03/23/2023'>
Page 6
unknown widths : [0, IndirectObject(3034, 0, 2905784995472)]
Search <re.Match object; span=(193, 210), match='Visit: 03/23/2023'>
Page 7
Search <re.Match object; span=(193, 210), match='Visit: 03/23/2023'>
Page 8
Search <re.Match object; span=(193, 210), match='Visit: 03/23/2023'>
Page 9
Search <re.Match object; span=(193, 210), match='Visit: 03/23/2023'>
Page 10
Search <re.Match object; span=(194, 211), match='Visit: 12/29/2022'>
Page 11
Search <re.Match object; span=(222, 239), match='Visit: 12/29/2022'>
Page 12
Search <re.Match object; span=(229, 246), match='Visit: 12/29/2022'>
Page 13
Search <re.Match object; span=(453, 470), match='Visit: 12/29/2022'>
Page 14
Search <re.Match object; span=(194, 211), match='Visit: 12/29/2022'>
Page 15
Search <re.Match object; span=(194, 211), match='Visit: 12/29/2022'>
Page 16
Search <re.Match object; span=(194, 211), match='Visit: 12/29/2022'>
Page 17
unknown widths : [0, IndirectObject(2858, 0, 2905784995472)]
Search <re.Match object; span=(224, 241), match='Visit: 12/29/2022'>
Page 18
Search <re.Match object; span=(194, 211), match='Visit: 11/15/2022'>
Page 19
unknown widths : [0, IndirectObject(2826, 0, 2905784995472)]
Search <re.Match object; span=(222, 239), match='Visit: 11/15/2022'>
Page 20
Search <re.Match object; span=(229, 246), match='Visit: 11/15/2022'>
Page 21
Search <re.Match object; span=(453, 470), match='Visit: 11/15/2022'>
Page 22
Search <re.Match object; span=(194, 211), match='Visit: 11/15/2022'>
Page 23
Search <re.Match object; span=(194, 211), match='Visit: 11/15/2022'>
Page 24
unknown widths : [0, IndirectObject(2765, 0, 2905784995472)]
unknown widths : [0, IndirectObject(2756, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 11/15/2022'>
Page 25
Search <re.Match object; span=(223, 240), match='Visit: 11/15/2022'>
Page 26
Search <re.Match object; span=(194, 211), match='Visit: 09/20/2022'>
Page 27
Search <re.Match object; span=(222, 239), match='Visit: 09/20/2022'>
Page 28
Search <re.Match object; span=(229, 246), match='Visit: 09/20/2022'>
Page 29
Search <re.Match object; span=(453, 470), match='Visit: 09/20/2022'>
Page 30
Search <re.Match object; span=(194, 211), match='Visit: 09/20/2022'>
Page 31
Search <re.Match object; span=(194, 211), match='Visit: 09/20/2022'>
Page 32
Search <re.Match object; span=(194, 211), match='Visit: 09/20/2022'>
Page 33
Search <re.Match object; span=(224, 241), match='Visit: 09/20/2022'>
Page 34
Search <re.Match object; span=(297, 314), match='Visit: 08/17/2022'>
Page 35
Search <re.Match object; span=(291, 308), match='Visit: 08/17/2022'>
Page 36
Search <re.Match object; span=(194, 211), match='Visit: 08/17/2022'>
Page 37
Search <re.Match object; span=(572, 589), match='Visit: 08/17/2022'>
Page 38
unknown widths : [0, IndirectObject(2576, 0, 2905784995472)]
unknown widths : [0, IndirectObject(2565, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 08/17/2022'>
Page 39
Search <re.Match object; span=(194, 211), match='Visit: 08/17/2022'>
Page 40
Search <re.Match object; span=(194, 211), match='Visit: 08/17/2022'>
Page 41
unknown widths : [0, IndirectObject(2514, 0, 2905784995472)]
unknown widths : [0, IndirectObject(2509, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 08/17/2022'>
Page 42
Search <re.Match object; span=(289, 306), match='Visit: 07/20/2022'>
Page 43
Search <re.Match object; span=(250, 267), match='Visit: 07/20/2022'>
Page 44
Search <re.Match object; span=(194, 211), match='Visit: 07/20/2022'>
Page 45
Search <re.Match object; span=(572, 589), match='Visit: 07/20/2022'>
Page 46
Search <re.Match object; span=(194, 211), match='Visit: 07/20/2022'>
Page 47
unknown widths : [0, IndirectObject(2428, 0, 2905784995472)]
unknown widths : [0, IndirectObject(2420, 0, 2905784995472)]
unknown widths : [0, IndirectObject(2415, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 07/20/2022'>
Page 48
Search <re.Match object; span=(194, 211), match='Visit: 07/20/2022'>
Page 49
Search <re.Match object; span=(194, 211), match='Visit: 07/20/2022'>
Page 50
Search <re.Match object; span=(289, 306), match='Visit: 06/22/2022'>
Page 51
Search <re.Match object; span=(250, 267), match='Visit: 06/22/2022'>
Page 52
Search <re.Match object; span=(194, 211), match='Visit: 06/22/2022'>
Page 53
Search <re.Match object; span=(560, 577), match='Visit: 06/22/2022'>
Page 54
Search <re.Match object; span=(194, 211), match='Visit: 06/22/2022'>
Page 55
unknown widths : [0, IndirectObject(2292, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 06/22/2022'>
Page 56
Search <re.Match object; span=(194, 211), match='Visit: 06/22/2022'>
Page 57
Search <re.Match object; span=(194, 211), match='Visit: 06/22/2022'>
Page 58
Search <re.Match object; span=(251, 268), match='Visit: 05/18/2022'>
Page 59
Search <re.Match object; span=(235, 252), match='Visit: 05/18/2022'>
Page 60
Search <re.Match object; span=(194, 211), match='Visit: 05/18/2022'>
Page 61
Search <re.Match object; span=(584, 601), match='Visit: 05/18/2022'>
Page 62
Search <re.Match object; span=(251, 268), match='Visit: 05/18/2022'>
Page 63
unknown widths : [0, IndirectObject(2173, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 05/18/2022'>
Page 64
Search <re.Match object; span=(194, 211), match='Visit: 05/18/2022'>
Page 65
unknown widths : [0, IndirectObject(2124, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 05/18/2022'>
Page 66
unknown widths : [0, IndirectObject(2103, 0, 2905784995472)]
Search <re.Match object; span=(290, 307), match='Visit: 04/19/2022'>
Page 67
unknown widths : [0, IndirectObject(2086, 0, 2905784995472)]
Search <re.Match object; span=(313, 330), match='Visit: 04/19/2022'>
Page 68
Search <re.Match object; span=(233, 250), match='Visit: 04/19/2022'>
Page 69
Search <re.Match object; span=(657, 674), match='Visit: 04/19/2022'>
Page 70
unknown widths : [0, IndirectObject(2051, 0, 2905784995472)]
unknown widths : [0, IndirectObject(2042, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 04/19/2022'>
Page 71
Search <re.Match object; span=(267, 284), match='Visit: 04/19/2022'>
Page 72
Search <re.Match object; span=(194, 211), match='Visit: 04/19/2022'>
Page 73
unknown widths : [0, IndirectObject(1996, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 04/19/2022'>
Page 74
Search <re.Match object; span=(193, 210), match='Visit: 04/19/2022'>
Page 75
Search <re.Match object; span=(287, 304), match='Visit: 03/22/2022'>
Page 76
Search <re.Match object; span=(235, 252), match='Visit: 03/22/2022'>
Page 77
Search <re.Match object; span=(194, 211), match='Visit: 03/22/2022'>
Page 78
Search <re.Match object; span=(584, 601), match='Visit: 03/22/2022'>
Page 79
Search <re.Match object; span=(251, 268), match='Visit: 03/22/2022'>
Page 80
Search <re.Match object; span=(194, 211), match='Visit: 03/22/2022'>
Page 81
Search <re.Match object; span=(194, 211), match='Visit: 03/22/2022'>
Page 82
unknown widths : [0, IndirectObject(1845, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 03/22/2022'>
Page 83
unknown widths : [0, IndirectObject(1820, 0, 2905784995472)]
Search <re.Match object; span=(507, 524), match='Visit: 03/22/2022'>
Page 84
Search <re.Match object; span=(255, 272), match='Visit: 02/03/2022'>
Page 85
Search <re.Match object; span=(194, 211), match='Visit: 02/03/2022'>
Page 86
Search <re.Match object; span=(194, 211), match='Visit: 02/03/2022'>
Page 87
Search <re.Match object; span=(254, 271), match='Visit: 02/03/2022'>
Page 88
Search <re.Match object; span=(194, 211), match='Visit: 02/03/2022'>
Page 89
Search <re.Match object; span=(560, 577), match='Visit: 02/03/2022'>
Page 90
unknown widths : [0, IndirectObject(1705, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 02/03/2022'>
Page 91
Search <re.Match object; span=(194, 211), match='Visit: 02/03/2022'>
Page 92
Search <re.Match object; span=(193, 210), match='Visit: 02/03/2022'>
Page 93
unknown widths : [0, IndirectObject(1605, 0, 2905784995472)]
Search <re.Match object; span=(194, 211), match='Visit: 02/03/2022'>
(venv) PS C:\Users\stand\venv>
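
Read page by page, the output above shows the regex matching on every page except page 1, across eleven distinct visit dates. A short sketch that condenses such per-page results into page ranges may make that easier to check. The function name `summarize_runs` and the `page_dates` list are illustrative only; the list is transcribed by hand from the output above, not produced by PyPDF2.

```python
from itertools import groupby


def summarize_runs(dates):
    """Collapse a per-page list of dates into (first_page, last_page, date) runs.

    Pages are numbered from 1; None marks a page with no regex match.
    """
    runs = []
    page = 1
    # groupby yields one group per run of consecutive equal values.
    for date, group in groupby(dates):
        count = len(list(group))
        runs.append((page, page + count - 1, date))
        page += count
    return runs


# Per-page results transcribed from the diagnostic output (93 pages).
page_dates = (
    [None]
    + ["03/23/2023"] * 8 + ["12/29/2022"] * 8 + ["11/15/2022"] * 8
    + ["09/20/2022"] * 8 + ["08/17/2022"] * 8 + ["07/20/2022"] * 8
    + ["06/22/2022"] * 8 + ["05/18/2022"] * 8 + ["04/19/2022"] * 9
    + ["03/22/2022"] * 9 + ["02/03/2022"] * 10
)

for first, last, date in summarize_runs(page_dates):
    print(f"pages {first}-{last}: {date}")
```

Each run with a date corresponds to one output PDF. The unmatched first page is the one the original script would drop, because it only starts collecting pages after the first match.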