Python Forum
Split PDF at regex value
#11
PDF files are full of indirect objects. If something appears multiple times in the original document, the PDF stores it once, and uses indirect objects to reference it. I don't know why some of these have "unknown widths", but those are not errors, and might not affect the success/failure of finding dates and extracting pages.

If you ignore this:
Output:
unknown widths : [0, IndirectObject(3121, 0, 2905784995472)]
does scanning the document for dates appear to work? Did it miss any dates? The PDF you scanned for the example does not have a date on the first page, but it finds a date on all the other pages.
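One way to answer that is to list the pages where the regex finds nothing. This is only a minimal sketch: it assumes the text of each page has already been extracted into a list of strings, pages_without_date is a hypothetical helper name, and the pattern is the dd/dd/dddd one used in this thread.

```python
import re

# Assumed date format: two digits / two digits / four digits
DATE_REGEX = re.compile(r"\d{2}/\d{2}/\d{4}")


def pages_without_date(page_texts):
    """Return the 0-based indexes of pages where no date is found."""
    return [
        i for i, text in enumerate(page_texts)
        if not DATE_REGEX.search(text or "")  # treat None as empty text
    ]


# Made-up page texts, just to show the idea
sample = ["Cover page", "Visit: 01/02/2023", "Visit: 01/02/2023 notes", ""]
print(pages_without_date(sample))  # [0, 3]
```

If the list only contains pages you expect to be undated (such as a cover page), the scan is probably fine.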
#12
I often have to get a range of pages from a pdf, so I wrote a little Python script to do that. Your task is similar, so I would make a dictionary whose keys are the dates and whose values are lists of the page numbers where each date is found.

Then you can get a range of pages using just that data from the dictionary.
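To show the shape of that dictionary without needing a pdf, here is a small sketch that works on plain strings. The names dates_to_pages and page_range are made up for illustration, and the sample page texts are invented.

```python
import re
from collections import defaultdict

DATE_REGEX = re.compile(r"\d{2}/\d{2}/\d{4}")


def dates_to_pages(page_texts):
    """Map each date to the list of 0-based page indexes where it appears."""
    found = defaultdict(list)
    for i, text in enumerate(page_texts):
        # set() so a date repeated on one page is only recorded once
        for date in set(DATE_REGEX.findall(text or "")):
            found[date].append(i)
    return dict(found)


def page_range(found, date):
    """Return (first, last) page index for a date."""
    pages = found[date]
    return pages[0], pages[-1]


sample = ["Visit: 01/02/2023", "continued 01/02/2023", "Visit: 15/03/2023"]
found = dates_to_pages(sample)
print(page_range(found, "01/02/2023"))  # (0, 1)
```

Note that the range runs from the first to the last page holding the date, so any undated pages in between are included.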

The test pdf I made had 2 different dates on 1 page. If your main pdf never has that, life is easier and re.search will do.

If there are different dates on 1 page, life becomes a little more difficult: you have to think of a way to eliminate the text you do not want. A little function perhaps.
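Such a function could look something like this. It is a sketch under one big assumption: that the date you want is labelled "Visit: ", as in the commented-out pattern in my script. If your pdf uses a different label, the pattern has to change. pick_date is a hypothetical name.

```python
import re

ANY_DATE = re.compile(r"\d{2}/\d{2}/\d{4}")
# Assumption: the wanted date follows the label "Visit: "
LABELLED_DATE = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def pick_date(text):
    """Prefer the labelled date; fall back to the first date; else None."""
    labelled = LABELLED_DATE.search(text)
    if labelled:
        return labelled.group(1)
    any_date = ANY_DATE.search(text)
    return any_date.group(0) if any_date else None


print(pick_date("Born 03/04/1970 Visit: 01/02/2023"))  # 01/02/2023
```

The fallback to the first date found keeps pages without the label usable, but you may prefer to return None there and inspect those pages by hand.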

Also, text = pages[i].extract_text() is not always reliable. Sometimes you won't get any text. If you have that problem with PyPDF2, try using pdfminer.

Snippsat gave me that tip. pdfminer can get text which PyPDF2 cannot get for some reason.
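A fallback along those lines might be sketched like this. I am assuming the pdfminer.six package, whose pdfminer.high_level.extract_text takes a page_numbers argument with 0-based indexes. The helper names are hypothetical, and the libraries are imported inside the functions so that the fallback logic itself does not depend on either one being installed.

```python
def first_nonempty_text(extractors):
    """Call each extractor in turn and return the first non-blank text."""
    for extract in extractors:
        text = extract() or ""  # some extractors may return None
        if text.strip():
            return text
    return ""


def pypdf2_text(path, page_index):
    """Text of one page via PyPDF2."""
    from PyPDF2 import PdfReader
    return PdfReader(path).pages[page_index].extract_text()


def pdfminer_text(path, page_index):
    """Text of one page via pdfminer.six (page_numbers is 0-based)."""
    from pdfminer.high_level import extract_text
    return extract_text(path, page_numbers=[page_index])


def page_text(path, page_index):
    """Try PyPDF2 first, then pdfminer if PyPDF2 returned nothing."""
    return first_nonempty_text([
        lambda: pypdf2_text(path, page_index),
        lambda: pdfminer_text(path, page_index),
    ])
```

Reopening the file for every page is slow on a big pdf, so in a real script you would probably create the PdfReader once and only call pdfminer for the pages that come back empty.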

I just altered my get_page_range.py here:

def myApp():
    import re
    from PyPDF2 import PdfReader, PdfWriter

    sourceFile = '/home/pedro/pdfs/pdfs/doctor_visits.pdf'
    savepath = '/home/pedro/pdfs/pdfs/'
    # date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")
    date_regex = re.compile(r"(\d{2}/\d{2}/\d{4})")
    # read the pdf
    pdf = PdfReader(sourceFile)
    pages = pdf.pages
    numpages = len(pdf.pages)
    print('This pdf has ' + str(numpages) + ' pages')
    # a dictionary mapping each date to the list of page indexes where it is found
    dates_visits = {}
    for i in range(numpages):
        # extract_text() may give back nothing for some pages
        text = pages[i].extract_text() or ''
        # some pages may have more than 1 date; set() gets rid of repeated dates
        for date in set(date_regex.findall(text)):
            dates_visits.setdefault(date, []).append(i)

    for item in dates_visits.items():
        print(item)

    # get the relevant page numbers from the dates_visits dictionary and extract those pages to a pdf
    for key in dates_visits.keys():
        print('key is', key)        
        savename = 'patient_visits_' + key.replace('/', '_') + '.pdf'
        print('Saving to', savepath + savename)
        start = dates_visits[key][0]
        print('Start page is', start)
        stop = dates_visits[key][-1]
        print('Stop page is', stop)
        pdf_writer = PdfWriter()
        for p in range(start, stop + 1):        
            pdf_writer.add_page(pdf.pages[p])
        with open(savepath + savename, 'wb') as out:
            pdf_writer.write(out)
            print(f'Created: {savename} and saved in', savepath)


if __name__ == '__main__':
    myApp()