I often have to pull a range of pages out of a pdf, so I wrote a little Python script to do that. Your task is similar, so I would make a dictionary whose keys are the dates and whose values are lists of the page numbers where each date is found.
Then you can get the range of pages for each date using just the data in that dictionary.
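To show the shape of that dictionary, here is a minimal sketch that uses made-up page texts instead of a real pdf. The names build_date_index and page_range are only for this example, and the sample strings stand in for whatever extract_text() gives you.

```python
import re

DATE_REGEX = re.compile(r"\d{2}/\d{2}/\d{4}")


def build_date_index(page_texts):
    """Map each date to the list of 0-based page indexes where it appears."""
    index = {}
    for page_number, text in enumerate(page_texts):
        # set() drops dates that are repeated on the same page
        for date in set(DATE_REGEX.findall(text)):
            index.setdefault(date, []).append(page_number)
    return index


def page_range(index, date):
    """Return (first, last) page index for a date."""
    pages = index[date]
    return pages[0], pages[-1]


# made-up page texts, standing in for pages[i].extract_text()
sample_pages = [
    "Visit: 01/02/2023 blood test",
    "Visit: 01/02/2023 results",
    "Visit: 15/03/2023 check-up",
]
index = build_date_index(sample_pages)
print(index)
print(page_range(index, "01/02/2023"))
```

With these sample texts the first date covers pages 0 to 1 and the second date only page 2.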
The test pdf I made had 2 different dates on 1 page. Maybe your main pdf doesn't have that, which makes life easier: re.search will do.
If there are different dates on 1 page, life becomes a little more difficult: you have to think of a way to eliminate the text you do not want. A little function, perhaps.
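One way such a little function could look, as a sketch only: it assumes the date you want is always labelled "Visit:" and the other dates on the page are not. Your pdf may well use a different label. The name visit_dates and the sample text are made up for this example.

```python
import re

# assumption: the wanted date follows the label "Visit:", other dates do not
VISIT_REGEX = re.compile(r"Visit:\s*(\d{2}/\d{2}/\d{4})")


def visit_dates(text):
    """Return only the dates that follow 'Visit:', in order, without repeats."""
    found = []
    for date in VISIT_REGEX.findall(text):
        if date not in found:
            found.append(date)
    return found


# made-up page text with 3 dates, only 1 of which is a visit date
page_text = "Born: 04/07/1980 Visit: 01/02/2023 Next appointment: 15/03/2023"
print(visit_dates(page_text))
```

If the unwanted dates have no label that sets them apart, this approach will not help and you would need some other rule, for example the position of the date on the page.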
Also, text = pages[i].extract_text() is not always reliable. Sometimes you won't get any text. If you have that problem with PyPDF2, try using pdfminer.
Snippsat gave me that tip. pdfminer can get text which PyPDF2 cannot get for some reason.
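A fallback could be wired up roughly like this. It is a sketch, not tested against your pdf: first_text is a made-up helper that tries each extractor in turn, and the commented lines assume pdfminer.six is installed and that its extract_text takes a page_numbers list of 0-based page indexes.

```python
def first_text(extractors):
    """Try each extractor in turn and return the first non-empty text."""
    for extract in extractors:
        try:
            text = extract()
        except Exception:
            # a failed extractor should not stop the others from being tried
            text = None
        if text and text.strip():
            return text
    return ''


# With a real pdf the extractors could look like this (not run here):
#   from pdfminer.high_level import extract_text
#   text = first_text([
#       lambda: pages[i].extract_text(),
#       lambda: extract_text(sourceFile, page_numbers=[i]),
#   ])

# stand-ins: the first extractor finds no text, the second one does
print(first_text([lambda: '', lambda: 'Visit: 01/02/2023']))
```

pdfminer is slower than PyPDF2, so trying PyPDF2 first and only falling back when the page comes back empty keeps the common case fast.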
I just altered my get_page_range.py here:
def myApp():
    import re
    from PyPDF2 import PdfReader, PdfWriter

    sourceFile = '/home/pedro/pdfs/pdfs/doctor_visits.pdf'
    savepath = '/home/pedro/pdfs/pdfs/'
    # date_regex = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")
    date_regex = re.compile(r"(\d{2}/\d{2}/\d{4})")
    # read the pdf
    pdf = PdfReader(sourceFile)
    pages = pdf.pages
    numpages = len(pages)
    print('This pdf has ' + str(numpages) + ' pages')
    # a dictionary to take the dates and the page numbers where those dates are found
    dates_visits = {}
    # first pass: initialize dates_visits with an empty list for each date
    # some pages may have more than 1 date
    for i in range(numpages):
        text = pages[i].extract_text()
        date_match = re.findall(date_regex, text)
        if date_match:
            print('Date match is', date_match)
            date_set = set(date_match)
            print('Date set is', date_set)
            # there may be only 1 date on the page, but there may be more
            for date in date_set:
                dates_visits[date] = []
    # second pass: record the page numbers for each date
    for i in range(numpages):
        text = pages[i].extract_text()
        # date_match is a list of the dates found on this page
        date_match = re.findall(date_regex, text)
        if date_match:
            # get rid of repeated dates
            date_set = set(date_match)
            for date in date_set:
                if i not in dates_visits[date]:
                    dates_visits[date].append(i)
    for item in dates_visits.items():
        print(item)
    # get the relevant page numbers from the dates_visits dictionary and extract those pages to a pdf
    for key in dates_visits.keys():
        print('key is', key)
        savename = 'patient_visits_' + key.replace('/', '_') + '.pdf'
        print('Saving to', savepath + savename)
        # page numbers are 0-based indexes into pdf.pages
        start = dates_visits[key][0]
        print('Start page is', start)
        stop = dates_visits[key][-1]
        print('Stop page is', stop)
        pdf_writer = PdfWriter()
        # copy every page from the first to the last page where this date was found
        for p in range(start, stop + 1):
            pdf_writer.add_page(pdf.pages[p])
        with open(savepath + savename, 'wb') as out:
            pdf_writer.write(out)
        print(f'Created: {savename} and saved in', savepath)