PyPDF2: Find a PDF bookmark with a keyword

Aviator83 · Jul-31-2017, 02:10 PM

Hey everybody, very new to Python and scripting in general except for some MATLAB stuff. I'm trying to use PyPDF2 to take a certain bookmarked portion from each PDF in a huge folder of PDFs and merge them together. I've figured out how to append using page numbers, but the real problem is that the authors used slightly different wording for the bookmarks in many of the documents.

For each file I need to be able to find the page numbers of the bookmark that contains the keyword "equipment" so I can pull that section into the main doc. If anybody could help me out I would be much obliged.

-Troy

Aviator83 · Aug-01-2017, 06:35 PM

I think I figured it out. It's kind of a Frankenstein of different modified procedures, but it works so far. It finds all the bookmark titles and their page numbers, then searches for the desired bookmark and uses the page numbers to slice out that portion and put them all in one PDF. I used the word 'equipment' as my keyword, but it could easily be modified to find something else. I'm new to the Python thing and used work from several other people online. I did my best to give credit where it is due; if there's a more formal way to do it, please let me know. I'm sure this looks clunky to those with experience, but it works if anybody is doing something similar. Cheers...

import os

# Credit for the following code, which finds the titles of PDF bookmarks and their corresponding pages, goes to
# Darrel at https://stackoverflow.com/a/1924950. I merely updated it to use PyPDF2 and added a personalized page number
# finder and appending mechanism. Credit for the generator expression that finds the index of the desired tuple goes
# to Jon Surell https://stackoverflow.com/a/10865345.

import PyPDF2


class Table_Contents(PyPDF2.PdfFileReader):
    def get_the_page_numbers(self):
        # Returns a dict mapping each bookmark title to its zero-based page number
        def _setup_outline_page_ids(outline, _result=None):
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, PyPDF2.pdf.Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        outline_page_ids = _setup_outline_page_ids(self.getOutlines())
        page_id_to_page_numbers = _setup_page_id_to_num()

        result = {}
        for (_, title), page_idnum in outline_page_ids.items():
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result
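

path = 'C:\\Users\\********\\Desktop\\PDF Documents'
keyword = 'equipment'
merger = PyPDF2.PdfFileMerger()
open_files = []  # kept open until the merged file has been written
for root, dirs, files in os.walk(path):
    for name in files:
        if not name.lower().endswith('.pdf'):
            continue  # skip anything that is not a PDF
        input1 = open(os.path.join(root, name), "rb")
        open_files.append(input1)
        pdf = Table_Contents(input1)
        # (page number, title) pairs in page order; bookmarks whose page could not be resolved ('???') are dropped
        Dic = sorted((v, k) for k, v in pdf.get_the_page_numbers().items() if isinstance(v, int))
        for i, (p, t) in enumerate(Dic):
            if keyword in t.lower():  # case-insensitive match on the bookmark title
                Page_Start = p
                # The section ends where the next bookmark starts, or at the end of the file
                Page_Stop = Dic[i + 1][0] if i + 1 < len(Dic) else pdf.getNumPages()
                merger.append(input1, bookmark=None, pages=(Page_Start, Page_Stop), import_bookmarks=False)
                break  # only the first matching bookmark per file is used

# Write once, after every file has been processed
merger.write("All Equipment Lists.pdf")
merger.close()
for f in open_files:
    f.close()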
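
For anyone adapting this, the step that turns a keyword into a page range can be tried on its own, without any PDFs. Below is a minimal sketch, assuming the bookmarks are already available as (page number, title) pairs like the ones the script builds. The names find_section_range and toc are made up for illustration and are not part of PyPDF2. The stop page is treated as exclusive, like a Python slice, which is how I understand the pages tuple passed to the merger to work.

```python
def find_section_range(bookmarks, keyword, total_pages):
    """Return (start, stop) for the first bookmark whose title contains keyword.

    bookmarks   -- iterable of (zero-based page number, title) pairs
    keyword     -- text to look for, matched case-insensitively
    total_pages -- page count of the document, used when nothing follows the match
    Returns None when no title matches.
    """
    ordered = sorted(bookmarks)
    needle = keyword.lower()
    for start, title in ordered:
        if needle in title.lower():
            # The section runs up to the first bookmark that starts on a later page,
            # or to the end of the document if there is none.
            stop = next((page for page, _ in ordered if page > start), total_pages)
            return start, stop
    return None


# Hypothetical table of contents, just for demonstration
toc = [(0, "Introduction"), (3, "Equipment List"), (7, "Procedures")]
print(find_section_range(toc, "equipment", 10))   # (3, 7)
print(find_section_range([(0, "Intro"), (4, "Required equipment")], "equipment", 9))  # (4, 9)
print(find_section_range(toc, "budget", 10))      # None
```

Returning None for a file with no matching bookmark lets the calling loop skip that file instead of reusing the page numbers from the previous one.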
