Python Forum
PyPDF2: Find a PDF bookmark with a keyword
#1
Hey everybody, I'm very new to Python and scripting in general, apart from some MATLAB. I'm trying to use PyPDF2 to pull a certain bookmarked section out of each PDF in a large folder and merge those sections together. I've worked out how to append by page number, but the real problem is that the authors used slightly different wording for the bookmarks in many of the documents.

For each file I need to find the page numbers of the bookmark whose title contains the keyword "equipment" so I can pull that section into the main document. If anybody could help me out I would be much obliged.
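
One way to cope with inconsistent bookmark wording is a case-insensitive substring match over the bookmark titles. The sketch below does not use PyPDF2 at all: it assumes the titles have already been extracted into a plain dict of title to page number, and the function name and sample titles are made up for illustration.

```python
def find_titles_with_keyword(title_to_page, keyword):
    """Return (title, page) pairs whose title contains keyword, ignoring case."""
    needle = keyword.casefold()
    return [(title, page) for title, page in title_to_page.items()
            if needle in title.casefold()]


# Hypothetical bookmark titles, as different authors might word them.
sample = {
    "Introduction": 0,
    "EQUIPMENT LIST": 4,
    "Procedure": 9,
}
matches = find_titles_with_keyword(sample, "equipment")
```

Here `matches` holds only the "EQUIPMENT LIST" entry, even though its capitalization differs from the keyword.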

-Troy


#2
I think I figured it out. It's kind of a Frankenstein of different modified procedures, but it works so far. It finds all the bookmark titles and their page numbers, then searches for the desired bookmark and uses the page numbers to slice out that portion and put all the slices in one PDF. I used the word 'equipment' as my keyword, but it could easily be modified to find something else. I'm new to Python and used work from several other people online. I did my best to give credit where it is due; if there's a more formal way to do it, please let me know. I'm sure this looks clunky to those with experience, but it works if anybody is doing something similar. Cheers...

import os

# Credit for the following code, which finds the titles of PDF bookmarks and their corresponding pages, goes to
# Darrel at https://stackoverflow.com/a/1924950. I merely updated it to use PyPDF2 and added a personalized page number
# finder and appending mechanism. Credit for the generator expression that finds the index of the desired tuple goes
# to Jon Surell https://stackoverflow.com/a/10865345.

import PyPDF2
class Table_Contents (PyPDF2.PdfFileReader):
    def get_the_page_numbers(self):
        def _setup_outline_page_ids(outline, _result=None):
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, PyPDF2.pdf.Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        outline_page_ids = _setup_outline_page_ids(self.getOutlines())
        page_id_to_page_numbers = _setup_page_id_to_num()

        result = {}
        for (_, title), page_idnum in outline_page_ids.items():
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result
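path = 'C:\\Users\\********\\Desktop\\PDF Documents'
keyword = 'equipment'
merger = PyPDF2.PdfFileMerger()
open_files = []  # keep the inputs open until the merged file has been written
for root, dirs, files in os.walk(path):
    for name in files:
        if not name.lower().endswith('.pdf'):
            continue  # skip anything that is not a PDF
        input1 = open(os.path.join(root, name), "rb")
        open_files.append(input1)
        pdf = Table_Contents(input1)
        # (page, title) pairs sorted by page; drop bookmarks whose page could not be resolved
        Dic = sorted((v, k) for k, v in pdf.get_the_page_numbers().items() if v != '???')
        for i, (p, t) in enumerate(Dic):
            if keyword in t.lower():  # case-insensitive match on the bookmark title
                Page_Start = p
                # Stop at the next bookmark, or at the end of the file if this is the last bookmark
                Page_Stop = Dic[i + 1][0] if i + 1 < len(Dic) else pdf.getNumPages()
                merger.append(input1, bookmark=None, pages=(Page_Start, Page_Stop, 1), import_bookmarks=False)
                break  # only the first matching bookmark in each file is used

# Write the merged file once, after every input has been processed
merger.write("All Equipment Lists.pdf")
for f in open_files:
    f.close()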
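
The slicing step boils down to this: sort the bookmarks by page, find the first one whose title contains the keyword, and stop at the page where the next bookmark begins. Here is a minimal sketch of just that logic, with no PyPDF2 dependency so it can be tried on its own. The name `bookmark_page_range` and the `total_pages` fallback (for when the matching bookmark is the last one in the file) are assumptions added for illustration, and the page numbers are assumed to be plain integers.

```python
def bookmark_page_range(title_to_page, keyword, total_pages):
    """Return (start, stop) page indices for the first bookmark matching keyword.

    stop is exclusive, like a Python slice. Returns None when nothing matches.
    """
    ordered = sorted((page, title) for title, page in title_to_page.items())
    needle = keyword.casefold()
    for index, (page, title) in enumerate(ordered):
        if needle in title.casefold():
            # Pages of bookmarks that start later than the match
            later = [p for p, _ in ordered[index + 1:] if p > page]
            # Fall back to the end of the document when no later bookmark exists
            stop = later[0] if later else total_pages
            return page, stop
    return None
```

For example, `bookmark_page_range({"Intro": 0, "Equipment List": 4, "Procedure": 9}, "equipment", 12)` gives the range from page 4 up to, but not including, page 9. The resulting pair could then be passed as the `pages` argument when appending to the merger.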