Python Forum
PyPDF2.PDFFileReader
#1
I am trying to use the PyPDF2 class PdfFileReader to extract text from the name of a bookmark. I have used the getOutlines() function and I get every bookmark. I was hoping to be able to target a specific bookmark. I see from the documentation that getOutlines has the arguments (node=None, outlines=None), but I simply cannot find what these arguments mean. I was hoping they could help me zero in on the bookmark title that I want.
Reply
#2
You can download the complete PDF spec here: https://www.adobe.com/content/dam/acom/e...0_2008.pdf
download the source code here: https://pypi.python.org/packages/b4/01/6...883d50ee5e

the source code is in the sub directory: PyPDF2

In the spec document, search for DocumentCatalog to find details.
In the source code, see addBookmarkDict in pdf.py.
The getOutlines method can also be found in pdf.py.
Look in that code for node and outlines.
node looks to be a dictionary, so you should be able to look at its keys. The keys looked for in getOutlines are "/First" and "/Next",
so I'm guessing that it can be used to iterate over an outline.
outlines is a list, and is fetched with the getOutlines method, also in pdf.py.

I'm not going to reverse engineer it all, but you should have enough ammunition here to do it yourself.
Reply
#3
Thanks for your reply, but I really don't get it.
Reply
#4
(Jan-31-2018, 05:09 PM)Larz60+ Wrote: download the source code here: https://pypi.python.org/packages/b4/01/6...883d50ee5e
He has it already installed in this post.

@standenman the Doc that you should have posted.
I guess that none of us use this package.
So if you want help, you need to make it easier to test out.
That means posting all your code, with the traceback if there is one, and also the input PDF document.
Reply
#5
Well, I really don't know what I am doing here or how to get it done. I am trying to capture text as pairs from the bookmark names in a PDF file. I have been able to use PyPDF2 to get a nested list of the PDF's bookmarks, this being the output from the getOutlines() function of the PdfFileReader class. I haven't found a way to refine the output from getOutlines(). I was thinking it would be ideal to be able to grab only what I want at that point, but I can't find more info on that.

Next I was thinking I could just use the Python package re to do a regex, but it is too complex for me. I want to parse text that begins with /Title': numberF, where number starts at 1, and ends with dd/dd/dddd (in the example below, 12/02/2014). Then I want to parse and store as a pair "CRESCENT MED. CTR. 0F LANCASTER" and "06/19/2013 - 12/02/2014". So I want to grab a string for each 1F through nF section of text. But I only want one string for each numberF: the regex should only save a string that ends with the "first" dd/dd/dddd.

1F: Hospital Records (HOSPITAL) Src.: CRESCENT MED. CTR. 0F LANCASTER Tmt. Dt.: 06/19/2013 - 12/02/2014 (105 pages)', '/Page': IndirectObject(797, 0), '/Type': '/FitB'

Any help would be MOST appreciated.
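
For the regex part, here is a minimal sketch that assumes every title follows the layout of the sample above: an exhibit number such as 1F, then "Src.:", then "Tmt. Dt.:" followed by one date or a date range. The names TITLE_RE and parse_title are made up for this illustration, and the pattern has only been checked against the one sample line, so treat it as a starting point rather than a finished parser.

```python
import re

# Hypothetical pattern, assuming titles look like:
# "<n>F: <description> Src.: <source> Tmt. Dt.: <date>[ - <date>] (... pages)"
TITLE_RE = re.compile(
    r"^(?P<exhibit>\d+F):.*?Src\.:\s*(?P<source>.*?)\s*Tmt\. Dt\.:\s*"
    r"(?P<dates>\d{2}/\d{2}/\d{4}(?:\s*-\s*\d{2}/\d{2}/\d{4})?)"
)


def parse_title(title):
    """Return (source, dates) for a bookmark title, or None if it does not match."""
    match = TITLE_RE.search(title)
    if match is None:
        return None
    return match.group("source"), match.group("dates")


sample = (
    "1F: Hospital Records (HOSPITAL) Src.: CRESCENT MED. CTR. 0F LANCASTER "
    "Tmt. Dt.: 06/19/2013 - 12/02/2014 (105 pages)"
)
pair = parse_title(sample)
print(pair)
```

Because the pattern is applied to one title string at a time, each numberF bookmark yields at most one pair, and the match stops at the first date or date range after "Tmt. Dt.:".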
Reply
#6
In the future, you should post all requirements in the first post, clearly and as completely as possible.
Giving them in the 5th post is frustrating to respondents.
You have been given tools that you seem unwilling to use.
Please create an item by item list of what your goal is, what you have tried,
and what exactly is not working.
Based on your first post comment:
Quote:I have used the GetOutlines() function and I get every bookmark
I have spent a good amount of time trying to get an answer to that.
Reply
#7
Well, sorry. I never said I could not use getOutlines(), only that I was hoping someone might have a clue as to what the arguments are, so that I might be able to zero in on the bookmarks that I am interested in. Sorry to have wasted your time. My question changed because I could not solve the problem with getOutlines(), so I went in another direction. Sorry to have asked too many questions.
Reply
#8
You're not wasting my time, just asking that you be clear about your objective.
The argument node looks to be a means by which you could issue a node='/First'
to get the first outline, followed by successive calls to '/Next' to traverse the 'tree'
and get additional entries. I'm not set up to test this theory, but I think you are.
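
An alternative that avoids the arguments altogether: call getOutlines() with no arguments and filter the nested list it returns. As far as I can tell from pdf.py in PyPDF2 1.x, node and outlines are filled in by getOutlines itself while it recurses, but I have not verified that against every version. The sketch below assumes each bookmark is a dict-like object with a '/Title' key and that child bookmarks come back as a nested list, which matches the output described earlier in this thread. The names iter_bookmarks and find_bookmarks are made up for this example, and plain dicts stand in for the real bookmark objects so it runs without a PDF.

```python
def iter_bookmarks(outlines, depth=0):
    """Walk a nested outline list, yielding (depth, title) for each bookmark."""
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds the children of the preceding bookmark.
            yield from iter_bookmarks(item, depth + 1)
        else:
            yield depth, item["/Title"]


def find_bookmarks(outlines, predicate):
    """Return the titles, at any depth, for which predicate(title) is true."""
    return [title for _, title in iter_bookmarks(outlines) if predicate(title)]


# Stand-in data shaped like the nested list described in this thread.
# With PyPDF2 1.x it would presumably come from:
#   reader = PdfFileReader(open("some.pdf", "rb"))
#   outlines = reader.getOutlines()
outlines = [
    {"/Title": "A. Payment Documents"},
    {"/Title": "F. Medical Records"},
    [
        {"/Title": "1F: Hospital Records"},
        {"/Title": "2F: Office Treatment Records"},
    ],
]

flat = list(iter_bookmarks(outlines))
hospital = find_bookmarks(outlines, lambda t: "Hospital" in t)
print(hospital)
```

The predicate can be any test on the title, for example a check that it starts with a number followed by F.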
Reply
#9
Thanks for your response. The PDFs I am working with were created by a third party. The function getDocumentInfo() reveals the PDF was made in iText ('/Producer': 'iText® 5.4.0'). Would I be better off coming at it from iText?
Reply
#10
I've never used this module standalone, but I do use the wxPython implementation
of wx.lib.pdfviewer, which will use either PyPDF2 or PyMuPDF (whichever it finds installed).
They have a base application that might be helpful, documented here:
https://wxpython.org/Phoenix/docs/html/w....pdfviewer
including a complete example.
Other examples of using PyPDF2 (without a GUI) can be found here: http://nullege.com/codes/search?cq=PyPDF2
and specifically using getOutlines here: http://nullege.com/codes/search/pdf.PdfF...etOutlines

If you want to try the GUI example, you need to install wxPython, which is simple:
pip install wxpython
Reply

