PyPDF2.PDFFileReader

standenman · Jan-31-2018, 03:40 PM

I am trying to use the PyPDF2 class PdfFileReader to extract text from the name of a bookmark. I have used the getOutlines() function and I get every bookmark. I was hoping to be able to target a specific bookmark. I see from the documentation that getOutlines() takes the arguments (node=None, outlines=None), but I simply cannot find what these arguments mean. I was hoping they could help me zero in on the bookmark title that I want.

**Larz60+** · Jan-31-2018, 05:09 PM

You can download the complete PDF spec here: https://www.adobe.com/content/dam/acom/e...0_2008.pdf
download the source code here: https://pypi.python.org/packages/b4/01/6...883d50ee5e

the source code is in the sub directory: PyPDF2

In that document, search for DocumentCatalog to find details.
See addBookmarkDict in pdf.py (source code).
The getOutlines method can also be found in pdf.py (in the PyPDF2 source code).
Look in that code for node and outlines.
node looks to be a dictionary, so you should be able to look at its keys. The keys looked for in getOutlines are "/First" and "/Next",
so I'm guessing that it can be used to iterate over an outline.
outlines is a list, and is fetched with the getOutlines method, also in pdf.py.

I'm not going to reverse engineer it all, but you should have enough ammunition here to do it yourself.
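
As a rough sketch of working with that list: getOutlines() returns a nested list in which a sub-list holds the children of the bookmark just before it, so one option is to flatten it and filter on the title. The helper names iter_bookmarks and find_bookmarks, the sample titles, and the file name in the comments are made up for illustration; plain dicts stand in for the dictionary-like entries PyPDF2 returns, which carry a '/Title' key.

```python
def iter_bookmarks(outlines, depth=0):
    """Yield (depth, entry) for every bookmark in a nested outline list."""
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds the children of the preceding bookmark.
            yield from iter_bookmarks(item, depth + 1)
        else:
            yield depth, item


def find_bookmarks(outlines, text):
    """Return the entries whose '/Title' contains the given text."""
    return [entry for _, entry in iter_bookmarks(outlines)
            if text in entry["/Title"]]


# Stand-in data shaped like the output of getOutlines() (made-up titles).
sample = [
    {"/Title": "Section A"},
    [{"/Title": "1F: Hospital Records"}, {"/Title": "2F: Office Notes"}],
    {"/Title": "Section B"},
]

for depth, entry in iter_bookmarks(sample):
    print("  " * depth + entry["/Title"])

# With a real file (PyPDF2 1.x; "records.pdf" is a hypothetical name):
# from PyPDF2 import PdfFileReader
# with open("records.pdf", "rb") as fh:
#     outlines = PdfFileReader(fh).getOutlines()
#     hits = find_bookmarks(outlines, "Hospital")
```

Filtering after the fact like this avoids having to pass anything to getOutlines() at all.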

standenman · Feb-01-2018, 06:06 PM

Thanks for your reply, but I really don't get it.

***snippsat*** · Feb-01-2018, 06:39 PM

(Jan-31-2018, 05:09 PM)Larz60+ Wrote: download the source code here: https://pypi.python.org/packages/b4/01/6...883d50ee5e

He has it already installed in this post.

@standenman the Doc that you should have posted.
I guess that none of us use this package.
So if you want help, you need to make it easier to test it out.
That means posting all your code, any Traceback, and also the PDF input document.

standenman · Feb-01-2018, 10:26 PM

Well, I really don't know what I am doing here or how to get it done.

I am trying to capture text as pairs from the bookmark names in a PDF file. I have been able to use PyPDF2 to get a nested list of the PDF document's bookmarks, this being the output from the getOutlines() function of the PdfFileReader class. I haven't found a way to really refine the output from the getOutlines() function. I was thinking it would be ideal to be able to grab only what I want at that point, but I can't find more info on that.

Next I was thinking I could just use the Python package re to do a regex, but it is too complex for me. I want to parse text that begins with /Title': numberF, where number begins with 1, and ends with /dd/dd/dddd; in the example below, 12/02/2014. Then I want to parse and store as a pair "CRESCENT MED. CTR. 0F LANCASTER " and "06/19/2013 - 12/02/2014". So I want to grab a string for each 1F through nF section of text. But I only want one string for each numberF: the regex should only save a string that ends with the "first" dd/mm/yyyy.

1F: Hospital Records (HOSPITAL) Src.: CRESCENT MED. CTR. 0F LANCASTER Tmt. Dt.: 06/19/2013 - 12/02/2014 (105 pages)', '/Page': IndirectObject(797, 0), '/Type': '/FitB'

Any help would be MOST appreciated.
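
One way the parsing described above might be sketched with re, assuming every title of interest follows the same layout as the example ("numberF: ... Src.: source Tmt. Dt.: date - date"). The pattern, the name TITLE_RE, and the helper parse_title are illustrative guesses based on that single example, not something tested against the real PDFs.

```python
import re

# Assumed layout, inferred from the one example title:
#   "<n>F: ... Src.: <source> Tmt. Dt.: <mm/dd/yyyy>[ - <mm/dd/yyyy>] ..."
TITLE_RE = re.compile(
    r"^(?P<exhibit>\d+F):.*?"
    r"Src\.:\s*(?P<source>.+?)\s*"
    r"Tmt\. Dt\.:\s*"
    r"(?P<dates>\d{2}/\d{2}/\d{4}(?:\s*-\s*\d{2}/\d{2}/\d{4})?)"
)


def parse_title(title):
    """Return (source, dates) for a matching title, or None otherwise."""
    match = TITLE_RE.search(title)
    if match is None:
        return None
    return match.group("source"), match.group("dates")


example = ("1F: Hospital Records (HOSPITAL) Src.: CRESCENT MED. CTR. 0F "
           "LANCASTER Tmt. Dt.: 06/19/2013 - 12/02/2014 (105 pages)")

print(parse_title(example))
# ('CRESCENT MED. CTR. 0F LANCASTER', '06/19/2013 - 12/02/2014')
```

Applying the pattern to each bookmark's '/Title' value, rather than to the printed form of the whole outline list, keeps one match per bookmark and sidesteps the "first date only" problem.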

**Larz60+** · Feb-01-2018, 11:00 PM

In the future, you should post all requirements in the first post, clearly and as completely as possible.
Giving them in the 5th post is frustrating to respondents.
You have been given tools that you seem unwilling to use.
Please create an item-by-item list of what your goal is, what you have tried,
and what exactly is not working.
Based on your first post comment:

Quote:I have used the GetOutlines() function and I get every bookmark

I have spent a good amount of time trying to get an answer to that.

standenman · Feb-01-2018, 11:38 PM

Well, sorry. I never said I could not use getOutlines(), only that I was hoping someone might have a clue as to what the arguments are so that I might be able to zero in on the bookmarks that I am interested in. Sorry to have wasted your time. My question changed because I could not solve the problem with getOutlines(), so I went in another direction. Sorry to have asked too many questions.

**Larz60+** · Feb-02-2018, 01:21 AM

You're not wasting my time, just asking that you be clear about your objective.
The argument node looks to be a means by which you could issue a node='/First'
to get the first outline, followed by successive calls to '/Next' to traverse the 'tree'
and get additional entries. I'm not set up to test this theory, but I think you are.
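
A sketch of what such a '/First' and '/Next' traversal looks like, using plain dicts as stand-ins for PDF outline items: '/First' points to an item's first child and '/Next' to its next sibling. From reading getOutlines in pdf.py, it appears that node is expected to be an outline item dictionary of this kind rather than a key name, and that the method follows these links itself when called with no arguments; treat that reading as an assumption. The function name walk_outline and the sample titles are made up.

```python
def walk_outline(node, depth=0):
    """Yield (depth, title) by following /First (child) and /Next (sibling)."""
    while node is not None:
        yield depth, node["/Title"]
        if "/First" in node:
            # Descend into the first child, which chains to its own siblings.
            yield from walk_outline(node["/First"], depth + 1)
        node = node.get("/Next")  # move on to the next sibling, if any


# Plain dicts standing in for linked PDF outline items (made-up titles).
child_b = {"/Title": "2F: Office Notes"}
child_a = {"/Title": "1F: Hospital Records", "/Next": child_b}
second = {"/Title": "Section B"}
first = {"/Title": "Section A", "/First": child_a, "/Next": second}

for depth, title in walk_outline(first):
    print("  " * depth + title)
```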

standenman · Feb-02-2018, 04:32 AM

Thanks for your response. The PDFs I am working with were created by a third party. The getDocumentInfo() function reveals the PDF was made in iText ('/Producer': 'iText® 5.4.0'). Would I be better off coming at it from iText?

**Larz60+** · Feb-02-2018, 05:37 AM

I've never used this module standalone, but I do use the wxPython implementation
of wx.lib.pdfviewer, which will use either PyPDF2 or PyMuPDF (whichever it finds installed).
They have a base application that might be helpful, documented here:
https://wxpython.org/Phoenix/docs/html/w....pdfviewer
including a complete example.
Other examples of using PyPDF2 (without a GUI) can be found here: http://nullege.com/codes/search?cq=PyPDF2
and specifically using getOutlines here: http://nullege.com/codes/search/pdf.PdfF...etOutlines

If you want to try the GUI example, you need to install wxpython which is simple:

pip install wxpython
