Python Forum
Application to find title in PDF?
#1
Hello,

I need to loop through PDF files where the title is always somewhere on page 3. The metadata doesn't contain the actual title.

pdftitle didn't work.

PDFminer outputs everything on the page. Is there a way to tell it to look only for something that looks like a title, or should I try another application?

pdf2txt.py -p 3 input.pdf
Thank you.
#2
You need something that will extract the header.
See https://github.com/pymupdf/PyMuPDF, which can do this.
Installation instructions are included on the GitHub page.
There is a blog post on using it to extract header information here: https://towardsdatascience.com/extractin...6e8421c467
#3
Using PyPDF2, the following code will read and print the text on the third page. Print it to the terminal and look through it to decide what criteria you will use to isolate just the title. If you post an example of one of the PDFs that you are scanning, I might be able to help further.

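# Written for PyPDF2 3.x; older releases used PdfFileReader, getPage and extractText.
import PyPDF2

# Open in binary mode; the with block closes the file afterwards.
with open('test.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    third_page = pdf_reader.pages[2]  # pages are zero-indexed, so 2 is page 3
    print(third_page.extract_text())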
#4
Thanks for the pointers to PyMuPDF and PyPDF2.

I know the title on page 3 always uses Arial bold 19.5 pt, but it can stretch to two or three lines.

What would be the simplest way to get the string?

Can those packages build a tree that I can navigate à la lxml?
#5
For others' benefit: If the PDF contains a ToC ("bookmarks"), it might have what you're looking for.

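# pip install PyMuPDF
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
# Each ToC entry is [level, title, page]; get_toc() was called getToC() in older releases.
toc = doc.get_toc()
for item in toc:
    print(item[1])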
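
If the PDF has no bookmarks, the font details mentioned earlier in the thread (Arial bold, 19.5 pt, page 3, possibly two or three lines) can be used instead. PyMuPDF's page.get_text("dict") returns a nested blocks / lines / spans structure, which is about as close to an lxml-style tree as it gets. Below is only a minimal sketch: the function names, the "arial" substring match, the bold check and the 0.2 pt tolerance are my own assumptions, and the font name actually embedded in your files may differ (print a few spans first to check).

```python
def title_from_page_dict(page_dict, font_hint="arial", size=19.5, tol=0.2):
    """Join the text of every span that matches the assumed title style.

    page_dict is the structure returned by page.get_text("dict") in PyMuPDF.
    font_hint, size and tol are illustrative defaults, not library settings.
    """
    parts = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so .get() skips them.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                font = span.get("font", "").lower()
                # Bold shows up either in the font name or as bit 4 (16) of flags.
                is_bold = "bold" in font or bool(span.get("flags", 0) & 16)
                size_ok = abs(span.get("size", 0) - size) <= tol
                if font_hint in font and is_bold and size_ok:
                    text = span.get("text", "").strip()
                    if text:
                        parts.append(text)
    # A title wrapped over two or three lines comes back as one string.
    return " ".join(parts)


def extract_title(path, page_number=2):
    """Return the title found on the given zero-based page (2 is page 3)."""
    # Imported here so the helper above can be tried without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return title_from_page_dict(doc[page_number].get_text("dict"))
```

To loop over a folder, call extract_title(path) for each file. An empty string means no span matched, in which case the font name or size assumption probably needs adjusting.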