Python Forum

[PyMuPDF] Grab all strings of a given size?
Hello,

I need to loop through a bunch of PDFs, each containing one or more articles.

I noticed that titles can use different fonts, but they all seem to have the same size (19.5 points).

Can PyMuPDF (or some other library) grab all strings of a given size in a PDF, so I can build a master Table of Contents?

Thank you.

I don't know whether PyMuPDF can do this directly, but if you render the PDF pages to images, e.g. TIFFs (using fitz),
pytesseract can do that. You need something that measures the text height in pixels.
Possibly PyMuPDF could do it on its own, depending on how the PDF was made, but I don't use it for that.
Paul
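
For reference, here is a minimal sketch of how PyMuPDF might do this directly, assuming the PDFs have a real text layer (not scanned images). It relies on page.get_text("dict"), which returns blocks, lines and spans, each span carrying its font size. The helper names spans_of_size and titles_in_pdf and the 0.1 pt tolerance are made up for illustration, and 19.5 is simply the size mentioned in the question.

```python
def spans_of_size(page_dict, target_size, tol=0.1):
    """Return the text of every span whose font size is within tol of target_size.

    page_dict is assumed to have the layout returned by PyMuPDF's
    page.get_text("dict"): blocks -> lines -> spans, where each span
    has "size" and "text" keys.
    """
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                text = span["text"].strip()
                if text and abs(span["size"] - target_size) <= tol:
                    hits.append(text)
    return hits


def titles_in_pdf(path, target_size=19.5):
    """Yield (page_number, text) for each matching span in the PDF at path."""
    import fitz  # PyMuPDF; imported here so spans_of_size works without it

    with fitz.open(path) as doc:
        for page in doc:
            for text in spans_of_size(page.get_text("dict"), target_size):
                # page.number is 0-based, so add 1 for a human-readable page.
                yield page.number + 1, text
```

Looping over titles_in_pdf("some_file.pdf") for each file (the file name is a placeholder) would give the raw material for a table of contents. A title that wraps over two lines comes back as two spans, so some merging may still be needed.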
Post a sample PDF so that people can try!

Try pdfplumber; it should do what you want!

Merry Christmas!
Something like this:

import pdfplumber

path2pdf = '/home/pedro/pdfs/pdfs/various_text_sizes.pdf'
# Print the index, height and text of the first 11 characters in the PDF
with pdfplumber.open(path2pdf) as pdf:
    for i, char in enumerate(pdf.chars[:11]):
        print(i, char['height'], char['text'])
The first line of my PDF is centred text at size 28: A Big Title
The output below shows the list index, the text height and the character itself for each character of that first line, in a little PDF I made.

Output:
0 28.0 A
1 28.0
2 28.0 B
3 28.0 i
4 28.0 g
5 28.0
6 28.0 T
7 28.0 i
8 28.0 t
9 28.0 l
10 28.0 e
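
To go from single characters to whole strings of a given size, something like the following sketch could work. It assumes the characters arrive in reading order and that each one is a dict with 'text' and 'height' keys, as in the output above. The function name strings_of_size and the 0.1 tolerance are made up for illustration.

```python
def strings_of_size(chars, target_size, tol=0.1, key="height"):
    """Join runs of consecutive characters whose size matches target_size.

    chars is assumed to be a list of dicts like pdfplumber's page.chars,
    each with a "text" entry and a numeric entry under key.
    """
    strings, current = [], []
    for ch in chars:
        if abs(ch[key] - target_size) <= tol:
            current.append(ch["text"])
        elif current:
            # A character of another size ends the current run.
            strings.append("".join(current).strip())
            current = []
    if current:
        strings.append("".join(current).strip())
    # Drop runs that contained only whitespace.
    return [s for s in strings if s]
```

With pdfplumber this could be called once per page, e.g. strings_of_size(page.chars, 19.5) inside a loop over pdf.pages. Two titles that directly follow each other with no other text in between would be merged into one string, so the result may need a manual check.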