[PyMuPDF] Grab all strings of a given size?

[PyMuPDF] Grab all strings of a given size? - Winfried - Dec-16-2023

Hello,

I need to loop through a bunch of PDFs, each containing one or more articles.

I notice titles can use different fonts, but all seem to have the same size (19.5 points).

Can PyMuPDF (or some other library) grab all strings of a given size in a PDF, so I can build a master Table of Contents?

Thank you.

[attachment=2683]

RE: [PyMuPDF] Grab all strings of a given size? - DPaul - Dec-24-2023

I don't know about PyMuPDF for this, but if you convert the PDF pages to images, e.g. TIFFs (using fitz),
pytesseract can do that. You need something that measures the text height in pixels.
Possibly PyMuPDF could do that itself, depending on how the PDF was made, but I don't use it for that.
Paul
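
When the PDF carries real text rather than scanned images, PyMuPDF reports a font size for every text span, so no OCR should be needed. Below is a minimal sketch, not a tested solution for the attached file. The helper names spans_of_size and titles_in_pdf, the tolerance of 0.1 points, and the default of 19.5 (taken from the question) are assumptions made for illustration. The real PyMuPDF piece is page.get_text("dict"), which returns blocks, lines and spans.

```python
def spans_of_size(page_dict, target=19.5, tol=0.1):
    """Return (font, text) for every span whose size is within tol of target.

    page_dict is the structure returned by PyMuPDF's page.get_text("dict").
    """
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                text = span["text"].strip()
                if text and abs(span["size"] - target) <= tol:
                    hits.append((span["font"], text))
    return hits


def titles_in_pdf(path, target=19.5, tol=0.1):
    """Collect (page number, font, text) for matching spans in one PDF."""
    import fitz  # PyMuPDF; imported here so spans_of_size works without it

    titles = []
    with fitz.open(path) as doc:
        for page in doc:
            page_dict = page.get_text("dict")
            for font, text in spans_of_size(page_dict, target, tol):
                titles.append((page.number + 1, font, text))
    return titles
```

Calling titles_in_pdf on each file in a loop would give the raw material for a master table of contents. A title that wraps over two lines, or mixes fonts, comes back as several spans, so some merging may still be needed.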

RE: [PyMuPDF] Grab all strings of a given size? - Pedroski55 - Dec-25-2023

Hand over a pdf so that people can try!

Try pdfplumber, it should do what you want!

Joyeux Noël!

RE: [PyMuPDF] Grab all strings of a given size? - Pedroski55 - Dec-26-2023

Something like this:

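import pdfplumber

path2pdf = '/home/pedro/pdfs/pdfs/various_text_sizes.pdf'
# pdf.chars lists every character in the document, one dict per character
with pdfplumber.open(path2pdf) as pdf:
    # print the list index, height and text of the first 11 characters
    for i, char in enumerate(pdf.chars[:11]):
        print(i, char['height'], char['text'])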

The first line of my PDF is centred text at size 28: A Big Title
The output below shows the list index and text height for each character of the first line of a little PDF I made.

Output:
0 28.0 A
1 28.0  
2 28.0 B
3 28.0 i
4 28.0 g
5 28.0  
6 28.0 T
7 28.0 i
8 28.0 t
9 28.0 l
10 28.0 e
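
To go from single characters to whole strings, consecutive characters of the wanted height can be joined into runs. This is only a sketch: the names strings_of_height and strings_in_pdf and the 0.1 tolerance are made up for illustration, and it relies only on the 'height' and 'text' keys shown in the output above.

```python
def strings_of_height(chars, target=28.0, tol=0.1):
    """Join consecutive characters whose height matches target into strings.

    chars is a list of dicts shaped like pdfplumber's pdf.chars; only the
    'height' and 'text' keys are used.
    """
    strings, current = [], []
    for char in chars:
        if abs(char["height"] - target) <= tol:
            current.append(char["text"])
        elif current:
            # A character of another height ends the current run.
            strings.append("".join(current).strip())
            current = []
    if current:
        strings.append("".join(current).strip())
    # Drop runs that consisted only of whitespace.
    return [s for s in strings if s]


def strings_in_pdf(path, target=28.0, tol=0.1):
    """Open a PDF with pdfplumber and return its strings of the given height."""
    import pdfplumber  # imported lazily so the helper above runs without it

    with pdfplumber.open(path) as pdf:
        return strings_of_height(pdf.chars, target, tol)
```

With the little test PDF above, strings_in_pdf(path2pdf) should give back the big title as one string. Two titles that follow each other with no other text in between would be merged into one run, so a check on line position may be needed for real documents.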