[PyMuPDF] Grab all strings of a given size?

Winfried · Dec-16-2023, 12:51 PM

Hello,

I need to loop through a bunch of PDFs, each containing one or more articles.

I notice titles can use different fonts, but all seem to have the same size (19.5 points).

Can PyMuPDF (or some other library) grab all strings of a given size in a PDF, so I can build a master Table of Contents?

Thank you.

DPaul · (This post was last modified: Dec-24-2023, 06:14 PM by DPaul.)

I don't know whether PyMuPDF can do this directly, but if you render the PDF pages to images such as TIFFs (using fitz),
pytesseract can do that. You need something that measures the text height in pixels.
Possibly PyMuPDF could do it, depending on how the PDF was made, but I don't use it for that.
Paul
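
For PDFs that carry a real text layer (not scanned images), a minimal PyMuPDF sketch could look like the one below. It relies on page.get_text("dict"), which returns blocks, lines and spans, each span carrying its "size" and "text". The helper names spans_of_size and titles_by_page, the 19.5 default and the 0.1 tolerance are assumptions made for illustration, not something from this thread.

```python
def spans_of_size(page_dict, target, tol=0.1):
    """Return the text of each line's spans whose font size is within tol of target."""
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            text = "".join(
                span["text"]
                for span in line["spans"]
                if abs(span["size"] - target) <= tol
            ).strip()
            if text:
                hits.append(text)
    return hits


def titles_by_page(path, target=19.5, tol=0.1):
    """Collect (page number, text) pairs for every matching line in a PDF."""
    import fitz  # PyMuPDF; imported here so spans_of_size works without it

    toc = []
    with fitz.open(path) as doc:
        for page in doc:
            for text in spans_of_size(page.get_text("dict"), target, tol):
                toc.append((page.number + 1, text))  # page.number is 0-based
    return toc
```

Calling titles_by_page on each file in turn would give the raw material for a master table of contents. A title that wraps over two lines comes back as two entries, so some merging may still be needed.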

Pedroski55 · Dec-25-2023, 08:51 AM

Hand over a pdf so that people can try!

Try pdfplumber, it should do what you want!

Merry Christmas!

Pedroski55 · Dec-26-2023, 07:39 AM

Something like this:

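import pdfplumber

path2pdf = '/home/pedro/pdfs/pdfs/various_text_sizes.pdf'

# Each entry in pdf.chars is a dict describing one character of the document.
with pdfplumber.open(path2pdf) as pdf:
    # Print index, height and text of the first 11 characters.
    for i, char in enumerate(pdf.chars[:11]):
        print(i, char['height'], char['text'])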

The first line of my PDF is centred text at size 28: A Big Title
The output below shows the list index, the text height and the character itself for each character of the first line of a little PDF I made.

Output:
0 28.0 A
1 28.0  
2 28.0 B
3 28.0 i
4 28.0 g
5 28.0  
6 28.0 T
7 28.0 i
8 28.0 t
9 28.0 l
10 28.0 e
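
To go from single characters to whole titles, one option is to join consecutive characters that share the wanted height. The sketch below assumes the same 'height' and 'text' keys printed above; the function names strings_of_height and titles_in_pdf and the 0.1 tolerance are made up for illustration. pdfplumber also stores the font size under 'size', which may be the better key if height and font size differ in your files.

```python
def strings_of_height(chars, target, tol=0.1):
    """Join runs of consecutive characters whose height is within tol of target."""
    found, run = [], []
    for ch in chars:
        if abs(ch["height"] - target) <= tol:
            run.append(ch["text"])
        elif run:
            # A character of another height ends the current run.
            found.append("".join(run).strip())
            run = []
    if run:
        found.append("".join(run).strip())
    return [s for s in found if s]  # drop runs that were only whitespace


def titles_in_pdf(path, target, tol=0.1):
    """Apply strings_of_height to every character of one PDF."""
    import pdfplumber  # imported here so the helper above has no dependency

    with pdfplumber.open(path) as pdf:
        return strings_of_height(pdf.chars, target, tol)
```

With the little test PDF above, titles_in_pdf(path2pdf, 28) should pick up the big title. Two titles that follow each other with no other text in between would be merged into one string, so real files may need an extra check on the 'top' coordinate.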
