Python Forum
[PyMuPDF] Grab all strings of a given size?
#1
Hello,

I need to loop through a bunch of PDFs, each containing one or more articles.

I noticed that the titles can use different fonts, but they all seem to have the same size (19.5 points).

Can PyMuPDF (or some other library) grab all strings of a given size in a PDF, so I can build a master Table of Contents?

Thank you.

   
Reply
#2
I don't know about PyMuPDF for this, but if you render the PDF pages to images, e.g. TIFFs (using fitz),
pytesseract can do that. You need something that measures the text height in pixels.
PyMuPDF could possibly do it directly, depending on how the PDF was made, but I don't use it for that.
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
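For PDFs that have a real text layer, rendering to images should not be needed: PyMuPDF's page.get_text("dict") returns blocks, lines and spans, and each span carries its font "size". Below is a minimal sketch, assuming the titles are set at 19.5 points as in the question. The names spans_of_size and titles_with_pymupdf and the 0.1 point tolerance are hypothetical choices for illustration, not part of PyMuPDF.

```python
def spans_of_size(page_dict, size, tol=0.1):
    """For each line, join the spans whose font size is within tol of size."""
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            parts = [span["text"] for span in line["spans"]
                     if abs(span["size"] - size) <= tol]
            text = "".join(parts).strip()
            if text:
                hits.append(text)
    return hits


def titles_with_pymupdf(path, size=19.5):
    """Yield (page_number, text) for every line set at the given size."""
    import fitz  # PyMuPDF; imported lazily so spans_of_size needs no install

    with fitz.open(path) as doc:
        for page in doc:
            for text in spans_of_size(page.get_text("dict"), size):
                yield page.number + 1, text  # page.number is 0-based
```

Calling list(titles_with_pymupdf("some.pdf")) on each file would give the raw material for a master table of contents. The tolerance is there because reported sizes are floats and may not be exactly 19.5. Scanned PDFs without a text layer would still need the OCR route.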
Reply
#3
Hand over a pdf so that people can try!

Try pdfplumber, it should do what you want!

Merry Christmas!
Reply
#4
Something like this:

import pdfplumber

path2pdf = '/home/pedro/pdfs/pdfs/various_text_sizes.pdf'
# Open the PDF and print the index, height and text of the first 11 characters
with pdfplumber.open(path2pdf) as pdf:
    for i, char in enumerate(pdf.chars[:11]):
        print(i, char['height'], char['text'])
The first line of my PDF is centred text at size 28: A Big Title
The output below shows the list index and text height for each character of the first line of a little PDF I made.

Output:
0 28.0 A
1 28.0  
2 28.0 B
3 28.0 i
4 28.0 g
5 28.0  
6 28.0 T
7 28.0 i
8 28.0 t
9 28.0 l
10 28.0 e
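To go from single characters to whole strings, consecutive characters of the same size can be joined. This is a minimal sketch that uses the 'size' key (the font size pdfplumber reports for each character) rather than 'height'. The function names and the tolerance are hypothetical. Note that two titles directly following each other at the same size would be merged into one string.

```python
from itertools import groupby


def strings_of_size(chars, size, tol=0.1):
    """Join runs of consecutive characters whose 'size' is within tol of size."""
    strings = []
    # groupby splits the character stream into matching / non-matching runs.
    for matches, run in groupby(chars, key=lambda c: abs(c["size"] - size) <= tol):
        if matches:
            text = "".join(c["text"] for c in run).strip()
            if text:
                strings.append(text)
    return strings


def titles_with_pdfplumber(path, size=28.0):
    """Return every string in the PDF set at the given size."""
    import pdfplumber  # third-party; only needed when reading a real PDF

    with pdfplumber.open(path) as pdf:
        return strings_of_size(pdf.chars, size)
```

With the little test PDF above, titles_with_pdfplumber(path2pdf) should pick up the size 28 first line. For the PDFs in the question, size=19.5 would be the value to pass.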
Reply