[PyMuPDF] Grab all strings of a given size? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: [PyMuPDF] Grab all strings of a given size? (/thread-41301.html)
[PyMuPDF] Grab all strings of a given size? - Winfried - Dec-16-2023

Hello,

I need to loop through a bunch of PDFs, each containing one or more articles. I notice that titles can use different fonts, but they all seem to have the same size (19.5 points).

Can PyMuPDF (or some other library) grab all strings of a given size in a PDF, so I can build a master Table of Contents?

Thank you.

[attachment=2683]

RE: [PyMuPDF] Grab all strings of a given size? - DPaul - Dec-24-2023

I don't know about PyMuPDF, but if you turn the PDF pages into images, e.g. TIFFs (using fitz), pyTesseract can do that. You need something that measures the pixels. Possibly PyMuPDF could do that itself, depending on how the PDF was made, but I don't use it for that.

Paul

RE: [PyMuPDF] Grab all strings of a given size? - Pedroski55 - Dec-25-2023

Hand over a PDF so that people can try!

Try pdfplumber, it should do what you want!

Joyeux Noël!

RE: [PyMuPDF] Grab all strings of a given size? - Pedroski55 - Dec-26-2023

Something like this:

    import pdfplumber

    path2pdf = '/home/pedro/pdfs/pdfs/various_text_sizes.pdf'
    pdf = pdfplumber.open(path2pdf)
    # Print the index, height, and text of the first 11 characters
    for i in range(0, 11):
        print(i, pdf.chars[i]['height'], pdf.chars[i]['text'])

The first line of my PDF is centred text at size 28:

A Big Title

The output below shows the list index and text height for each character of the first line of a little PDF I made.
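To go from per-character heights to whole strings, one possible approach is to join consecutive characters whose height matches the target. The sketch below is only an illustration: the helper names `strings_at_height` and `titles_with_pdfplumber` and the tolerance `tol` are made up for this example, and it assumes that a character's `height` is a good stand-in for its font size, which may not hold for every PDF (pdfplumber also reports a `size` value per character that may be worth comparing).

```python
def strings_at_height(chars, target, tol=0.5):
    """Join runs of consecutive characters whose height is within tol of target.

    `chars` is a list of dicts with at least 'text' and 'height' keys,
    the shape pdfplumber uses for page.chars (hypothetical helper).
    """
    strings, current = [], []
    for ch in chars:
        if abs(ch["height"] - target) <= tol:
            current.append(ch["text"])
        elif current:
            # A character of another size ends the current run
            strings.append("".join(current).strip())
            current = []
    if current:
        strings.append("".join(current).strip())
    return [s for s in strings if s]


def titles_with_pdfplumber(path, target=19.5):
    """Return (page number, string) pairs; needs pdfplumber installed."""
    import pdfplumber  # third-party, imported lazily

    found = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for text in strings_at_height(page.chars, target):
                found.append((page.page_number, text))
    return found
```

Working page by page rather than on `pdf.chars` keeps the page number available for the Table of Contents. Two titles that follow each other with no smaller text in between would be merged into one string by this simple run-based grouping.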
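For the original question, PyMuPDF can also report font sizes directly when the PDF contains real text (not scanned images): `page.get_text("dict")` returns blocks, lines, and spans, and each span carries its `size` and `text`. The following is a minimal sketch under that assumption, not a tested solution for the attached file; the helper names `spans_of_size` and `titles_with_pymupdf` and the tolerance `tol` are invented for illustration.

```python
def spans_of_size(page_dict, target, tol=0.1):
    """Collect the text of every span whose font size is within tol of target.

    `page_dict` has the shape returned by PyMuPDF's page.get_text("dict").
    """
    found = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if text and abs(span["size"] - target) <= tol:
                    found.append(text)
    return found


def titles_with_pymupdf(path, target=19.5):
    """Return (page number, string) pairs; needs PyMuPDF installed."""
    import fitz  # PyMuPDF, imported lazily

    toc = []
    with fitz.open(path) as doc:
        for page in doc:
            for text in spans_of_size(page.get_text("dict"), target):
                toc.append((page.number + 1, text))  # page.number is 0-based
    return toc
```

A title that wraps over two lines, or mixes fonts, comes back as several spans, so the results may need to be joined per line or per block before they go into a Table of Contents.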