Python Forum
pdf lookalikes - Printable Version


pdf lookalikes - DPaul - Sep-18-2022

Hi,
Some service-minded villages have made their thousands of prayer cards available to the public as PDFs.
Or so it seems, because the thing's extension is *.pdf.
But when you open one, Acrobat says: "This document CLAIMS TO BE a PDF/A file...". Wow... they must have used some
alien tool to generate it.

Let's try Python: pytesseract, pdfplumber, pdfminer, PyPDF2... all open the document but see no text.

I cannot select text with the cursor, but the Acrobat cursor is a crosshair: I can select an area and then save it
manually as an image. But not 10,000 times.
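
That behaviour suggests the pages are scanned images with no text layer. A minimal sketch to check this per page, assuming PyMuPDF is installed (imported as fitz); the names classify_page and survey_pdf are made up for illustration:

```python
from typing import List


def classify_page(text: str, image_count: int) -> str:
    """Label a page from its extracted text and its number of embedded images."""
    if text.strip():
        return "text"  # a real text layer exists
    # No text: either a scanned image or a genuinely empty page.
    return "scan" if image_count else "empty"


def survey_pdf(path: str) -> List[str]:
    """Return one label per page of the PDF at `path` (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so classify_page works without it

    doc = fitz.open(path)
    try:
        return [
            classify_page(page.get_text(), len(page.get_images(full=True)))
            for page in doc
        ]
    finally:
        doc.close()
```

If every page comes back as "scan", no text-extraction module will find anything and OCR is the only way in.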

So, any ideas on yet another module to open alien PDFs?
thx,
Paul

Edit: don't worry, Python is there for you. I found a 3-tier solution that eventually yields the text using PyMuPDF. A little cumbersome, but it works.
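
The three steps are not spelled out here, so what follows is only a guess at what such a pipeline could look like, not necessarily the one actually used: try the embedded text first, otherwise render the page to an image with PyMuPDF, then OCR that image with pytesseract. It assumes PyMuPDF, Pillow, pytesseract and the Tesseract binary are installed; the function names, the 300 dpi setting and the "eng" language code are all assumptions.

```python
import io
from typing import Callable, List


def page_text_or_ocr(embedded_text: str, ocr: Callable[[], str]) -> str:
    """Return the embedded text if there is any, else the result of ocr()."""
    text = embedded_text.strip()
    return text if text else ocr().strip()


def ocr_page(page, dpi: int = 300, lang: str = "eng") -> str:
    """Render one PyMuPDF page to PNG and run Tesseract on it."""
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    zoom = dpi / 72  # PDF user space is 72 units per inch
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image, lang=lang)


def extract_pdf_text(path: str, dpi: int = 300, lang: str = "eng") -> List[str]:
    """Return the text of every page, using OCR only where needed."""
    import fitz  # PyMuPDF

    doc = fitz.open(path)
    try:
        return [
            page_text_or_ocr(page.get_text(), lambda p=page: ocr_page(p, dpi, lang))
            for page in doc
        ]
    finally:
        doc.close()
```

Calling extract_pdf_text("card.pdf") (file name hypothetical) in a loop over all files would replace the manual select-and-save routine; the OCR quality depends heavily on the scan resolution and on picking the right Tesseract language pack.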