Python Forum

Full Version: pdf lookalikes
Hi,
Some service-minded villages have made their thousands of prayer cards available to the public as PDFs.
Or so it seems, because the file extension is .pdf.
But when you open one, Acrobat says: "This document CLAIMS TO BE a pdf/a file...". Wow, they must have used some
alien tool to generate it.

Let's try Python: pytesseract, pdfplumber, pdfminer, PyPDF2... They all open the document but find no text.

I cannot select text with the cursor. The Acrobat cursor is a crosshair, though, so I can select an area and save it
manually as an image. But not 10,000 times.

So, any ideas for yet another module that can open alien PDFs?
Thanks,
Paul

Edit: don't worry, Python is there for you. I found a 3-tier solution that eventually yields the text using PyMuPDF. A little cumbersome, but it works.
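
The edit above doesn't say what the three tiers are, so the following is only a guess at what such a fallback chain might look like, not the actual solution. It assumes the pages are scanned images without a text layer, that a recent PyMuPDF (imported as fitz), Pillow, and pytesseract are installed, and that the Tesseract binary is on the PATH. The names has_usable_text and extract_text, and the min_chars and dpi defaults, are made up for illustration.

```python
def has_usable_text(text, min_chars=20):
    """Return True if the extracted text holds at least min_chars non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def extract_text(pdf_path, dpi=300, min_chars=20):
    """Return one string per page, falling back to OCR when a page has no text layer."""
    # Third-party imports stay local so has_usable_text works without them installed.
    import io

    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Step 1: ask for the embedded text layer, if there is one.
            text = page.get_text()
            if not has_usable_text(text, min_chars):
                # Step 2: render the whole page to a PNG image in memory.
                pix = page.get_pixmap(dpi=dpi)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                # Step 3: run OCR on the rendered image.
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return pages
```

Calling extract_text("card_0001.pdf") (a hypothetical file name) would return a list with one string per page. Looping over a folder of 10,000 files is then a matter of pathlib.Path(...).glob("*.pdf"), although OCR at 300 dpi will be slow at that scale.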