Thread: pdf lookalikes - Python Forum (https://python-forum.io)
pdf lookalikes - DPaul - Sep-18-2022

Hi,

Some service-minded villages have made their thousands of prayer cards available to the public as PDFs. Or so it seems, because the file extension is *.pdf. But when you open one, Acrobat says: "This document CLAIMS TO BE a pdf/a file...". Wow... they must have used some alien tool to generate it.

Let's try Python: pytesseract, pdfplumber, pdfminer, PyPDF2... all open the document but see no text. I cannot select text with the cursor, but the Acrobat cursor is a crosshair: I can select an area and save that manually as an image. But not 10,000 times.

So any ideas on yet another module to open alien PDFs?

thx,
Paul

Edit: don't worry, Python is there for you. I found a 3-tier solution that eventually yields the text using PyMuPDF. A little cumbersome, but it works.
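The post does not say what the three tiers were, so the following is only a guess at what such a fallback chain might look like, not Paul's actual code. It assumes PyMuPDF (`fitz`), `pytesseract` with a working Tesseract install, and Pillow. The function names (`first_nonempty`, `text_layer`, `ocr_embedded_images`, `ocr_rendered_pages`) and the order of the tiers are made up for illustration.

```python
from typing import Callable, Iterable


def first_nonempty(path: str, extractors: Iterable[Callable[[str], str]]) -> str:
    """Try each extractor in order; return the first result that contains text."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # one tier failing should not stop the later tiers
        if text and text.strip():
            return text
    return ""


def text_layer(path: str) -> str:
    """Tier 1 (assumed): read the embedded text layer, if the PDF has one."""
    import fitz  # PyMuPDF, imported lazily so first_nonempty works without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def ocr_embedded_images(path: str) -> str:
    """Tier 2 (assumed): pull out the images stored in the PDF and OCR them."""
    import io
    import fitz
    import pytesseract
    from PIL import Image

    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple item is the image's xref number
                data = doc.extract_image(xref)["image"]
                parts.append(pytesseract.image_to_string(Image.open(io.BytesIO(data))))
    return "\n".join(parts)


def ocr_rendered_pages(path: str, dpi: int = 300) -> str:
    """Tier 3 (assumed): render every page to a bitmap and OCR the result."""
    import io
    import fitz
    import pytesseract
    from PIL import Image

    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            parts.append(pytesseract.image_to_string(image))
    return "\n".join(parts)


TIERS = [text_layer, ocr_embedded_images, ocr_rendered_pages]
```

Calling `first_nonempty("card.pdf", TIERS)` (the file name is a placeholder) would then try the cheap text layer first and only fall back to OCR when nothing comes out. Keeping the tiers as plain functions means the loop over 10,000 files stays a one-liner, and a tier can be swapped or reordered without touching the rest.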