Python Forum
Convert Scanned PDF to Searchable PDF




Convert Scanned PDF to Searchable PDF - fuzzin - Mar-11-2022

I'm writing software in Python and I need to convert a scanned PDF into a searchable digital PDF. I'm using the OCRmyPDF module, but the process is very time-consuming for large PDFs, uses a lot of memory, and crashes the program. I'm running on Windows. Does anyone know of another module I could use, or have any tips?


RE: Convert Scanned PDF to Searchable PDF - Larz60+ - Mar-11-2022

Is the scanned PDF available on the net in digital format? If so, you could avoid a lot of grief.
OCR is slow by nature, understandably if you think of what it has to accomplish.
It will also try to convert any writing or marks that were on the scanned document.


RE: Convert Scanned PDF to Searchable PDF - DeaD_EyE - Mar-11-2022

Maybe pytesseract helps. It can also do OCR on PDF documents. But don't expect a good result.

Edit: It can only output PDF documents; it does not read PDFs :-/
So you have to extract all pages from the PDF as images and apply pytesseract to those images.
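
A minimal sketch of that approach, in case it helps. It assumes the third-party packages pdf2image (which needs Poppler installed), pytesseract (which needs the Tesseract binary installed) and pypdf; the function names `page_batches` and `make_searchable` and the default values are made up for illustration. Pages are rendered in small batches so that only a few page images are held in memory at a time, which may help with the memory problem on large PDFs.

```python
import io


def page_batches(total_pages, batch_size):
    """Yield 1-based, inclusive (first, last) page ranges of at most batch_size pages."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def make_searchable(src_pdf, dst_pdf, batch_size=10, dpi=300, lang="eng"):
    """OCR a scanned PDF page by page and write a searchable PDF (hypothetical helper)."""
    # Third-party imports are kept local so page_batches() works without them.
    from pdf2image import convert_from_path, pdfinfo_from_path
    from pypdf import PdfReader, PdfWriter
    import pytesseract

    total = pdfinfo_from_path(src_pdf)["Pages"]
    writer = PdfWriter()
    for first, last in page_batches(total, batch_size):
        # Render only this batch of pages to PIL images.
        images = convert_from_path(src_pdf, dpi=dpi, first_page=first, last_page=last)
        for image in images:
            # Tesseract returns a one-page PDF (image plus invisible text layer) as bytes.
            page_pdf = pytesseract.image_to_pdf_or_hocr(image, extension="pdf", lang=lang)
            writer.add_page(PdfReader(io.BytesIO(page_pdf)).pages[0])
    with open(dst_pdf, "wb") as fh:
        writer.write(fh)
```

Usage would be something like `make_searchable("scan.pdf", "searchable.pdf")`. The OCR itself is still slow; batching only limits how many rendered pages are in memory at once. The dpi of 300 is an assumption, and a lower value trades accuracy for speed.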


RE: Convert Scanned PDF to Searchable PDF - Pedroski55 - Mar-11-2022

In China we just send the book to a Taobao shop: they unbind your book, scan it, OCR it, rebind it (maybe slightly damaged), then send you a text PDF via Baidu and your book by post.

You can do all this yourself, but it is a hassle. I don't have a duplex scanner, so I scan the odd pages and the even pages separately, then merge them, then OCR them.

Forget it! A 600+ page German grammar book cost me 60 yuan, less than US$10. Really not worth doing it yourself!
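
For anyone who does want to do the odd/even merge themselves, here is a rough sketch using the third-party pypdf package. The function names and the `evens_reversed` flag are hypothetical; the flag covers the case where the stack was flipped over for the second pass, so the even pages come out in reverse order.

```python
def interleave_order(n_odd, n_even, evens_reversed=False):
    """Return [("odd" | "even", index), ...] in reading order."""
    if n_odd - n_even not in (0, 1):
        raise ValueError("expected as many even pages as odd pages, or one fewer")
    even_idx = list(range(n_even))
    if evens_reversed:
        even_idx.reverse()
    order = []
    for i in range(n_odd):
        order.append(("odd", i))
        if i < n_even:
            order.append(("even", even_idx[i]))
    return order


def merge_scans(odd_pdf, even_pdf, dst_pdf, evens_reversed=False):
    """Interleave two single-sided scans into one PDF (hypothetical helper)."""
    # Local import so interleave_order() works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    sources = {"odd": PdfReader(odd_pdf), "even": PdfReader(even_pdf)}
    order = interleave_order(
        len(sources["odd"].pages), len(sources["even"].pages), evens_reversed
    )
    writer = PdfWriter()
    for name, index in order:
        writer.add_page(sources[name].pages[index])
    with open(dst_pdf, "wb") as fh:
        writer.write(fh)
```

For example, `merge_scans("odd.pdf", "even.pdf", "book.pdf", evens_reversed=True)` would give a merged file that can then go through OCR.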