Python Forum

Convert Scanned PDF to Searchable PDF
I'm writing software in Python and I need to convert a scanned PDF into a searchable PDF. I'm using the OCRmyPDF module, but for large PDFs the process is very slow and uses so much memory that it crashes the program. I'm running on Windows. Does anyone know of another module, or have any tips I can use?
Is the scanned document available online in digital format? If so, you could avoid a lot of grief.
OCR is slow by default, which is understandable if you think of what it has to accomplish.
It will also try to convert any writing or marks that were on the scanned document.
Maybe pytesseract helps. It can also do OCR on PDF documents, but don't expect a good result.

Edit: It can only output PDF documents; it cannot read PDFs :-/
So you have to extract all pages from the PDF as images and apply pytesseract to those images.
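
A rough sketch of that page-by-page approach, in case it helps. It assumes the pdf2image package (which needs Poppler installed), pytesseract (which needs the Tesseract binary) and pypdf; none of these were named in the thread apart from pytesseract. The function names, batch size and DPI are just illustrative choices. Rendering only a few pages at a time keeps the number of page images held in memory small, although the finished pages still accumulate in the writer until the file is saved.

```python
import io


def page_batches(total_pages, batch_size):
    """Yield 1-based, inclusive (first, last) page ranges of at most batch_size pages."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(src_path, dst_path, batch_size=10, dpi=200, lang="eng"):
    """Render a scanned PDF a few pages at a time, OCR each page, save one searchable PDF."""
    # Third-party imports are local so page_batches() works without them installed.
    # Assumed packages: pdf2image (needs Poppler), pytesseract (needs Tesseract), pypdf.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract
    from pypdf import PdfWriter

    total_pages = pdfinfo_from_path(src_path)["Pages"]
    writer = PdfWriter()

    for first, last in page_batches(total_pages, batch_size):
        # Only this batch of page images is rendered and held in memory.
        images = convert_from_path(src_path, dpi=dpi, first_page=first, last_page=last)
        for image in images:
            # Tesseract returns a one-page PDF (image plus hidden text layer) as bytes.
            page_pdf = pytesseract.image_to_pdf_or_hocr(image, extension="pdf", lang=lang)
            writer.append(io.BytesIO(page_pdf))

    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

Usage would be something like `ocr_pdf_in_batches("scan.pdf", "searchable.pdf", batch_size=5)`, where both file names are placeholders. A lower `dpi` is faster and lighter on memory but usually hurts recognition quality.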
In China we just send the book to a Taobao shop. They unbind your book, scan it, OCR it, rebind it (maybe slightly damaged), then send you a text PDF via Baidu and your book by post.

You can do all this yourself, but it is a hassle. I don't have a duplex scanner, so I scan the odd pages and the even pages separately, then merge them, then OCR them.

Forget it! A 600+ page German grammar book cost me 60 yuan, less than US$10. Really not worth doing it yourself!
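
For anyone who does want to do the odd/even merge themselves, here is a minimal sketch. It assumes the pypdf package and two separate scans, one holding the odd pages and one holding the even pages; the function names and the `even_reversed` flag are made up for illustration. The flag covers the common case where the stack is flipped over for the second pass, so the even pages come out in reverse order.

```python
def interleave_pages(odd_pages, even_pages, even_reversed=False):
    """Return the pages in reading order: odd[0], even[0], odd[1], even[1], ..."""
    if even_reversed:
        # The even pages were scanned back to front (stack flipped over).
        even_pages = list(reversed(even_pages))
    merged = []
    for i, odd in enumerate(odd_pages):
        merged.append(odd)
        if i < len(even_pages):
            merged.append(even_pages[i])
    return merged


def merge_scans(odd_path, even_path, out_path, even_reversed=False):
    """Combine an odd-pages scan and an even-pages scan into one PDF."""
    # Assumed package: pypdf. Imported locally so interleave_pages() has no dependencies.
    from pypdf import PdfReader, PdfWriter

    odd = list(PdfReader(odd_path).pages)
    even = list(PdfReader(even_path).pages)

    writer = PdfWriter()
    for page in interleave_pages(odd, even, even_reversed):
        writer.add_page(page)

    with open(out_path, "wb") as fh:
        writer.write(fh)
```

The merged file can then be run through whichever OCR tool you prefer.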