Python Forum
Convert Scanned PDF to Searchable PDF
#1
I'm writing software in Python and I need to convert a scanned PDF into a searchable PDF. I'm using the OCRmyPDF module, but for large PDFs the process is very slow and uses so much memory that the program crashes. I'm running on Windows. Does anyone know of another module, or have any tips I can use?
#2
Is the scanned PDF available online in digital format? If so, you could avoid a lot of grief.
OCR is slow by nature, which is understandable when you think of what it has to accomplish.
It will also try to convert any writing or marks that were on the scanned document.
#3
Maybe pytesseract helps. It can also do OCR on PDF documents. But don't expect a good result.

Edit: It can only output PDF documents; it does not read PDFs :-/
So you have to extract all pages from the PDF as images and apply pytesseract to those images.
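
A rough sketch of that approach, in case it helps. It assumes pdf2image (which needs Poppler installed), pytesseract (which needs the Tesseract binary) and pypdf for stitching the pages back together; none of these were named above except pytesseract, and the function names, the batch size and the DPI are just illustrative choices. Rendering the pages in small batches keeps the number of page images held in memory bounded, although the output PDF is still assembled in memory before it is written.

```python
import io


def page_batches(total_pages, batch_size):
    """Yield (first, last) 1-based page ranges that cover total_pages."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf(src_path, dst_path, dpi=300, batch_size=10, lang="eng"):
    """Render a scanned PDF page by page and write a searchable PDF."""
    # Third-party imports are local so page_batches() works without them.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract
    from pypdf import PdfReader, PdfWriter

    total_pages = pdfinfo_from_path(src_path)["Pages"]
    writer = PdfWriter()
    for first, last in page_batches(total_pages, batch_size):
        # Only this batch of pages is rendered to images at one time.
        images = convert_from_path(
            src_path, dpi=dpi, first_page=first, last_page=last
        )
        for image in images:
            # pytesseract returns a one-page PDF (image plus text layer) as bytes.
            page_pdf = pytesseract.image_to_pdf_or_hocr(
                image, extension="pdf", lang=lang
            )
            writer.add_page(PdfReader(io.BytesIO(page_pdf)).pages[0])
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

Usage would be something like `ocr_pdf("scan.pdf", "searchable.pdf")`, with both file names being placeholders. A lower `dpi` is faster and lighter on memory but usually hurts recognition quality.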
#4
In China we just send the book to a Taobao shop. They unbind the book, scan it, OCR it, rebind it (maybe slightly damaged), then send you a text PDF via Baidu and the book by post.

You can do all this yourself, but it is a hassle. I don't have a duplex scanner, so I scan the odd pages and the even pages separately, then merge them, then OCR them.
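
For the merge step, something like the following might work. It is only a sketch and assumes pypdf; the function names and the `even_reversed` flag are made up for illustration (the flag covers the case where the stack is flipped over for the second pass, so the even pages come out of the scanner last page first).

```python
def interleave_pages(odd_pages, even_pages, even_reversed=False):
    """Return the pages in reading order: odd, even, odd, even, ...

    Assumes there are at least as many odd pages as even pages,
    which is the case for a normally scanned book.
    """
    if even_reversed:
        even_pages = list(reversed(even_pages))
    merged = []
    for i, odd in enumerate(odd_pages):
        merged.append(odd)
        if i < len(even_pages):
            merged.append(even_pages[i])
    return merged


def merge_scans(odd_path, even_path, out_path, even_reversed=False):
    """Combine an odd-pages scan and an even-pages scan into one PDF."""
    # Local import so interleave_pages() can be used without pypdf.
    from pypdf import PdfReader, PdfWriter

    odd = list(PdfReader(odd_path).pages)
    even = list(PdfReader(even_path).pages)
    writer = PdfWriter()
    for page in interleave_pages(odd, even, even_reversed):
        writer.add_page(page)
    with open(out_path, "wb") as fh:
        writer.write(fh)
```

The merged file can then go through whatever OCR step you use.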

Forget it! A 600+ page German grammar book cost me 60 yuan, less than US$10. It's really not worth doing it yourself!