Python Forum
Creating a code to make unreadable pdf characters readable
#1
Hi There!

I just got into machine learning and data science and was wondering if it would be possible to write code for the following:

In my daily work I have to sift through thousands of old documents that were scanned at some point back in the early 2000s.
A lot of these documents are missing crucial data, simply because the scanning process was sloppy and of poor quality.
I was looking to write code that recognizes the damaged text (mostly numbers, and practically all in the same font) and then compares it against a known set of good-quality characters. The purpose of this code would be to make unreadable characters readable again if it finds a partial or good-enough match with what the character should be. I've added an example of what the text looks like in different cases.

Note:
The Lorem Ipsum text is not really important, because it's almost always generic; it's really about the document numbers and dates.
In the example image, A and B are not that damaged and quite readable. C and D are almost impossible to read, and if the text is damaged really badly, then a 0 could be a 0, an 8, a 6, a 5 or a 3, maybe even a 2. So it would be great if there was a way to compare the images (maybe pixel by pixel), layering them on top of each other to match the data. The output could be a true or false statement or, when there are multiple possible matches, a list of the possible options, like: most likely it's an 8 (78% match), but it could also be a 3 (11%) or a 0 (11%).
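To make the idea of layering images on top of each other concrete, here is a minimal sketch in plain Python. It assumes each character has already been cropped and binarized to the same small grid; the 5x3 `TEMPLATES`, the function names and the 0.9 threshold are all hypothetical placeholders, not taken from any library. Real templates would be cut from clean scans in the same font. The percentages are just each similarity score divided by the sum of all scores, so they are a relative ranking rather than calibrated probabilities.

```python
# Hypothetical 5x3 reference glyphs ('#' = ink, '.' = paper).
# In practice these would be cropped from good-quality scans.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "3": ["###", "..#", "###", "..#", "###"],
    "8": ["###", "#.#", "###", "#.#", "###"],
}


def pixel_agreement(glyph, template):
    """Return the fraction of pixels that are identical in both images."""
    same = 0
    total = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += g == t
    return same / total


def rank_candidates(glyph, templates=TEMPLATES):
    """Return (label, share) pairs, best match first; shares sum to 1."""
    scores = {label: pixel_agreement(glyph, tpl) for label, tpl in templates.items()}
    total = sum(scores.values())
    if total == 0:
        # Nothing overlaps at all, so there is no meaningful ranking.
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(label, score / total) for label, score in ranked]


def is_match(glyph, template, threshold=0.9):
    """True/false answer: do the two images agree on enough pixels?"""
    return pixel_agreement(glyph, template) >= threshold


# An "8" with one pixel lost in the middle row.
damaged = ["###", "#.#", "##.", "#.#", "###"]

for label, share in rank_candidates(damaged):
    print(f"{label}: {share:.0%}")
```

With only three tiny templates the shares end up close together; a larger grid and a stricter similarity measure would likely separate the candidates better.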

So I honestly don't know if this is even possible. I've read some things about OCR and anomaly detection, but I'm not sure how I could implement these techniques to reach my goal.

Could somebody shed some light on this?

Thank you in advance :)

This is the image:
[Image: example scans A, B, C and D]
#2
With pytesseract, which is a wrapper around Google’s Tesseract-OCR Engine (https://github.com/tesseract-ocr/tesseract),
you can include a config file that can be tweaked to try to get a better result. I'm not an expert, so I can only point you in the direction of the application; see: https://pypi.org/project/pytesseract/
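As a rough illustration of what tweaking the config might look like, the sketch below passes options as a config string instead of a file. It assumes Tesseract, pytesseract and Pillow are installed; `build_config` and `read_digits` are hypothetical helper names. `--psm 7` tells Tesseract to treat the image as a single line of text, and `tessedit_char_whitelist` restricts the output to the listed characters, although how well the whitelist is honoured may depend on the Tesseract version. The per-word confidence comes from `image_to_data`, which gets part of the way toward the "78% match" style of output asked for above.

```python
def build_config(psm=7, whitelist="0123456789"):
    """Build a Tesseract config string for short numeric fields."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def read_digits(image_path, config=None):
    """Return (text, confidence) pairs for each word Tesseract finds."""
    # Imported here so build_config() can be used without Tesseract installed.
    from PIL import Image
    import pytesseract

    if config is None:
        config = build_config()

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config=config,
        output_type=pytesseract.Output.DICT,
    )

    results = []
    for text, conf in zip(data["text"], data["conf"]):
        # Layout-only entries have empty text and a confidence of -1.
        if text.strip() and float(conf) >= 0:
            results.append((text, float(conf)))
    return results
```

Calling `read_digits("scan.png")` on a crop of just the document number (the file name is a placeholder) would return the recognized text with a confidence from 0 to 100 per word. Cropping to the number or date field first usually matters more than any single config setting.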