Creating a code to make unreadable pdf characters readable

Tegendraads · (This post was last modified: Feb-03-2020, 09:11 PM by Tegendraads.)

Hi There!

I just got into machine learning and data science and was wondering whether it would be possible to write a program for the following:

In my daily work I have to sift through thousands of old documents that were scanned at some point back in the early 2000s.
A lot of these documents are missing crucial data, simply because the scanning process was sloppy and the quality was poor.
I would like to write a program that recognizes the damaged text (mostly numbers, practically all in the same font) and compares it against a known set of good-quality characters. The purpose would be to make unreadable characters readable again whenever the program finds a partial or good-enough match with what the character should be. I've added an example of what the text looks like in different cases.

Note:
The Lorem Ipsum text is not really important, because it's almost always generic; what matters are the document numbers and dates.
In the example image, A and B are not that damaged and are quite readable. C and D are almost impossible to read, and when the text is damaged really badly, a 0 could be a 0, an 8, a 6, a 5 or a 3, maybe even a 2. So it would be great if there was a way to compare the images (maybe pixel by pixel), layering them on top of each other to find a match. The output could be a true or false statement or, when there are multiple possible matches, a ranked list of options, for example: most likely it's an 8 (78% match), but it could also be a 3 (11%) or a 0 (11%).
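To make the overlay idea concrete, here is a minimal sketch of what such a comparison might look like. Everything in it is an assumption for illustration: the tiny 5x3 glyphs in `TEMPLATES`, the function names, and the 0.9 threshold are hypothetical, and real templates would have to be cropped from clean scans in the actual font. The percentages are just each candidate's share of the total agreement, not true probabilities.

```python
# Hypothetical 5x3 reference glyphs ('#' = ink, '.' = background).
# Real templates would be cropped from good-quality scans of the same font.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "3": ["###", "..#", "###", "..#", "###"],
    "8": ["###", "#.#", "###", "#.#", "###"],
}


def overlap_score(glyph, template):
    """Fraction of pixels that agree when the two images are overlaid."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            if g == t:
                same += 1
    return same / total if total else 0.0


def rank_candidates(glyph, templates=TEMPLATES):
    """Return (character, share) pairs, best match first.

    The share is each raw score divided by the sum of all scores,
    so the shares add up to 1. It is a rough ranking, not a probability.
    """
    scores = {ch: overlap_score(glyph, tpl) for ch, tpl in templates.items()}
    total = sum(scores.values())
    if total == 0:
        return [(ch, 0.0) for ch in scores]
    ranked = [(ch, score / total) for ch, score in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def is_match(glyph, character, templates=TEMPLATES, threshold=0.9):
    """True/false answer: does the glyph agree with one template closely enough?"""
    return overlap_score(glyph, templates[character]) >= threshold


if __name__ == "__main__":
    # An "8" whose bottom-middle pixel was lost in scanning.
    damaged = ["###", "#.#", "###", "#.#", "#.#"]
    for ch, share in rank_candidates(damaged):
        print(f"{ch}: {share:.0%}")
    print("matches 8:", is_match(damaged, "8"))
```

On real scans the crops would first need to be binarized, scaled to a common size and aligned before a pixel-by-pixel comparison like this means anything.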

I honestly don't know if this is even possible. I've read some things about OCR and anomaly detection, but I'm not sure how I could apply these techniques to reach my goal.

Could somebody shed some light on this?

Thank you in advance :)

This is the image
[Image: view?usp=sharing]

**Larz60+** · Feb-03-2020, 10:08 PM

With pytesseract, which is a wrapper around Google's Tesseract-OCR engine (https://github.com/tesseract-ocr/tesseract), you can pass configuration options that can be tweaked to try to get a better result. I'm not an expert, so I can only point you in the direction of the application; see https://pypi.org/project/pytesseract/
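
As a rough sketch of what that tweaking might look like for document numbers and dates: the helper names `build_config` and `read_number` are hypothetical, and the settings are only a starting guess. `--psm 7` asks Tesseract to treat the image as a single line of text, and `tessedit_char_whitelist` restricts which characters it may output; how well the whitelist is honoured can vary between Tesseract versions. The example assumes pytesseract, Pillow and the Tesseract binary are installed, and that the image has already been cropped to the number of interest.

```python
def build_config(psm=7, whitelist="0123456789-/."):
    """Build a Tesseract option string for short numeric fields.

    psm=7 treats the image as a single text line; the whitelist limits
    the output to digits and a few separators (an assumption about what
    the document numbers and dates contain).
    """
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def read_number(image_path, config=None):
    """Run OCR on one cropped image and return the recognized text."""
    # Imported here so the config helper works even without OCR installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, config=config or build_config())
    return text.strip()


if __name__ == "__main__":
    print(build_config())
    # Hypothetical file name; replace with a real cropped scan.
    # print(read_number("document_number.png"))
```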
