Feb-03-2020, 09:10 PM
(This post was last modified: Feb-03-2020, 09:11 PM by Tegendraads.)
Hi There!
I just got into machine learning and data science and was wondering whether it would be possible to write a program for the following:
In my daily work I have to sift through thousands of old documents that were scanned at some point back in the early 2000s.
A lot of these documents are missing crucial data, simply because the scanning process was sloppy and the scan quality is poor.
I would like to write code that recognizes the damaged text (mostly numbers, practically all in the same font) and compares it against a known set of good-quality characters. The purpose would be to make unreadable characters readable again whenever the code finds a partial or good-enough match with what the character should be. I've added an example of what the text looks like in different cases.
Note:
The Lorem Ipsum text is not really important, because it's almost always generic. It's really about the document numbers and dates.
In the example image, A and B are not that damaged and are quite readable. C and D are almost impossible to read, and if the text is damaged really badly, a 0 could be a 0, an 8, a 6, a 5, or a 3, maybe even a 2. It would be great if there were a way to compare the images (maybe pixel by pixel), layering them on top of each other to find a match, and to output the result as a true or false statement. When there are multiple possible matches, it should output the possible options with their scores, for example: most likely it's an 8 (78% match), but it could also be a 3 (11%) or a 0 (11%).
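A minimal sketch of the pixel-comparison idea described above, in plain Python. Everything in it is an assumption for illustration: the tiny 3x5 glyphs in `TEMPLATES`, the helper names `pixel_agreement` and `rank_candidates`, and the scoring rule (fraction of identical pixels, then normalized so the shares add up to 100%). The shares are relative similarities, not calibrated probabilities, and real scans would first need to be binarized, cropped, and aligned to the template size.

```python
# Hypothetical 3x5 reference glyphs ('#' = ink, '.' = paper). Real templates
# would be cropped from clean scans of the actual document font.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "3": ["###", "..#", "###", "..#", "###"],
    "8": ["###", "#.#", "###", "#.#", "###"],
}


def pixel_agreement(a, b):
    """Fraction of pixels that are identical in two equally sized glyphs."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def rank_candidates(glyph, templates=TEMPLATES):
    """Return (label, share) pairs, best match first; shares sum to 1."""
    scores = {label: pixel_agreement(glyph, t) for label, t in templates.items()}
    norm = sum(scores.values()) or 1.0  # avoid dividing by zero
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, score / norm) for label, score in ranked]


# An "8" whose bottom stroke lost its middle pixel during scanning.
damaged = ["###", "#.#", "###", "#.#", "#.#"]

for label, share in rank_candidates(damaged):
    print(f"{label}: {share:.0%}")
```

Because every template shares most of its pixels with every other one, the shares from raw pixel agreement sit close together. A sharper ranking would need a more discriminating score, or a classifier trained on examples of the damaged characters.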
I honestly don't know if this is even possible. I've read some things about OCR and anomaly detection, but I'm not sure how I could apply these techniques to reach my goal.
Could somebody shed some light on this?
Thank you in advance :)
This is the image