OCR question

DPaul · (This post was last modified: Mar-29-2024, 09:46 AM by DPaul.)

OCR with Tesseract does a very good job, we know that.
I use it to process various types of documents; some of them are just lists of people.
About 100 years ago, people started to use typewriters, and did
not always replace the ribbon in time, or used carbon copies ("cc"), resulting in very faint text.
So when Tesseract can't decipher what's there, it comes up with random sequences of letters, like:
"... GGZ|OSEPH|SSSSSSSSF|MFIAFIFIAFIFDE|ADRUARN|IFIIFIA|FFLF|WFFI|ZFFIJFIA ..."
The pipes are separators I put between detected words.
Can anybody think of a clever way to reject these words?
We're talking hundreds of thousands of lines, and some of them contain these "random" sequences.
One partial solution I thought of was to detect e.g. groups of 3 identical letters in a row; sometimes that happens.
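A minimal sketch of that repeated-letter check, assuming the words arrive pipe-separated as in the sample above. The names `TRIPLE` and `looks_like_gibberish` are made up for illustration, not from any existing module:

```python
import re

# Three or more of the same letter in a row, e.g. "SSS".
# The word is upper-cased before matching, so "Sss" counts too.
TRIPLE = re.compile(r"([A-Z])\1{2,}")


def looks_like_gibberish(word):
    """Return True if the word contains a run of 3+ identical letters."""
    return bool(TRIPLE.search(word.upper()))


line = "GGZ|OSEPH|SSSSSSSSF|MFIAFIFIAFIFDE|ADRUARN|IFIIFIA|FFLF|WFFI|ZFFIJFIA"

# Split on the pipe separator and drop words that trip the check.
words = [w for w in line.split("|") if w]
kept = [w for w in words if not looks_like_gibberish(w)]
rejected = [w for w in words if looks_like_gibberish(w)]

print("kept:    ", kept)
print("rejected:", rejected)
```

On the sample line this only rejects "SSSSSSSSF"; words like "MFIAFIFIAFIFDE" have no triple letter and slip through, which is why it is only a partial solution.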
Is there maybe a Python module that I've never heard of? An "anti-gibberish" module?
thx,
Paul
