Python Forum
OCR question
#1
OCR with Tesseract does a very good job; we know that.
I use it to process various types of documents, some of which are just lists of people.
About 100 years ago, people started to use typewriters and did
not always replace the ribbon in time, or used carbon copies ("cc"), resulting in very faint text.
So Tesseract, if it can't decipher what's there, comes up with random sequences of letters, like:
"... GGZ|OSEPH|SSSSSSSSF|MFIAFIFIAFIFDE|ADRUARN|IFIIFIA|FFLF|WFFI|ZFFIJFIA ..."
The pipes are separators I put between detected words.
Can anybody think of a clever way to reject these words?
We're talking hundreds of thousands of lines, and some of them contain these "random" sequences.
One partial solution I thought of was to detect e.g. groups of 3 identical letters; that sometimes happens.
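A minimal sketch of that partial idea, assuming the words are already joined with pipes as in the sample above. The names `has_triple_letter` and `filter_words` are made up for illustration, not taken from any module:

```python
import re

# Matches any letter that is followed by two more copies of itself, e.g. "SSS".
TRIPLE = re.compile(r"([A-Za-z])\1\1")

def has_triple_letter(word):
    """Return True if the word contains 3 identical letters in a row."""
    return TRIPLE.search(word) is not None

def filter_words(line, sep="|"):
    """Split a line on the separator and drop words with a tripled letter."""
    return [w for w in line.split(sep) if w and not has_triple_letter(w)]

sample = "GGZ|OSEPH|SSSSSSSSF|MFIAFIFIAFIFDE|ADRUARN|IFIIFIA|FFLF|WFFI|ZFFIJFIA"
print(filter_words(sample))
```

On the sample line this only rejects "SSSSSSSSF"; the other sequences have no tripled letter, which is why it is only a partial solution.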
Is there any Python module that I've never heard of, maybe? An "anti-gibberish" module?
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.