Python Forum
More OCR
#1
Hi,
It happens that some prayer cards are in poor condition or have poor print quality.
Amongst the zillions processed, they do not always stand out.

When examining the OCR text result, a word is sometimes
returned as "gibberish", but you cannot explain to Python what is gibberish and what is not.
(Some people had "strange" names or lived in "strange" places)

Except: when a returned word has 5 or more identical letters in a row. Like: "ENTNTENSNNNMNNNINNNSNINNE".
In some (European) languages 4 identical letters in a row are possible, but I don't think 5. Even with 4, I only know of one example.
Hence my question: how could I efficiently discover that a word has 5 identical letters (A-Z in capitals) in a row?

Yes, I can do: lstLetters = ['AAAAA', 'BBBBB', ...], then for letters in lstLetters: ... if letters in word: ... etc.
Something faster and cleverer, maybe?
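For reference, the brute-force idea described above could be written out roughly as follows. This is only a sketch: the names RUNS and has_run_bruteforce are made up for illustration, and the test word is the one used later in the thread.

```python
import string

# One five-letter run per capital letter: 'AAAAA', 'BBBBB', ..., 'ZZZZZ'
RUNS = [letter * 5 for letter in string.ascii_uppercase]


def has_run_bruteforce(word):
    """Return True if `word` contains any of the 26 five-letter runs."""
    return any(run in word for run in RUNS)


print(has_run_bruteforce("ENTNTENSNNNMNNNINNNNNSNINNE"))  # True
print(has_run_bruteforce("AMSTERDAM"))                    # False
```

It does up to 26 substring searches per word, which is what the question hopes to avoid.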
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#2
(May-14-2023, 06:54 AM)DPaul Wrote: how could I efficiently discover that a word has 5 identical letters (A-Z in capitals) in a row.
The problem is that we don't know what "efficiently" means. You can do it with the re module, for example:
>>> import re
>>> word = "ENTNTENSNNNMNNNINNNNNSNINNE"
>>> re.search(r'([A-Z])\1{4}', word)
<re.Match object; span=(16, 21), match='NNNNN'>
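When many words have to be checked, the same pattern can be compiled once and reused. The sketch below assumes that words are separated by whitespace; the names FIVE_IN_A_ROW and suspicious_words and the sample line are hypothetical, not taken from the original data.

```python
import re

# Compile once and reuse; the pattern is the one shown above.
FIVE_IN_A_ROW = re.compile(r'([A-Z])\1{4}')


def suspicious_words(line):
    """Return the words in `line` that contain 5 identical capitals in a row."""
    # Assumption: words are separated by whitespace.
    return [w for w in line.split() if FIVE_IN_A_ROW.search(w)]


line = "JAN ENTNTENSNNNMNNNINNNNNSNINNE BRUGGE"
print(suspicious_words(line))  # ['ENTNTENSNNNMNNNINNNNNSNINNE']
```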
#3
Hi Gribouilis,
I have applied your suggestion, with these observations:
- I changed re.search into x = re.findall ...
- Then I could test on x not empty.
I ran the code on 135_281 text lines, it examined 22_752_731 words and found 23_303 matches.
Total time : 13 seconds.
I call that efficient. Thanks.
What I don't understand: (r'([A-Z])\1{4}', word)
Does {4} mean : find one and then 4 more, totalling five ?
thx,
Paul
#4
(May-15-2023, 07:27 AM)DPaul Wrote: - Then I could test on x not empty.
If there is no match, re.search returns None, so you can test the result directly:
if re.search(...):
    ...
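To make the difference concrete, here is a small comparison of the two calls on made-up test strings. With a pattern that contains a single group, re.findall returns the text of that group for every match, while re.search stops at the first match and returns a Match object or None.

```python
import re

pattern = r'([A-Z])\1{4}'

# findall returns the text captured by group 1 for each match, not the whole run
print(re.findall(pattern, "XNNNNNY"))   # ['N']
print(re.findall(pattern, "BRUGGE"))    # []

# search stops at the first match and returns a Match object, or None
m = re.search(pattern, "XNNNNNY")
print(m.group(0))                       # NNNNN
print(re.search(pattern, "BRUGGE"))     # None
```

Both results work as a truth test, but re.search does not have to scan the rest of the word or build a list.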
(May-15-2023, 07:27 AM)DPaul Wrote: Does {4} mean : find one and then 4 more, totalling five ?
Exactly, see the regular expression syntax. The ([A-Z]) matches one uppercase letter and captures it as group 1 (because of the parentheses). \1 is a backreference to whatever group 1 captured, and {4} requires that backreference to match 4 times in a row, which gives 5 identical letters in total.
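A few made-up test strings illustrate the "one plus four more" reading of the pattern:

```python
import re

pattern = re.compile(r'([A-Z])\1{4}')

print(bool(pattern.search("NNNN")))    # False: one letter plus only 3 repeats
print(bool(pattern.search("NNNNN")))   # True: one letter plus 4 repeats
print(bool(pattern.search("ABCDE")))   # False: five capitals, but not identical

# group(1) is the single captured letter, group(0) the whole five-letter run
m = pattern.search("XXNNNNNNN")
print(m.group(1), m.group(0), m.span())  # N NNNNN (2, 7)
```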
#5
OK, I now test for None and the timing is 1 second less!
(I have an O'Reilly pocket book on regex, but it did not explain the {x} syntax.)
thx,
Paul
#6
By the way:
Did anybody ever write a wizard for regex?
Seems to me that would come in handy ... ;)
Paul
#7
(May-15-2023, 07:53 AM)DPaul Wrote: Did anybody ever write a wizard for regex.
There are online regex debuggers such as https://regex101.com. For offline use, you can try the good old kodos, which dates back to Python 1.5 and has been rewritten for Python 3.

The re module is still evolving: Python 3.11 added new syntax (atomic groups and possessive quantifiers) that these debuggers may not yet implement.