More OCR

DPaul · May-14-2023, 06:54 AM

Hi,
It happens that some prayer cards are in poor condition or have poor print quality.
Amongst the zillions processed, they do not always stand out.

When examining the OCR text result, a word is sometimes
returned as "gibberish", but you cannot explain to Python what is gibberish and what is not.
(Some people had "strange" names or lived in "strange" places)

Except when a returned word has 5 or more identical letters in a row, like: "ENTNTENSNNNMNNNINNNSNINNE".
In some (European) languages 4 identical letters in a row are possible, but I don't think 5. Even with 4, I only know of one example.
Hence my question: how could I efficiently discover that a word has 5 identical letters (A-Z in capitals) in a row?

Yes, I can do: lstLetters = ['AAAAA', 'BBBBB', ...], then for letters in lstLetters: if letters in word: ... etc.
Something faster and cleverer, maybe?
thx,
Paul

**Gribouillis** · May-15-2023, 06:29 AM

(May-14-2023, 06:54 AM)DPaul Wrote: how could I efficiently discover that a word has 5 identical letters (A-Z in capitals) in a row.

The problem is that we don't know what "efficiently" means. You can do it with the re module, for example:

>>> import re
>>> word = "ENTNTENSNNNMNNNINNNNNSNINNE"
>>> re.search(r'([A-Z])\1{4}', word)
<re.Match object; span=(16, 21), match='NNNNN'>
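
For comparison, here is a minimal non-regex sketch of the same check using itertools.groupby from the standard library. The function name longest_run is hypothetical and not from the thread. Unlike the pattern above, it does not restrict itself to A-Z, and no claim is made that it is faster than the regex.

```python
from itertools import groupby

def longest_run(word):
    """Return the length of the longest run of identical characters (hypothetical helper)."""
    # groupby splits the word into consecutive groups of equal characters;
    # the longest group is the longest run. default=0 covers the empty string.
    return max((sum(1 for _ in group) for _, group in groupby(word)), default=0)

word = "ENTNTENSNNNMNNNINNNNNSNINNE"
print(longest_run(word) >= 5)
```

A word would then be flagged when longest_run(word) >= 5.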

DPaul · May-15-2023, 07:27 AM

Hi Gribouillis,
I have applied your suggestion, with these observations:
- I changed re.search into x = re.findall ...
- Then I could test on x not empty.
I ran the code on 135_281 text lines, it examined 22_752_731 words and found 23_303 matches.
Total time : 13 seconds.
I call that efficient. Thanks.
What I don't understand: (r'([A-Z])\1{4}', word)
Does {4} mean : find one and then 4 more, totalling five ?
thx,
Paul

**Gribouillis** · (This post was last modified: May-15-2023, 07:37 AM by Gribouillis.)

(May-15-2023, 07:27 AM)DPaul Wrote: - Then I could test on x not empty.

If there is no match, re.search will return None, so you can test directly

if re.search(...):
    ...
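
A minimal sketch of that test applied to a list of words, assuming the pattern is compiled once up front. The names FIVE_IN_A_ROW and looks_garbled and the sample words are illustrative, not from the thread.

```python
import re

# Same pattern as in the post above, compiled once (the constant name is hypothetical).
FIVE_IN_A_ROW = re.compile(r'([A-Z])\1{4}')

def looks_garbled(word):
    """Return True if word contains 5 identical capital letters in a row."""
    # search() returns a Match object on success and None otherwise.
    return FIVE_IN_A_ROW.search(word) is not None

# Made-up sample words for illustration.
words = ["ENTNTENSNNNMNNNINNNNNSNINNE", "ANTWERPEN", "AAAA"]
suspects = [w for w in words if looks_garbled(w)]
print(suspects)
```

Note that because the pattern contains one group, re.findall returns a list of the captured letters (for example ['N']) rather than the full runs, so it builds a list where search only needs to stop at the first hit.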

(May-15-2023, 07:27 AM)DPaul Wrote: Does {4} mean : find one and then 4 more, totalling five ?

Exactly, see the regular expression syntax. The ([A-Z]) finds one uppercase letter and makes this the first group (because of the parentheses), then \1{4} matches if there are 4 repetitions of the group number 1.
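
A small sketch of how the group and the backreference behave; the test strings are made up for illustration.

```python
import re

pattern = re.compile(r'([A-Z])\1{4}')

# One letter captured by the group, followed by 4 copies of it: 5 in total.
m = pattern.search("XNNNNNY")
print(m.group(0))  # the whole matched run
print(m.group(1))  # the single letter captured by group 1

# Only 4 identical letters: the group takes one, leaving 3 repeats, so no match.
print(pattern.search("XNNNNY"))

# \1 repeats the text that was captured, not the [A-Z] pattern,
# so 5 different capitals in a row do not match either.
print(pattern.search("ABCDE"))
```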

DPaul · May-15-2023, 07:48 AM

Ok, tested on "None" and the timing is 1 second less !
(I have an O'Reilly pocket book on regex, but it did not explain the {x} syntax)
thx,
Paul

DPaul · May-15-2023, 07:53 AM

By the way:
Did anybody ever write a wizard for regex?
Seems to me that would come in handy ...

Paul

**Gribouillis** · (This post was last modified: May-15-2023, 08:49 AM by Gribouillis.)

(May-15-2023, 07:53 AM)DPaul Wrote: Did anybody ever write a wizard for regex.

There are online regex debuggers such as https://regex101.com. For offline use, you can try the good old Kodos, which dates back to Python 1.5 and has been rewritten for Python 3.

The re module still moves forward; Python 3.11 adds more syntax that these debuggers may not yet implement.
