Python Forum

Unidecode issue
Hi,
In some PDFs I encounter references to the original parish register, like so: ref = ' RP 477; p. 148 r° '
I run unidecode on every string in the document: fieldUni = unidecode.unidecode(field).upper()

This has never caused any problems, except in the above case, where I get this: ' RP 477; P. 148 RDEG '

The "°" has been "translated" into DEG. That is not what is meant here.

How do I avoid this translation in Python (other than a manual Ctrl-H replace of '°' with ... etc. in the text document)?
thx,
Paul
(Sep-02-2023, 06:42 AM)DPaul Wrote: How do I avoid this translation in Python (other than a manual Ctrl-H replace of '°' with ... etc. in the text document)?
Which translation do you want instead of replacing '°' with 'deg'?
(Sep-02-2023, 08:45 AM)Gribouillis Wrote: Which translation do you want instead
Fair question.
Let me do some research, because I have to find out whether the 'degrees' symbol
was meant to be there and has some genealogical meaning,
or whether it is a faulty conversion of something earlier, if the original text was e.g. in Access or Lotus 1-2-3.
Paul
(Sep-02-2023, 08:45 AM)Gribouillis Wrote: Which translation do you want instead
OK, there is a hidden meaning, known only to genealogists I suppose.
148 is the folio number.
r° is recto, and
v° is verso.
So recto and verso would be the right translations.
I have checked the document, and indeed, some records are r°, others v°.
Paul
Use re.sub() for example
>>> import re
>>> dic = {'r°': 'recto', 'v°': 'verso'}
>>> def repl(match):
...     return dic[match.group(0)]
... 
>>> s = ' RP 477; p. 148 r° '
>>> 
>>> re.sub('[rv]°', repl, s)
' RP 477; p. 148 recto '
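Since unidecode is what turns '°' into 'deg', the substitution has to run before the unidecode call, not after it. Below is a minimal sketch of that ordering, not a definitive implementation. The names FOLIO_MARKS, expand_folio_marks and normalize are made up for illustration. Two assumptions go beyond the thread: the pattern is anchored with \b so it only fires on a standalone r or v, and it also accepts the masculine ordinal indicator 'º' (U+00BA), which looks nearly identical to '°' and may turn up in PDF text. The transliteration step is passed in as a parameter so the sketch runs without unidecode installed.

```python
import re

# Mapping from the thread: r° = recto, v° = verso (hypothetical name).
FOLIO_MARKS = {"r": "recto", "v": "verso"}

# \b stops the pattern from matching the tail of a longer word.
# The character class accepts the degree sign (U+00B0) and, as an
# assumption, the look-alike masculine ordinal indicator (U+00BA).
FOLIO_RE = re.compile(r"\b([rv])" + "[\u00b0\u00ba]", re.IGNORECASE)


def expand_folio_marks(text):
    """Replace r°/v° with recto/verso; leave every other '°' alone."""
    return FOLIO_RE.sub(lambda m: FOLIO_MARKS[m.group(1).lower()], text)


def normalize(field, transliterate=None):
    """Expand folio marks first, then transliterate (if given), then upper-case."""
    expanded = expand_folio_marks(field)
    if transliterate is not None:
        expanded = transliterate(expanded)
    return expanded.upper()


ref = " RP 477; p. 148 r° "
print(repr(normalize(ref)))
```

With unidecode installed, the call from the original post would become normalize(field, transliterate=unidecode.unidecode). Any '°' that does not follow a standalone r or v is left untouched by the substitution and still goes through unidecode as before.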
(Sep-03-2023, 06:20 PM)Gribouillis Wrote: Use re.sub() for example
I thought I had to fiddle around with unidecode parameters,
but this is nice and concise.
Thanks again,
Paul