Python Forum
Unicode problem - Printable Version




Unicode problem - Hobson - Feb-10-2020

I need to update one system with data from another. The source file uses utf-8 encoding and the destination system requires iso-8859-1 encoding. That seemed simple enough: I open the files I am reading with utf-8 encoding and the files I am writing to with iso-8859-1 encoding. This worked fine until recently, when I started getting Unicode errors when writing. The offending string was 'Birleşik Krallık'.

It seems that iso-8859-1 cannot represent all of the characters that utf-8 can. This cannot be a new issue and I am sure there must be a solution out there. Is there some way I can make my program bulletproof? For example, is there a table somewhere that can convert characters that iso-8859-1 cannot represent into ones it can?

I am new to Unicode so a simple reply would be much appreciated. Many thanks.
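
To see exactly which characters cause the failure, a small check like the one below can help. This is only a sketch: the helper name `unencodable_chars` is made up for illustration and is not part of any library.

```python
def unencodable_chars(text, encoding="iso-8859-1"):
    """Return the distinct characters in text that the target encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record each offending character once, in order of first appearance.
            if ch not in bad:
                bad.append(ch)
    return bad


print(unencodable_chars("Birleşik Krallık"))  # ['ş', 'ı']
```

Both 'ş' and the dotless 'ı' fall outside iso-8859-1, which is why the write fails for this string.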


RE: Unicode problem - snippsat - Feb-10-2020

You can try Unidecode (a third-party package, installed with pip).
If encoding raises UnicodeEncodeError, run the string through unidecode first and then encode to iso-8859-1.
from unidecode import unidecode

s = 'hi my name is Birleşik Krallık'
#s = 'hello world'
try:
    text = s.encode('iso-8859-1')
except UnicodeEncodeError:
    text = unidecode(s).encode('iso-8859-1')

print(text)
Output:
b'hi my name is Birlesik Krallik'



RE: Unicode problem - Hobson - Feb-10-2020

Thank you. I am trapping Unicode errors when writing the records so the program does not crash. For various reasons I would prefer not to install anything that does not come as standard. Is there another way?

If not then am I correct in thinking that if the UnicodeEncodeError occurs in your code I would need a second line of the type:
text = text.decode()
to convert text from bytes to a string?


RE: Unicode problem - snippsat - Feb-10-2020

(Feb-10-2020, 12:22 PM)Hobson Wrote: For various reasons I would prefer not to install anything that does not come as standard
Nowadays that should not be much of an obstacle, as pip comes with Python and works in all kinds of environments.

There is also an errors parameter you can use. It will work, but some characters will be lost.
The options include ignore and replace.
>>> s = 'Birleşik Krallık'
>>> s.encode('iso-8859-1', errors='ignore')
b'Birleik Krallk'
>>> 
>>> s.encode('iso-8859-1', errors='replace')
b'Birle?ik Krall?k'
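
Besides ignore and replace, the standard library has a few other error handlers that keep the information instead of dropping it. A short sketch follows; whether the destination system tolerates escapes like these is an assumption you would need to check.

```python
s = 'Birleşik Krallık'

# Numeric XML/HTML character references; reversible if the reader understands them.
xml_refs = s.encode('iso-8859-1', errors='xmlcharrefreplace')

# Python-style \uXXXX escapes; mostly useful for logs and debugging.
escaped = s.encode('iso-8859-1', errors='backslashreplace')

print(xml_refs)  # b'Birle&#351;ik Krall&#305;k'
print(escaped)   # b'Birle\\u015fik Krall\\u0131k'
```

The same errors argument can be passed to open() when writing, so the handling applies to every write on that file.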
Quote:If not then am I correct in thinking that if the UnicodeEncodeError occurs in your code I would need a second line of the type:
text = text.decode()
to convert text from bytes to a string?
Yes. You only run into trouble because you are encoding to iso-8859-1, and the result of encoding is always bytes.
To get back to a string you must decode.
>>> s = 'hello'
>>> s = s.encode() #Same as encode('utf-8')
>>> s
b'hello'
>>> s.decode() #Same as decode('utf-8') 
'hello'

>>> s = 'hello'
>>> s = s.encode('iso-8859-1') # Use an encoding other than utf-8
>>> s
b'hello'
>>> s.decode('iso-8859-1') #Or just decode() would work in this case 
'hello'
You only see the difference when the string contains a non-ASCII character.
>>> s = 'helloø'
>>> s = s.encode() 
>>> s
b'hello\xc3\xb8'
>>> 
>>> s = 'helloø'
>>> s = s.encode('iso-8859-1') 
>>> s
b'hello\xf8'

>>> s.decode() # Decoding as utf-8 now fails
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 5: invalid start byte

>>> s.decode('iso-8859-1') # Same encoding and it works
'helloø'
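
For a standard-library-only approximation of what Unidecode does, one option is a custom error handler registered with codecs.register_error. The sketch below is an assumption-laden illustration, not a full replacement: the handler name "translit" and the EXTRA_MAP table are invented here, and the table only covers a handful of characters. It strips accents using unicodedata and falls back to '?' for anything it cannot reduce to ASCII.

```python
import codecs
import unicodedata

# Hypothetical hand-made table for characters that have no Unicode
# decomposition; extend it with whatever your own data needs.
EXTRA_MAP = {"ı": "i", "ł": "l", "Ł": "L", "‘": "'", "’": "'", "“": '"', "”": '"'}


def _translit(exc):
    """Error handler: replace unencodable characters with an ASCII lookalike."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = []
    for ch in exc.object[exc.start:exc.end]:
        if ch in EXTRA_MAP:
            out.append(EXTRA_MAP[ch])
            continue
        # Split the character into base letter + combining marks, drop the marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Only accept plain ASCII results; otherwise fall back to '?'.
        out.append(stripped if stripped and stripped.isascii() else "?")
    # Return the replacement text and the position at which encoding resumes.
    return "".join(out), exc.end


codecs.register_error("translit", _translit)

print("Birleşik Krallık".encode("iso-8859-1", errors="translit"))
# b'Birlesik Krallik'
```

Because the handler is only called for characters the codec rejects, text that iso-8859-1 can already represent (such as 'ø') is left untouched. Once registered, the same name can be used when opening the output file, for example open(path, 'w', encoding='iso-8859-1', errors='translit').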



RE: Unicode problem - Hobson - Feb-10-2020

Brilliant. Thank you. Very helpful.