Fairly well explained here. In Python 2, file.read() returns the file's bytes as a string of single-byte characters that may have no real meaning on their own. If you use characters that aren't in the ASCII set (ASCII codes go up to 127, which excludes accented characters), you have to use the 'unicode' type, which behaves like a string but can contain non-ASCII characters. To go from the string of single bytes to unicode, you decode it:
# read in the file contents
iso=open('iso-8859-15.txt').read()
utf=open('utf-8.txt').read()

# this is how they look, one <str> character for each byte in the source file
print 'ISO:', repr(iso)
print 'UTF:', repr(utf)

# transform them to unicode, specifying the appropriate encoding
unicodeISO=unicode(iso,encoding='iso-8859-15')
unicodeUTF=unicode(utf,encoding='UTF-8')

# Now, as unicode strings, they are identical
print repr(unicodeISO),unicodeISO
print repr(unicodeUTF),unicodeUTF

(the two data files attached)
You may wonder why the Unicode string looks like the ISO one. It's an optical illusion. Of course the people who defined Unicode didn't completely reinvent the wheel, and integrated as many existing encodings as feasible. So the numbers that encode characters in ISO-8859 and Unicode can be the same. However, the first is a single byte 0xe9 and the second is really the Unicode code point U+00E9.
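To make the byte versus code point distinction concrete, here is a small sketch. It doesn't use the attached files: the byte strings are typed in by hand and are only assumed to match what an "é" looks like on disk in each encoding. It is written to run on both Python 2.7 and Python 3.

```python
# The same character, e-acute, as it would be stored in two encodings.
iso_bytes = b'\xe9'       # ISO-8859-15: a single byte
utf_bytes = b'\xc3\xa9'   # UTF-8: two bytes

# Decoding turns both into the same one-character unicode string.
from_iso = iso_bytes.decode('iso-8859-15')
from_utf = utf_bytes.decode('utf-8')

# The code point is a number that is independent of any byte encoding.
code_point = 'U+%04X' % ord(from_iso)

print('bytes on disk: %d vs %d' % (len(iso_bytes), len(utf_bytes)))
print('same unicode string: %s' % (from_iso == from_utf))
print('code point: %s' % code_point)
```

The number 0xe9 shows up both as the ISO byte and as the code point, which is where the illusion comes from. The UTF-8 bytes look nothing like it.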
Needless to say, this means that you have to know in advance the encoding used to encode the files... On the other hand, there aren't that many encodings for Romance languages, so it will likely be either UTF-8 or some variant of ISO-8859.
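If you really can't know the encoding in advance, one common heuristic (a suggestion, not a guarantee) is to try UTF-8 first and fall back to an ISO-8859 variant when decoding fails. The helper name decode_guess and its fallback default are made up for this sketch. It runs on both Python 2.7 and Python 3.

```python
def decode_guess(raw, fallback='iso-8859-15'):
    """Hypothetical helper: try UTF-8 first, then a single-byte fallback."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Accented ISO-8859 text is almost never valid UTF-8, so a failure
        # here is a reasonable hint that the file uses a single-byte encoding.
        return raw.decode(fallback)

# Hand-typed sample bytes for "café" in each encoding (not the attached files).
text_a = decode_guess(b'caf\xc3\xa9')   # valid UTF-8
text_b = decode_guess(b'caf\xe9')       # invalid UTF-8, falls back

print(repr(text_a))
print(repr(text_b))
```

This can't tell the ISO-8859 variants apart, since every byte is valid in ISO-8859-15, so the fallback still has to be the right guess for your files.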
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net