Python Forum

Full Version: Ask help for utf-8 decode/encode
Pages: 1 2
Hello, I am new to this forum.

I am currently working on a Python program that manipulates a MySQL database.

The database is set to utf8_bin, and I use mysql-connector to connect and fetch records.
I am having trouble retrieving records written in CJK (Chinese/Japanese/Korean) languages. Here is one example I retrieved from the database:
'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
The correct record should be
'pantry购物'
which contains a Chinese phrase.

Interestingly, I have a separate PHP web portal that talks to the same database. Through the PHP program, the text above is retrieved correctly and displayed on the website.

I also tried decoding the retrieved record as UTF-8 with RECORD.decode('utf-8','ignore'), but the program gave me garbled text rather than the correct Chinese.

Can anyone here help me?

Thank you in advance.
As far as I know, you must deal with each language separately
Chinese: https://docs.python.org/3.4/library/code...20encoding
Japanese: https://docs.python.org/3.4/library/code...20encoding
Korean: https://docs.python.org/3.4/library/code...20encoding

(these may all lead to the same highlighted section of the page)
(Feb-06-2017, 04:27 AM)Larz60+ Wrote: [ -> ]As far as I know, you must deal with each language separately

Thank you so much for the quick reply!

I am not so sure, though: isn't CJK supposed to be unified under the UTF-8 framework? Am I wrong?
Yes, CJK is well defined in Unicode. Unicode is then encoded into UTF-8 where needed (octet environments). A raw interface into a database that stores 8-bit bytes is one such environment. Using Python 3 for this kind of thing is highly advised.
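As a small illustration of the point above, here is a minimal Python 3 sketch. It reuses the sample text from the first post; the variable names are only for illustration. Text is encoded to UTF-8 octets when it goes into an octet environment and decoded again on the way back:

```python
# Python 3 sketch: text (str) versus its UTF-8 octets (bytes).
text = "pantry购物"

# Encoding turns Unicode characters into UTF-8 bytes for storage or transport.
octets = text.encode("utf-8")
print(octets)

# Decoding turns the bytes read back from storage into text again.
round_trip = octets.decode("utf-8")
print(round_trip == text)
```

The ASCII part stays one byte per character, while each of the two Chinese characters becomes three bytes.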
(Feb-06-2017, 05:17 AM)Skaperen Wrote: [ -> ]Yes, CJK is well defined in Unicode. Unicode is then encoded into UTF-8 where needed (octet environments). A raw interface into a database that stores 8-bit bytes is one such environment. Using Python 3 for this kind of thing is highly advised.

Hi, Skaperen

Thank you.
Do you mean that using Python 3 could solve the problem? What is the difference between 2.7 and 3 when handling UTF-8 related issues?
(Feb-06-2017, 05:35 AM)forfan Wrote: [ -> ]Do you mean that using Python 3 could solve the problem? What is the difference between 2.7 and 3 when handling UTF-8 related issues?

In Python 2, there are strings (str: sequences of 8-bit bytes) and unicodes (unicode: sequences of Unicode code points, stored on 16 or 32 bits each depending on the build).

In Python 3, there are only strings (str), and every character is a full Unicode code point. And there are byte sequences (bytes) for when you read things in binary (reading files with "b" mode, getting stuff through a network, etc.).
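To make the distinction concrete, here is a minimal Python 3 sketch (the names are illustrative, not from the thread). It shows that str counts characters, bytes counts octets, and the two cannot be mixed implicitly:

```python
# Python 3 sketch: str holds characters, bytes holds raw octets.
chars = "购物"                  # str: 2 characters
raw = chars.encode("utf-8")    # bytes: 6 octets (3 per character here)

print(type(chars).__name__, len(chars))
print(type(raw).__name__, len(raw))

# Mixing the two types is an error in Python 3 rather than a silent conversion.
try:
    chars + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

In Python 2 the same concatenation would attempt an implicit ASCII decode, which is one reason encoding bugs are harder to spot there.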
(Feb-06-2017, 02:04 PM)Ofnuts Wrote: [ -> ]In Python 3, there are only strings [...] And there are byte sequences for when you read things in binary (reading files with "b" mode, getting stuff through a network, etc.).

Thank you. I will try Python 3 to see whether it works.
(Feb-06-2017, 01:57 AM)forfan Wrote: [ -> ]Here is one example I retrieved from the database:
'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
the correct record should be
'pantry购物'
It all depends on whether 'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9' is a representation (i.e., repr) or a string (str).

However, if you take a closer look, in UTF-8:
  • C3A8 encodes 'E8'
  • C2B4 encodes 'B4'
  • C2AD encodes 'AD'

And, also in UTF-8, 'E8B4AD' encodes '8D2D', which is coincidentally the Unicode code point for 购. So I would guess that you have one UTF-8 encoding too many here.

Note: when dealing with character encodings, it is very easy to get fooled by the terminal output. Checking lengths in bytes and characters usually resolves ambiguities, and using repr() instead of str() usually helps.
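The guess above can be checked with a short Python 3 sketch. It assumes the surplus layer came from treating the UTF-8 bytes as Windows-1252 (cp1252) text; that charset is my assumption, based on the \xe2\x80\xb0 sequence, which is U+2030, the character cp1252 places at byte 0x89. The variable names are illustrative.

```python
# Python 3 sketch: undo one surplus UTF-8 encoding layer.
# Assumption: the intermediate single-byte charset was cp1252.
stored = b'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'

# First decode: yields mojibake characters, one per original UTF-8 byte.
mojibake = stored.decode('utf-8')

# Map each mojibake character back to the byte it stood for, then decode
# those bytes as the UTF-8 they were all along.
repaired = mojibake.encode('cp1252').decode('utf-8')

# Compare lengths in bytes and characters, and print escaped forms so the
# terminal rendering cannot mislead.
print(len(stored), len(mojibake), len(repaired))
print(repr(repaired))
print(ascii(repaired))
```

Encoding with 'latin-1' instead of 'cp1252' would raise UnicodeEncodeError on this sample, because U+2030 is outside Latin-1.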
Separating sequences of characters (typically implemented with up to 32 bits per character) from sequences of bytes (8 bits per byte) is what is needed to make sure the encoding/decoding relationships are done right. Python 3 has this.
(Feb-06-2017, 02:28 PM)Ofnuts Wrote: [ -> ]All depends if 'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9' is a representation (ie, repr) or a string (str). [...]

Thank you very much.
I think the text retrieved from the database is a string (str).

I am not familiar with encoding/decoding issues. To put the question another way, suppose I have the string
'\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
and I want to decode it as UTF-8 so that the output shows "购物". How can I do that?

(Feb-07-2017, 01:32 AM)Skaperen Wrote: [ -> ]Separating sequences of characters (typically implemented with up to 32 bits per character) from sequences of bytes (8 bits per byte) is what is needed to make sure the encoding/decoding relationships are done right. Python 3 has this.

Thank you!

I will definitely give Python 3 a shot.