Python Forum
Ask help for utf-8 decode/encode
#1
Hello, I am new to this forum.

I am currently working on a Python program that manipulates a MySQL database.

The database is set to utf-8-bin, and I use MySQL Connector to connect and fetch records.
I am having trouble retrieving records written in CJK (Chinese/Japanese/Korean) languages. Here is one example I retrieved from the database:
'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
The correct record should be
'pantry购物'
which is a Chinese phrase.

Interestingly, I have a PHP web portal that talks to the same database. The PHP program retrieves the above record correctly and displays it on the website.

I also tried decoding the retrieved record as UTF-8 with RECORD.decode('utf-8','ignore'), but the program gave me garbled text rather than the correct Chinese.

Can anyone here help me?

Thank you in advance.
#2
As far as I know, you must deal with each language separately
Chinese: https://docs.python.org/3.4/library/code...20encoding
Japanese: https://docs.python.org/3.4/library/code...20encoding
Korean: https://docs.python.org/3.4/library/code...20encoding

(These may all lead to the same highlighted section of the page.)
#3
(Feb-06-2017, 04:27 AM)Larz60+ Wrote: As far as I know, you must deal with each language separately

Thank you so much for the quick reply!

I am not sure, but isn't CJK supposed to be unified under Unicode and UTF-8? Am I wrong?
#4
yes, CJK is well defined in Unicode.   then Unicode is encoded into UTF-8 where needed (octet environments).  a raw interface into a database that stores 8 bit bytes is one such environment.  using Python 3 for this kind of thing is highly advised.
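
To illustrate the point, here is a minimal Python 3 sketch showing that one UTF-8 codec round-trips Chinese, Japanese, and Korean text alike, with no per-language handling. The sample words and the helper name utf8_round_trip are made up for this example; they are not from the database in question.

```python
# Sample words chosen for illustration only (not taken from the thread's database).
samples = {
    "Chinese": "购物",
    "Japanese": "買い物",
    "Korean": "쇼핑",
}


def utf8_round_trip(text):
    """Encode text to UTF-8 bytes, then decode the bytes back to text."""
    encoded = text.encode("utf-8")
    return encoded, encoded.decode("utf-8")


for language, word in samples.items():
    encoded, decoded = utf8_round_trip(word)
    # The same codec handles every language; only the byte values differ.
    print(language, repr(encoded), decoded == word)
```

Each line of output should end in True, since the decoded text matches the original for all three languages.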
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#5
(Feb-06-2017, 05:17 AM)Skaperen Wrote: yes, CJK is well defined in Unicode.   then Unicode is encoded into UTF-8 where needed (octet environments).  a raw interface into a database that stores 8 bit bytes is one such environment.  using Python 3 for this kind of thing is highly advised.

Hi, Skaperen

Thank you.
Do you mean that using Python 3 could solve the problem? What is the difference between 2.7 and 3 when handling UTF-8?
#6
(Feb-06-2017, 05:35 AM)forfan Wrote: Do you mean that using Python 3 could solve the problem? What is the difference between 2.7 and 3 when handling UTF-8?

In Python 2, there are strings (str, sequences of 8-bit bytes) and unicodes (unicode, sequences of Unicode code points, stored on 16 or 32 bits each depending on how Python was built).

In Python 3, there are only strings, and they always hold Unicode text (how many bits each character takes in memory is an implementation detail). And there are byte sequences (bytes) for when you read things in binary (reading files with "b" mode, getting stuff through a network, etc.).
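
A small sketch of that split in Python 3, using the expected record from the first post as sample text. The variable names are only for illustration.

```python
text = "pantry购物"           # str: a sequence of Unicode characters
data = text.encode("utf-8")   # bytes: a sequence of 8-bit values

# The same content has different lengths depending on the type.
print(type(text).__name__, len(text))   # str 8
print(type(data).__name__, len(data))   # bytes 12

# Python 3 refuses to mix the two types silently.
try:
    text + data
except TypeError as exc:
    print("cannot mix str and bytes:", exc)

# Going back from bytes to text always requires naming an encoding.
print(data.decode("utf-8") == text)     # True
```

Each Chinese character here takes three bytes in UTF-8, which is why the byte length is 12 while the character length is 8.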
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
#7
(Feb-06-2017, 02:04 PM)Ofnuts Wrote: In Python 3, there are only strings, and they always hold Unicode text. And there are byte sequences for when you read things in binary (reading files with "b" mode, getting stuff through a network, etc.).

Thank you. I will try Python 3 to see whether it works.
#8
(Feb-06-2017, 01:57 AM)forfan Wrote: Here is one example I retrieved from the database:
'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
the correct record should be
'pantry购物'

It all depends on whether 'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9' is a representation (i.e., repr) or a string (str).

However, if you take a closer look, in UTF-8:
  • C3A8 encodes 'E8'
  • C2B4 encodes 'B4'
  • C2AD encodes 'AD'

And, also in UTF-8, 'E8B4AD' encodes '8D2D', which happens to be the Unicode code point for 购. So I would guess that you have one UTF-8 encoding too many here.
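
The byte arithmetic above can be checked with a few lines of Python 3. This is only a sketch of the check, using the bytes that follow "pantry" in the record from the first post. Note that the fifth code point comes out as U+2030 rather than a value below 0x100, which suggests (my inference, not something stated in the thread) that the extra encoding step went through Windows-1252 rather than strict Latin-1.

```python
# The bytes that follow "pantry" in the record from the original post.
raw = b"\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9"

# Decoding once as UTF-8 gives six characters, not two.
once = raw.decode("utf-8")
code_points = [hex(ord(ch)) for ch in once]
print(code_points)  # ['0xe8', '0xb4', '0xad', '0xe7', '0x2030', '0xa9']

# The first three values, taken as bytes, are themselves valid UTF-8.
first_char = b"\xe8\xb4\xad".decode("utf-8")
print(first_char, hex(ord(first_char)))  # 购 0x8d2d
```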

Note: when dealing with character encodings, it is very easy to get fooled by the terminal output. Checking lengths in bytes and in characters usually resolves ambiguities, and using repr() instead of str() usually helps.
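
Here is a hedged Python 3 sketch of both ideas: comparing lengths at each stage, and undoing the extra encoding layer. The helper fix_double_utf8 is a hypothetical name, and the choice of cp1252 as the intermediate encoding is an assumption that happens to fit this particular record; other data may need a different codec, and cp1252 cannot represent a handful of byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D), so this is not a general-purpose repair.

```python
def fix_double_utf8(raw, middle="cp1252"):
    """Undo one extra UTF-8 encoding layer (hypothetical helper).

    Assumes the text was encoded as UTF-8, misread as `middle`,
    and then encoded as UTF-8 a second time.
    """
    return raw.decode("utf-8").encode(middle).decode("utf-8")


# The full record from the original post, as bytes.
record = b"pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9"
fixed = fix_double_utf8(record)

# Lengths at each stage show where the extra layer is.
print(len(record))                  # 19 bytes as stored
print(len(record.decode("utf-8")))  # 12 characters after one decode
print(len(fixed))                   # 8 characters after the repair
print(repr(fixed))                  # 'pantry购物'
```

Repairing values after the fact only hides the problem. If the diagnosis holds, the cleaner fix is to find where the second encoding happens (often a mismatch between the connection character set and what the client actually sends) and correct it there.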
#9
separating sequences of characters (typically implemented with 32 bits per character) and sequences of bytes (operates as 8 bits per byte) is what is needed to make sure the encoding/decoding relationships are done right.  python 3 has this.
#10
(Feb-06-2017, 02:28 PM)Ofnuts Wrote: I would guess that you have one UTF-8 encoding too many here.

Thank you very much.
I think the text retrieved from the database is a string (str).

I am not familiar with encoding and decoding. To put the question another way, suppose I have the string
'\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
and I want to print it so that the output shows "购物". How can I do that?

(Feb-07-2017, 01:32 AM)Skaperen Wrote: separating sequences of characters (typically implemented with 32 bits per character) and sequences of bytes (operates as 8 bits per byte) is what is needed to make sure the encoding/decoding relationships are done right.  python 3 has this.

Thank you!

I will definitely give Python 3 a shot.