Ask help for utf-8 decode/encode

forfan · Feb-06-2017, 01:57 AM

Hello, I am new to this forum.

I am currently working on a Python program that manipulates a MySQL database.

The database is set as utf-8-bin, and I use mysql-connector to connect and fetch records.
I am having trouble retrieving records written in CJK (Chinese/Japanese/Korean) languages. Here is one example I retrieved from the database:
'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
The correct record should be
'pantry购物'
which contains a Chinese phrase.

The interesting thing is that I have another program, a PHP web portal, which communicates with the same database. The PHP program retrieves the above text correctly and shows it on the website.

I also tried to decode the retrieved record as UTF-8 with RECORD.decode('utf-8','ignore'), but the program gave me garbled text instead of the correct Chinese text.

Can anyone here help me?

Thank you in advance.

**Larz60+** · (This post was last modified: Feb-06-2017, 04:27 AM by Larz60+.)

As far as I know, you must deal with each language separately:
Chinese: https://docs.python.org/3.4/library/code...20encoding
Japanese: https://docs.python.org/3.4/library/code...20encoding
Korean: https://docs.python.org/3.4/library/code...20encoding

(These may all lead to the same highlighted section of the page.)

forfan · Feb-06-2017, 04:39 AM

Thank you so much for the quick reply!

I am not sure, though: isn't CJK supposed to be unified under Unicode, and therefore under UTF-8? Am I wrong?

Skaperen · (This post was last modified: Feb-06-2017, 05:18 AM by Skaperen.)

Yes, CJK is well defined in Unicode. Unicode is then encoded into UTF-8 where needed (octet environments). A raw interface to a database that stores 8-bit bytes is one such environment. Using Python 3 for this kind of thing is highly advised.

forfan · Feb-06-2017, 05:35 AM

Hi, Skaperen

Thank you.
Do you mean that using Python 3 could solve the problem? What is the difference between 2.7 and 3 when handling UTF-8 related issues?

***Ofnuts*** · Feb-06-2017, 02:04 PM

In Python 2, there are strings (str, sequences of 8-bit bytes) and unicodes (unicode, sequences of Unicode characters).

In Python 3, there are only strings (str), and every character in them is a full Unicode character rather than a byte. Separate byte sequences (bytes) are used when you read things in binary (reading files in "b" mode, getting data through a network, etc.).
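
A minimal Python 3 sketch of that distinction, using the two Chinese characters from the original post as sample data (the variable names are only for illustration):

```python
# Python 3: text (str) and binary data (bytes) are distinct types.
text = "购物"                  # str: a sequence of Unicode characters
data = text.encode("utf-8")    # bytes: the UTF-8 encoding of that text

# The same content has different lengths depending on what is counted.
print(type(text).__name__, len(text))   # number of characters
print(type(data).__name__, len(data))   # number of bytes

# Decoding with the same codec gives back the original text.
round_trip = data.decode("utf-8")
print(round_trip == text)
```

Mixing the two types (for example, concatenating text and data) raises a TypeError in Python 3, which is what makes encoding mistakes easier to spot than in Python 2.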

forfan · Feb-06-2017, 02:06 PM

Thank you. I will try Python 3 to see whether it works.

***Ofnuts*** · Feb-06-2017, 02:28 PM

It all depends on whether 'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9' is a representation (i.e., repr) or the string itself (str).

However, if you take a closer look, in UTF-8:

C3 A8 encodes U+00E8
C2 B4 encodes U+00B4
C2 AD encodes U+00AD

And, also in UTF-8, E8 B4 AD encodes U+8D2D, which happens to be the Unicode code point for 购. So I would guess that you have one UTF-8 encoding too many here.
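
If that guess is right, the extra layer can be peeled off again. Here is a rough Python 3 sketch, not a definitive fix: undo_double_encoding is a hypothetical helper name, and cp1252 as the codec that was wrongly applied is an assumption based on the sample bytes (E2 80 B0 is U+2030, which cp1252 maps to the single byte 89, whereas latin-1 has no such character).

```python
def undo_double_encoding(raw: bytes, wrong_codec: str = "cp1252") -> str:
    """Undo one extra layer of UTF-8 encoding.

    Assumes the original UTF-8 bytes were misread as `wrong_codec`
    (an assumption, cp1252 by default) and then encoded as UTF-8 again.
    """
    mojibake = raw.decode("utf-8")                 # one character per original byte
    original_bytes = mojibake.encode(wrong_codec)  # recover the original bytes
    return original_bytes.decode("utf-8")          # decode them properly this time


# The value from the original post, written as bytes.
raw = b"pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9"
fixed = undo_double_encoding(raw)
print(fixed)
```

This only repairs values after the fact. If the data really is stored or transferred with one encoding too many, the more durable fix is probably to make the character set used by the connection match the one the data was written with.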

Note: when dealing with character encodings, it is very easy to get fooled by the terminal output. Checking lengths in bytes and in characters usually resolves ambiguities, and using repr() instead of str() usually helps.
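
As an illustration of that kind of check, here is a small Python 3 sketch applied to the value from the original post (the variable names are made up for the example). It uses ascii(), which behaves like repr() but escapes every non-ASCII character, so the terminal cannot hide anything:

```python
raw = b"pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9"
expected = "pantry购物"

decoded = raw.decode("utf-8")

# Show the decoded text with all non-ASCII characters escaped.
print(ascii(decoded))

# Compare lengths in bytes and in characters.
print("bytes received:         ", len(raw))
print("characters after decode:", len(decoded))
print("characters expected:    ", len(expected))
print("bytes expected (UTF-8): ", len(expected.encode("utf-8")))
```

If the number of characters after decoding equals the number of bytes the correct text would occupy in UTF-8, each original byte has been turned into a character of its own, which points to a double encoding.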

Skaperen · (This post was last modified: Feb-07-2017, 01:32 AM by Skaperen.)

Separating sequences of characters (typically implemented with 32 bits per character) from sequences of bytes (8 bits per byte) is what is needed to make sure the encoding/decoding relationships are done right. Python 3 has this.

forfan · (This post was last modified: Feb-07-2017, 05:49 AM by forfan.)

Thank you very much.
I think the text retrieved from the database is a string.

I am not familiar with encode/decode issues. To put the question another way: suppose that I have the string
'\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
and I want to print it so that the output shows "购物". How can I do that?

Thank you!

I will definitely give Python 3 a shot.
