Posts: 7
Threads: 1
Joined: Feb 2017
Hello, I am new to this forum.
I am currently working on a Python program that manipulates a MySQL database.
The database is set as utf-8-bin, and I use mysql-connector to connect and fetch records.
I am having trouble retrieving records written in CJK (Chinese/Japanese/Korean) languages. Here is one example I retrieved from the database:
'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
the correct record should be
'pantry购物'
which is a Chinese phrase.
The interesting thing is that I have another program, a PHP web portal, which also communicates with the same database. With the PHP program, the above phrase is correctly retrieved and shown on the website.
I also tried to decode the retrieved record as UTF-8 with RECORD.decode('utf-8','ignore'), but the program gave me garbled text rather than the correct Chinese phrase.
Can anyone here help me?
Thank you in advance.
Posts: 12,047
Threads: 487
Joined: Sep 2016
Feb-06-2017, 04:27 AM
(This post was last modified: Feb-06-2017, 04:27 AM by Larz60+.)
(Feb-06-2017, 04:27 AM)Larz60+ Wrote: As far as I know, you must deal with each language separately
Chinese: https://docs.python.org/3.4/library/code...20encoding
Japanese: https://docs.python.org/3.4/library/code...20encoding
Korean: https://docs.python.org/3.4/library/code...20encoding
(These may all lead to the same highlights on the page.)
Thank you so much for the quick reply!
I am not sure about that, though. Isn't CJK supposed to be unified under Unicode, with UTF-8 as the encoding? Am I wrong?
Posts: 4,654
Threads: 1,497
Joined: Sep 2016
Feb-06-2017, 05:17 AM
(This post was last modified: Feb-06-2017, 05:18 AM by Skaperen.)
Yes, CJK is well defined in Unicode. Unicode is then encoded into UTF-8 where needed (octet environments). A raw interface to a database that stores 8-bit bytes is one such environment. Using Python 3 for this kind of thing is highly advised.
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
(Feb-06-2017, 05:17 AM)Skaperen Wrote: Yes, CJK is well defined in Unicode. Unicode is then encoded into UTF-8 where needed (octet environments). A raw interface to a database that stores 8-bit bytes is one such environment. Using Python 3 for this kind of thing is highly advised.
Hi, Skaperen
Thank you.
Do you mean that using Python 3 could solve the problem? What is the difference between 2.7 and 3 in handling UTF-8 related issues?
Posts: 687
Threads: 37
Joined: Sep 2016
(Feb-06-2017, 05:35 AM)forfan Wrote: Do you mean that using Python 3 could solve the problem? What is the difference between 2.7 and 3 in handling UTF-8 related issues?
In Python 2, there are strings (str, where each character is an 8-bit byte) and unicodes (unicode, where each character is a Unicode code point, stored on 16 or 32 bits depending on the build).
In Python 3, there are only strings, and all of their characters are "wide" (Unicode code points). There are also byte sequences (bytes) for when you read things in binary (reading files in "b" mode, getting data through a network, etc.).
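To make the Python 3 side concrete, here is a minimal sketch (the variable names are only illustrative): a str holds characters, a bytes object holds 8-bit values, and encode/decode convert between the two.

```python
# Python 3: text (str) and binary data (bytes) are separate types.
text = "购物"                       # str: two Unicode characters
data = text.encode("utf-8")         # bytes: the UTF-8 encoding of those characters

print(type(text).__name__, len(text))   # str 2
print(type(data).__name__, len(data))   # bytes 6

# Decoding with the same codec gives the original text back.
round_trip = data.decode("utf-8")
print(round_trip == text)               # True
```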
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
(Feb-06-2017, 02:04 PM)Ofnuts Wrote: In Python 2, there are strings (str, where each character is an 8-bit byte) and unicodes (unicode, where each character is a Unicode code point, stored on 16 or 32 bits depending on the build).
In Python 3, there are only strings, and all of their characters are "wide" (Unicode code points). There are also byte sequences (bytes) for when you read things in binary (reading files in "b" mode, getting data through a network, etc.).
Thank you. I will try Python 3 to see whether it works.
(Feb-06-2017, 01:57 AM)forfan Wrote: Here is one example I retrieved from the database:
'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
the correct record should be
'pantry购物'
It all depends on whether 'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9' is a representation (i.e., repr) or a string (str).
However, if you take a closer look, in UTF-8:
- C3A8 encodes 'E8'
- C2B4 encodes 'B4'
- C2AD encodes 'AD'
And, also in UTF-8, 'E8B4AD' encodes '8D2D', which happens to be the Unicode code point for 购. So I would guess that you have one UTF-8 encoding too many here.
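The byte arithmetic above can be checked in Python 3. This is only a sketch: the name raw is hypothetical and simply holds the first three byte pairs from the example as a bytes literal.

```python
# First three two-byte groups from the retrieved value (right after 'pantry').
raw = b"\xc3\xa8\xc2\xb4\xc2\xad"

# Decoding as UTF-8 once yields three characters whose code points
# are exactly the values E8, B4 and AD.
once = raw.decode("utf-8")
print([hex(ord(ch)) for ch in once])     # ['0xe8', '0xb4', '0xad']

# Turning those code points back into bytes and decoding as UTF-8
# a second time yields a single character, U+8D2D.
inner = bytes(ord(ch) for ch in once)
char = inner.decode("utf-8")
print(hex(ord(char)), char == "购")      # 0x8d2d True
```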
Note: when dealing with character encodings, it is very easy to get fooled by the terminal output. Checking lengths in bytes and in characters usually resolves ambiguities, and using repr() instead of str() usually helps.
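A minimal sketch of that kind of check in Python 3, using the bytes from the original example (the names raw and expected are only illustrative). It uses ascii() rather than repr(), because Python 3's repr() leaves printable non-ASCII characters unescaped.

```python
# Bytes as retrieved (CJK part only) and the bytes a correct UTF-8 value would have.
raw = b"\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9"
expected = "购物".encode("utf-8")

# A correctly stored value: 6 bytes that decode to 2 characters.
print(len(expected), len(expected.decode("utf-8")))   # 6 2

# The retrieved value: 13 bytes that decode to 6 characters, a sign that
# something was encoded twice.
decoded = raw.decode("utf-8")
print(len(raw), len(decoded))                         # 13 6

# ascii() shows exactly which code points are present, independent of
# how the terminal renders them.
print(ascii(decoded))                                 # '\xe8\xb4\xad\xe7\u2030\xa9'
```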
Feb-07-2017, 01:32 AM
(This post was last modified: Feb-07-2017, 01:32 AM by Skaperen.)
Separating sequences of characters (typically implemented with 32 bits per character) from sequences of bytes (8 bits per byte) is what is needed to make sure the encoding/decoding relationships are done right. Python 3 has this.
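As an illustration of that separation (a sketch; the names are made up for the example), Python 3 refuses to combine the two kinds of sequence implicitly, so the decoding step has to be written out.

```python
label = "pantry"                          # str: a sequence of characters
payload = "购物".encode("utf-8")          # bytes: a sequence of 8-bit values

try:
    combined = label + payload            # mixing str and bytes is an error
except TypeError as exc:
    combined = None
    print("TypeError:", exc)

# The conversion must be explicit, with the encoding named.
combined = label + payload.decode("utf-8")
print(len(combined))                      # 8
```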
Feb-07-2017, 05:47 AM
(This post was last modified: Feb-07-2017, 05:49 AM by forfan.)
(Feb-06-2017, 02:28 PM)Ofnuts Wrote: It all depends on whether 'pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9' is a representation (i.e., repr) or a string (str).
So I would guess that you have one UTF-8 encoding too many here.
Thank you very much.
I think the text retrieved from the database is a string.
I am not familiar with encoding/decoding issues. To put the question another way, suppose that I have the string
' \xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9'
I want to print this string so that the output shows "购物". How can I do that?
(Feb-07-2017, 01:32 AM)Skaperen Wrote: Separating sequences of characters (typically implemented with 32 bits per character) from sequences of bytes (8 bits per byte) is what is needed to make sure the encoding/decoding relationships are done right. Python 3 has this.
Thank you!
I will definitely give Python 3 a shot.
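For the question above about turning the retrieved bytes back into "购物", one possible approach in Python 3 is to undo the extra encoding layer. This is a sketch, not a confirmed fix for the database itself: the function name undo_double_encoding is hypothetical, and the choice of cp1252 as the intermediate codec is an assumption inferred from the sample. The bytes E2 80 B0 decode to U+2030, which has no Latin-1 byte but maps to 0x89 in cp1252, so encoding with "latin-1" would raise UnicodeEncodeError on this value.

```python
def undo_double_encoding(raw: bytes) -> str:
    """Reverse one extra layer of UTF-8 encoding.

    Assumption: the original UTF-8 bytes were misread as Windows-1252
    (cp1252) text and then encoded to UTF-8 a second time.
    """
    garbled = raw.decode("utf-8")          # the mis-decoded characters
    original = garbled.encode("cp1252")    # back to the original UTF-8 bytes
    return original.decode("utf-8")        # the intended text


record = b"pantry\xc3\xa8\xc2\xb4\xc2\xad\xc3\xa7\xe2\x80\xb0\xc2\xa9"
fixed = undo_double_encoding(record)
print(fixed == "pantry购物")               # True
```

On a terminal that can display UTF-8, print(fixed) then shows pantry购物. Values that were not double-encoded in this way may raise UnicodeEncodeError or UnicodeDecodeError instead, so this is best treated as a diagnostic step rather than something to apply blindly to every record.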