Python Forum
Want a list utf8 formatted but bytestrings found
#1
# populate client listing into list
names.append( name )

names.append( '' )
names.sort()
names = tuple( [s.encode('utf8') for s in names] )
When I try the above to encode the names as UTF-8, I receive this error:

TypeError: sequence item 0: expected str instance, bytes found

Why does it say it found bytes instead of a string? How do I correct it?

I am using Python 3, and if I try to print names I get:



Quote:['', 'Alexander Lepsveridze', 'John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', '\xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82 \xce\xa4\xcf\x83\xce\xbf\xcf\x84\xcf\x85\xce\xbb\xce\xaf\xce\xbf\xcf\x85'



What are these '\xce'? Bytestrings? What I expected there was names in Greek letters, because the list was initially filled with Greek names.
Reply
#2
The \xNN sequences you see are bytes in hexadecimal representation. This is the representation (repr) of the string.
This representation is used by str, bytes and bytearray.
All characters that cannot be displayed, or that are control characters, are shown in this format.
If you print the strings, you don't see this internal representation of string literals.
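
A small sketch of the difference between the escaped representation and the text itself. The sample string is just the first two characters of the first Greek name in the list. `ascii()` is used here because it always escapes non-ASCII characters, whereas `repr()` in Python 3 escapes only the non-printable ones.

```python
# First two characters of the mis-decoded name from the list.
s = '\xce\x86'

# The string holds two code points (U+00CE and U+0086), not the
# eight literal characters backslash, x, c, e, and so on.
length = len(s)

# ascii() escapes every non-ASCII character.
escaped = ascii(s)

# repr() keeps the printable character and escapes only the control character.
shown = repr(s)

print(length)
print(escaped)
```

So the backslashes are only a way of displaying the characters; they are not stored in the string.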

With your data:
items = ['', 'Alexander Lepsveridze', 'John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', '\xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82 \xce\xa4\xcf\x83\xce\xbf\xcf\x84\xcf\x85\xce\xbb\xce\xaf\xce\xbf\xcf\x85']

for item in items:
    print(item)
Output:

Alexander Lepsveridze
John Comeau
ÎÎºÎ·Ï Î¤ÏÎ¹Î¬Î¼Î·Ï
ÎÎ¼Î¹Î»Î¿Ï Î¤ÏÎ¿Ï Ï Î»Î¯Î¿Ï
Now with a module, which can fix broken encodings:

import ftfy
items = ['', 'Alexander Lepsveridze', 'John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', '\xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82 \xce\xa4\xcf\x83\xce\xbf\xcf\x84\xcf\x85\xce\xbb\xce\xaf\xce\xbf\xcf\x85']

for item in items:
    print(ftfy.fix_encoding(item))
Output:

Alexander Lepsveridze
John Comeau
Άκης Τσιάμης
Όμιλος Τσοτυλίου
The text was originally UTF-8, but its bytes were decoded as Latin-1. Encoding back to Latin-1 and decoding as UTF-8 reverses that:

print(items[-1].encode('latin1').decode('utf8'))
Output:
Όμιλος Τσοτυλίου
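
The round trip behind that one-liner can be sketched as follows. How the text got damaged in the first place is an assumption here (UTF-8 bytes decoded as Latin-1), and the variable names are only for illustration.

```python
original = 'Όμιλος Τσοτυλίου'

# Step 1: the text is stored as UTF-8 bytes.
utf8_bytes = original.encode('utf-8')

# Step 2 (the assumed mistake): the bytes are decoded as Latin-1, so
# every byte becomes one character and the text turns into mojibake.
broken = utf8_bytes.decode('latin-1')

# Step 3 (the repair): encoding as Latin-1 recovers the exact bytes,
# which can then be decoded with the correct codec.
repaired = broken.encode('latin-1').decode('utf-8')

print(len(original), len(broken))  # the broken string is longer
print(repaired == original)
```

Latin-1 maps every byte value to exactly one character, which is why step 3 can recover the original bytes without loss.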
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#3
Hello, and thank you for the clear explanation. I tried the following after reading your post:

# populate client listing into list
names.append( name )
# ....
# ....
names.append( '' )
names.sort()

for name in names:
    name = name.encode('latin1').decode('utf8')
and the error that came up was:

Output:
UnicodeEncodeError('latin-1', 'Άκης Τσιάμης', 0, 4, 'ordinal not in range(256)')
Why can't it encode to Latin-1 and decode as UTF-8 normally?
And since the names are fetched from a MySQL database, where they were stored as UTF-8 strings, why and how did they end up in Latin-1?
Reply
#4
My example works only with the wrongly encoded strings. The string was already encoded as UTF-8, but at some point it was encoded to Latin-1. This happens very often with Python 2, because there is no difference between str and bytes. In Python 2 it is possible to encode a string twice, which results in broken text. Encoding and decoding were often used incorrectly in Python 2.

Scenario 1:
The charset of the database was changed afterwards without converting the old data. Latin-1 is a relic from ancient times, but it is still used.

Scenario 2:
The program that filled the database did the encoding wrong.
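
A minimal sketch of a repair helper that leaves already-correct strings alone, which is the case that raised the UnicodeEncodeError above. `fix_mojibake` is a hypothetical name, not part of any library, and the approach is only a heuristic: a genuine Latin-1 string whose bytes happen to form valid UTF-8 would be changed too.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mistake when possible (hypothetical helper)."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # UnicodeEncodeError: the text contains characters outside Latin-1
        # (for example real Greek letters), so it was not damaged this way.
        # UnicodeDecodeError: the resulting bytes are not valid UTF-8.
        return text


good = 'Άκης Τσιάμης'
broken = good.encode('utf-8').decode('latin-1')  # simulated damage

names = ['', 'John Comeau', good, broken]
fixed = [fix_mojibake(n) for n in names]

print(fixed == ['', 'John Comeau', good, good])
```

The ftfy module mentioned earlier applies more careful heuristics than this; the helper only illustrates why correct text must be skipped.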
Reply
#5
I'm using Python 3.

How am I supposed to save 'names' with UTF-8 encoding? Otherwise, if I print them, it prints garbage.
Reply
#6
(Feb-14-2019, 04:55 PM)nikos Wrote: How am I supposed to save 'names' with UTF-8 encoding? Otherwise, if I print them, it prints garbage.
Use UTF-8 encoding for both input and output.
s = 'Crème and Spicy jalapeño ☂'
with open('unicode.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)

with open('unicode.txt', encoding='utf8') as f:
    data = f.read()
    print(data)
Output:
Crème and Spicy jalapeño ☂
(Feb-14-2019, 01:52 PM)nikos Wrote: And since the names are fetched from a MySQL database, where they were stored as UTF-8 strings, why and how did they end up in Latin-1?
Set UTF-8 in the connection, not Latin-1 or something else.
In the connection itself, use use_unicode=True and charset utf8.
Make sure that your Unicode strings (which are normal text in Python 3) don't accidentally get converted or formatted into something else.
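
As an illustration of such an accidental conversion, here is a sketch of how the TypeError from the first post can arise. The posted code does not show where the error is raised; this sketch assumes the encoded names are later passed to `str.join`, which produces exactly that message.

```python
names = ['', 'John Comeau', 'Άκης Τσιάμης']

# .encode() turns each str into a bytes object.
encoded = tuple(s.encode('utf8') for s in names)

try:
    # str.join accepts only str items, so bytes items are rejected.
    ', '.join(encoded)
except TypeError as exc:
    message = str(exc)

print(message)

# Keeping the names as str avoids the problem; encoding to bytes is
# only needed at the boundary (file, socket, HTTP response body).
joined = ', '.join(names)
```

In Python 3 the names are already text, so there is usually no need to call `.encode('utf8')` on them before working with them.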
Reply
#7
I'm using Python 3 and pymysql, and I already have charset present:

con = pymysql.connect( db = 'clientele', user = 'vergos', passwd = '******', charset = 'utf8' )
cur = con.cursor()
From that I understand that the names fetched from the database into the Python script are fetched as UTF-8, right?

I don't convert or format the strings in the meantime. Python 3 handles the encoding, and I don't know where Latin-1 (ISO-8859-1) gets in the middle, but when I try:

names = tuple( [s.encode('latin1').decode('utf8') for s in names] )
Output:
UnicodeEncodeError('latin-1', 'Άκης Τσιάμης', 0, 4, 'ordinal not in range(256)')
That is a perfectly valid name, but it still gives an error.

Also, the strings produced, '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', are strings, not raw bytes.

Why does Python 3 store the values in hex representation instead of fetching them from the database as UTF-8?
Reply
#8
Can anyone tell me how to decode the names properly?
Reply
#9
(Feb-14-2019, 08:11 PM)nikos Wrote: Also, the strings produced, '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', are strings, not raw bytes.

Why does Python 3 store the values in hex representation instead of fetching them from the database as UTF-8?
It has nothing to do with Python 3, but with how the data is stored in the DB.
Latin-1 was used there.
nikos Wrote: Can anyone tell me how to decode the names properly?
Do you get an error if you do this?
Tested with Python 3.7:
>>> items = ['', 'Alexander Lepsveridze', 'John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', '\xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82 \xce\xa4\xcf\x83\xce\xbf\xcf\x84\xcf\x85\xce\xbb\xce\xaf\xce\xbf\xcf\x85']
>>> for i in items:
...     print(i.encode('latin-1').decode('utf-8'))    
...     

Alexander Lepsveridze
John Comeau
Άκης Τσιάμης
Όμιλος Τσοτυλίου
Reply
#10
The English names appear OK; the ones with the weird encoding appear like this:

Output:
[Fri Feb 15 15:57:23.210609 2019] [wsgi:error] [pid 6326] [remote 176.92.27.182:9372] Alexander Lepsveridze
[Fri Feb 15 15:57:23.210618 2019] [wsgi:error] [pid 6326] [remote 176.92.27.182:9372] John Comeau
[Fri Feb 15 15:57:23.210625 2019] [wsgi:error] [pid 6326] [remote 176.92.27.182:9372] \xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82
[Fri Feb 15 15:57:23.210633 2019] [wsgi:error] [pid 6326] [remote 176.92.27.182:9372] \xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82
I ran this through my Python 3 WSGI script, not via the console.

But if I run it on repl.it I get the same output as yours.

I have also changed the collation to utf8_general_ci in my database, but the names still appear weird.
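
One possible reading of that log, offered as an assumption rather than a diagnosis: the escapes are simply the UTF-8 bytes of a correctly decoded name, shown in escaped form by whatever writes the log. The sketch below only checks that the UTF-8 bytes of the Greek name have the same escaped form as the log entry; `log_text` is copied from the output above, and the name is written with escapes so the check does not depend on the source file's encoding.

```python
# 'Άκης Τσιάμης' written with explicit code points.
name = '\u0386\u03ba\u03b7\u03c2 \u03a4\u03c3\u03b9\u03ac\u03bc\u03b7\u03c2'

# Text of the third log entry, as a raw string (literal backslashes).
log_text = (r'\xce\x86\xce\xba\xce\xb7\xcf\x82 '
            r'\xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82')

# repr() of a bytes object looks like b'...'; strip the b'' wrapper.
escaped_utf8 = repr(name.encode('utf-8'))[2:-1]

print(escaped_utf8 == log_text)

# To see what a string really contains regardless of the output
# encoding, log its ASCII-only representation instead of the text.
print(ascii(name))
```

If `ascii(name)` logged from the WSGI script shows `\u0386...` escapes, the string is already correct Greek text and only the log display is escaped; if it shows `\xce\x86...`, the string itself is mis-decoded.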
Reply

