Want a list utf8 formatted but bytestrings found
#1
# populate client listing into list
names.append( name )

names.append( '' )
names.sort()
names = tuple( [s.encode('utf8') for s in names] )
When I try the above to encode the names as UTF-8, I receive this error:

TypeError: sequence item 0: expected str instance, bytes found

Why does it say it found bytes instead of a string? How can I correct it?

I'm using Python 3, and if I try to print names I get:



Quote:['', 'Alexander Lepsveridze', 'John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', '\xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82 \xce\xa4\xcf\x83\xce\xbf\xcf\x84\xcf\x85\xce\xbb\xce\xaf\xce\xbf\xcf\x85'



What are these '\xce'? Bytestrings? What I expected to see there was names in Greek letters, because the list was initially filled with Greek names.
Quote
#2
The \xNN sequences you see are hexadecimal escape sequences. This is the representation (repr) of the string.
This notation is used by str, bytes and bytearray.
All characters that cannot be displayed, or that are control characters, are shown in this format.
If you print the string, you don't see this escaped representation.
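To make the difference between a str and a bytes object concrete, here is a minimal sketch. The variable names are only for illustration, and ascii() is used because it always escapes non-ASCII characters, whatever the terminal does:

```python
# A str whose four characters are the UTF-8 bytes of "Άκ" misread one by one.
broken = '\xce\x86\xce\xba'
# The same four values as a real bytes object.
raw = b'\xce\x86\xce\xba'

print(type(broken).__name__, len(broken))  # str 4
print(type(raw).__name__, len(raw))        # bytes 4

# ascii() escapes every non-ASCII character, so the \xNN form is visible.
print(ascii(broken))

# Only the bytes object can be decoded; decoding as UTF-8 gives two Greek letters.
decoded = raw.decode('utf-8')
print(len(decoded))           # 2
print(decoded == 'Άκ')        # True
```

Both objects look alike when escaped, but only one of them is text: the str already consists of characters, just the wrong ones.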

With your data:
items = ['', 'Alexander Lepsveridze', 'John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', '\xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82 \xce\xa4\xcf\x83\xce\xbf\xcf\x84\xcf\x85\xce\xbb\xce\xaf\xce\xbf\xcf\x85']

for item in items:
    print(item)
Output:

Alexander Lepsveridze
John Comeau
ÎÎºÎ·Ï Î¤ÏÎ¹Î¬Î¼Î·Ï
ÎÎ¼Î¹Î»Î¿Ï Î¤ÏÎ¿Ï Ï Î»Î¯Î¿Ï
Now with a module, which can fix broken encodings:

import ftfy
items = ['', 'Alexander Lepsveridze', 'John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', '\xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82 \xce\xa4\xcf\x83\xce\xbf\xcf\x84\xcf\x85\xce\xbb\xce\xaf\xce\xbf\xcf\x85']

for item in items:
    print(ftfy.fix_encoding(item))
Output:

Alexander Lepsveridze
John Comeau
Άκης Τσιάμης
Όμιλος Τσοτυλίου
The text was originally UTF-8, but its bytes were decoded as latin1.

print(items[-1].encode('latin1').decode('utf8'))
Output:
Όμιλος Τσοτυλίου
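As a sketch of how such a broken string comes into existence and why the one-liner above repairs it, the round trip can be replayed step by step. The variable names are made up for this example:

```python
original = 'Όμιλος'

# Step 1: the text is correctly encoded to UTF-8 (two bytes per Greek letter).
utf8_bytes = original.encode('utf-8')

# Step 2 (the mistake): the bytes are decoded as latin1.
# latin1 maps every byte to exactly one character, so this never raises.
mojibake = utf8_bytes.decode('latin-1')

# Step 3 (the repair): undo step 2, then decode with the right codec.
repaired = mojibake.encode('latin-1').decode('utf-8')

print(len(original), len(utf8_bytes), len(mojibake))  # 6 12 12
print(ascii(mojibake))
print(repaired == original)  # True
```

The repair only works because latin1 is a lossless one-byte-per-character mapping; with most other codecs step 2 would already have lost information.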
My code examples are always for Python >=3.6.0
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Quote
#3
Hello, and thank you for the clear explanation. I tried the following after reading your post:

# populate client listing into list
names.append( name )
....
....
names.append( '' )
names.sort()

for name in names:
    name = name.encode('latin1').decode('utf8')
and the error that was presented was:

Output:
UnicodeEncodeError('latin-1', 'Άκης Τσιάμης', 0, 4, 'ordinal not in range(256)')
Why can't it encode to latin1 and decode as UTF-8 normally?
And since 'names' are being fetched from a MySQL database, where they were stored as UTF-8 strings, why/how did the 'names' end up in latin-1?
Quote
#4
My example only works with the wrongly encoded strings. The string in your error message was already correct Unicode text, so it contains characters that latin1 cannot represent and the encode step fails. But at some point the other strings were run through latin1. This happens very often with Python 2, because there str and bytes are the same type. With Python 2 it is possible to encode a string twice, which ends in broken text. Encoding/decoding was often used incorrectly in Python 2.

Scenario 1:
The charset of the database was changed afterwards without converting the old data. Latin1 is a relic from ancient times, but it's still used.

Scenario 2:
The program, which filled the database, did the encoding wrong.
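If the list mixes broken and correct strings, as both scenarios would produce, a defensive repair could look like the sketch below. fix_mojibake is a hypothetical helper written for this thread, not part of any library, and it assumes the only kind of damage is UTF-8 bytes misread as latin1:

```python
def fix_mojibake(text):
    """Hypothetical helper: undo 'UTF-8 decoded as latin1' when it applies."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except UnicodeEncodeError:
        # Characters above U+00FF: the text is already real Unicode.
        return text
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8: genuine latin1-range text.
        return text


names = ['', 'John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82', 'Άκης', 'Crème']
fixed = [fix_mojibake(n) for n in names]

print(fixed[2] == fixed[3] == 'Άκης')  # True
print(fixed[4] == 'Crème')             # True
```

This is a workaround for reading damaged data; the cleaner fix is to correct the data or the connection settings at the source.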
Quote
#5
I'm using Python 3.

How am I supposed to save 'names' with UTF-8 encoding? Otherwise, if I print them, it prints garbage.
Quote
#6
(Feb-14-2019, 04:55 PM)nikos Wrote: How am I supposed to save 'names' with UTF-8 encoding? Otherwise, if I print them, it prints garbage.
Use the utf-8 encoding both when writing and when reading.
s = 'Crème and Spicy jalapeño ☂'
with open('unicode.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)

with open('unicode.txt', encoding='utf8') as f:
    data = f.read()
    print(data)
Output:
Crème and Spicy jalapeño ☂
(Feb-14-2019, 01:52 PM)nikos Wrote: And since 'names' are being fetched from a MySQL database, where they were stored as UTF-8 strings, why/how did the 'names' end up in latin-1?
Set UTF-8 in the connection, not latin1 or something else.
In the connection itself, use use_unicode=True and charset utf8.
Make sure that your Unicode strings (which are normal text in Python 3) don't accidentally get converted or formatted into something else.
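An accidental conversion of this kind is a plausible source of the TypeError from the first post. The traceback there does not show the failing line, so the str.join call below is an assumption about where it happened; the sketch only reproduces the same message:

```python
names = ['', 'John Comeau', 'Άκης Τσιάμης']

# Encoding every element turns the list of str into a tuple of bytes.
encoded = tuple(s.encode('utf-8') for s in names)

# Joining bytes with a str separator is what raises the TypeError.
try:
    ', '.join(encoded)
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Keep str inside the program and encode once, at the output boundary.
joined = ', '.join(names)
payload = joined.encode('utf-8')
print(type(joined).__name__, type(payload).__name__)  # str bytes
```

In Python 3 there is normally no reason to call .encode() on values that stay inside the program; it is only needed when handing bytes to a file, socket or similar.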
Quote
#7
I'm using Python 3 and pymysql, and I already have the charset set:

con = pymysql.connect( db = 'clientele', user = 'vergos', passwd = '******', charset = 'utf8' )
cur = con.cursor()
From that I understand that the names fetched from the db into the Python script are fetched as UTF-8, right?

I don't convert or format the strings in the meantime. Python 3 handles the encoding, and I don't know where latin1 gets into the picture, but when I try:

names = tuple( [s.encode('latin1').decode('utf8') for s in names] )
Output:
UnicodeEncodeError('latin-1', 'Άκης Τσιάμης', 0, 4, 'ordinal not in range(256)')
which is a perfectly valid name but still gives an error.

Also, the strings produced, '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', are strings, not raw bytes.

Why does Python 3, instead of fetching the values from the db as 'utf8', store the values in hex representation?
Quote
#8
Can anyone tell me how to decode the names properly?
Quote
#9
(Feb-14-2019, 08:11 PM)nikos Wrote: Also, the strings produced, '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', are strings, not raw bytes.

Why does Python 3, instead of fetching the values from the db as 'utf8', store the values in hex representation?
It has nothing to do with Python 3, but with how the data is stored in the DB.
They have used latin-1.
nikos Wrote: Can anyone tell me how to decode the names properly?
Do you get an error if you do this?
Test Python 3.7:
>>> items = ['', 'Alexander Lepsveridze', 'John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82', '\xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82 \xce\xa4\xcf\x83\xce\xbf\xcf\x84\xcf\x85\xce\xbb\xce\xaf\xce\xbf\xcf\x85']
>>> for i in items:
...     print(i.encode('latin-1').decode('utf-8'))    
...     

Alexander Lepsveridze
John Comeau
Άκης Τσιάμης
Όμιλος Τσοτυλίου
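Because an escaped display such as \xce\x86 can come either from a broken str or from a log writer that escapes non-ASCII output, it may help to look at the code points directly. describe is a hypothetical diagnostic written for this thread, and its three categories are a rough heuristic, not a guarantee:

```python
def describe(text):
    """Hypothetical diagnostic: classify a str by its highest code point."""
    highest = max(map(ord, text), default=0)
    if highest < 0x80:
        return 'ascii'
    if highest <= 0xFF:
        # Everything fits in one byte: typical for UTF-8 misread as latin1.
        return 'latin1-range (possibly mojibake)'
    return 'unicode'


samples = ['John Comeau', '\xce\x86\xce\xba\xce\xb7\xcf\x82', 'Άκης']
for sample in samples:
    print(ascii(sample), '->', describe(sample))
```

A name reported as 'unicode' needs no repair, which would also explain a UnicodeEncodeError from encode('latin1'); one reported as 'latin1-range' is a candidate for the encode/decode round trip.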
Quote
#10
The English names appear OK; the other ones, with the weird encoding, appear like this:

Output:
[Fri Feb 15 15:57:23.210609 2019] [wsgi:error] [pid 6326] [remote 176.92.27.182:9372] Alexander Lepsveridze
[Fri Feb 15 15:57:23.210618 2019] [wsgi:error] [pid 6326] [remote 176.92.27.182:9372] John Comeau
[Fri Feb 15 15:57:23.210625 2019] [wsgi:error] [pid 6326] [remote 176.92.27.182:9372] \xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82
[Fri Feb 15 15:57:23.210633 2019] [wsgi:error] [pid 6326] [remote 176.92.27.182:9372] \xce\x8c\xce\xbc\xce\xb9\xce\xbb\xce\xbf\xcf\x82
I ran this through my Python 3 wsgi script, not via the console.

But if I run it in repl.it, I get the same output as yours.

I have also changed the collation to utf8_general_ci in my database, but the names still appear weird.
Quote
