i thought everything was utf-8 in python3

Skaperen · Jan-19-2017, 05:46 AM

i thought everything was utf-8 in python3. but apparently, this is not so. what suggests this to me is the "ascii" in the error message.

this is the minimal code:

word_file_name = '/usr/share/dict/american-english'
with open( word_file_name, 'r' ) as word_file:
    words = word_file.read( 4194304 )
print( 'there are', len(words), 'words' )

and this is the result of running it:

Output:
lt1/forums /home/forums 3> python3 words.py
Traceback (most recent call last):
  File "words.py", line 3, in <module>
    words = word_file.read( 4194304 )
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 9104: ordinal not in range(128)
lt1/forums /home/forums 4>

it was run on ubuntu 16.04.1
this is reading the american english word dictionary file which has a lot of foreign words with foreign characters in it.

anyone know how to fix this or make it work?

i googled for answers and most had various "solutions" to make things be utf-8 (and a few others) but nothing worked.

the code is the start of bigger code to do processing using these words.

***snippsat*** · Jan-19-2017, 06:18 AM

(Jan-19-2017, 05:46 AM)Skaperen Wrote: i thought everything was utf-8 in python3.

Not everything is utf-8; a file can be in another encoding, e.g. latin-1.
In Python 3 all strings are sequences of Unicode characters (the bytes they come from can be in different encodings).
So everything that comes into Python 3 as text needs an encoding; Python 2 cheated here.

Right now your file is being read as ascii, so set the encoding explicitly.

with open('some_file.txt', encoding='utf8') as f:
    data = f.read()

If that gives an error, try latin-1.

There is also chardet, which can detect encodings pretty well.
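As a rough sketch of the "try utf-8, then latin-1" advice above: the helper name `read_text` and the temporary file names are made up for illustration. Note that latin-1 accepts any byte sequence, so it has to come last and can silently give wrong characters if the guess is wrong.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the encodings fit: %r" % (encodings,))


# Demo on two throwaway files (hypothetical stand-ins for a real word list).
word = "Asunci\u00f3n\n"  # \u00f3 is the letter o with an acute accent
with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8.txt")
    latin1_path = os.path.join(tmp, "latin1.txt")
    with open(utf8_path, "wb") as f:
        f.write(word.encode("utf-8"))
    with open(latin1_path, "wb") as f:
        f.write(word.encode("latin-1"))
    utf8_result = read_text(utf8_path)
    latin1_result = read_text(latin1_path)

# Only the detected encoding names are printed, so this is safe on any terminal.
print(utf8_result[1], latin1_result[1])
```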

Skaperen · Jan-19-2017, 06:43 AM

Output:
lt1/forums /home/forums 16> chardet /usr/share/dict/american-english
/usr/share/dict/american-english: utf-8 with confidence 0.99

wavic · (This post was last modified: Jan-19-2017, 07:32 AM by wavic.)

Hm! It is American English, as you say. How about putting an encoding declaration at the top of the script? I don't know the encoding name for American English, but you may know it.

#!/usr/bin/env python3
# -*- coding: windows-1252 -*-

For example

us-ascii?
http://scratchpad.wikia.com/wiki/Charact..._Languages

Skaperen · (This post was last modified: Jan-21-2017, 09:56 AM by Skaperen.)

(Jan-19-2017, 07:32 AM)wavic Wrote: Hm! It is American English, as you say. How about putting an encoding declaration at the top of the script? I don't know the encoding name for American English, but you may know it.
#!/usr/bin/env python3
# -*- coding: windows-1252 -*-
For example

us-ascii?
http://scratchpad.wikia.com/wiki/Charact..._Languages

this defines the coding of the source file itself, not the data file being read. a program/script may read many files, each with a different encoding.

so what is needed is a way to define this about the data file (which is static read-only data).

and chardet thought this file is utf-8. maybe it is utf-8 or maybe it is latin-1.
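A small sketch of the point above, that the encoding belongs to each data file and not to the script. In the Python versions discussed here, `open()` without `encoding=` falls back to `locale.getpreferredencoding(False)`, which is probably where the "ascii" in the traceback comes from (an assumption about this system's locale). The file names below are hypothetical.

```python
import locale
import os
import tempfile

# What open() would use if no encoding= were given (locale dependent).
print("default text encoding:", locale.getpreferredencoding(False))

# Hypothetical data files, each stored in its own encoding.
samples = {"utf8_words.txt": "utf-8", "latin1_words.txt": "latin-1"}
decoded = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, enc in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write("Asunci\u00f3n\n".encode(enc))
        # The encoding is stated per file, independent of the source coding line.
        with open(path, "r", encoding=enc) as f:
            decoded[name] = f.read()

# Different bytes on disk, identical text once decoded.
print(decoded["utf8_words.txt"] == decoded["latin1_words.txt"])
```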

***snippsat*** · Jan-21-2017, 10:08 AM

(Jan-21-2017, 09:55 AM)Skaperen Wrote: and chardet thought this file is utf-8. maybe it is utf-8 or maybe it is latin-1.

With 0.99 confidence back, it's utf-8.
Doesn't my code work? It will read the file in as utf-8.

Skaperen · Jan-23-2017, 01:53 AM

(Jan-21-2017, 10:08 AM)snippsat Wrote: Doesn't my code work,it will read it in as utf-8?

i didn't see any complete code that i could try. and what i need to end up with is a list of every word or line in the file. then i need to remove all words that cannot be in ascii. i want a list of the ascii words (but i will try other words).
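One possible way to get "only the words that fit in ascii", sketched on an in-memory sample instead of the real dictionary file. The helper name `is_ascii` is made up here. `str.isalpha()` alone would not do it, since it is also true for letters such as ó.

```python
def is_ascii(word):
    """True if every character of word has a code point below 128."""
    try:
        word.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Stand-in for the text read from the word file (one word per line).
sample = "apple\nAsunci\u00f3n\nAsunci\u00f3n's\ncat's\nzebra\n"

words = sample.splitlines()
ascii_words = [w for w in words if is_ascii(w)]

print(ascii_words)
```

Add `and w.isalpha()` to the filter if possessives like cat's should be dropped too.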

Skaperen · Jan-23-2017, 03:44 AM

i messed around with my app starter code using some things google found, and it now looks like print() is the troublemaker i can't seem to fix.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import division, print_function, unicode_literals

import sys

word_file_name = '/usr/share/dict/american-english'


def main( args ):
    """Display selected words"""

# read in big list of words from the system dictionary

    with open( word_file_name, 'r', encoding='utf-8' ) as word_file:
        words = word_file.read( 4194304 )
    word_list = words.split('\n')
    
# rebuild word list just from words with all chars ord()<128

    new_words = []
    for word in word_list:
        if word.isalpha():
            new_words.append( word )

# output all of the selected words

    for word in new_words:
        print( word )
    sys.stdout.flush()

# all done

    return
    
if __name__ == '__main__':
    try:
        result = main( sys.argv )
        sys.stdout.flush()
    except KeyboardInterrupt:
        result = 141
        print( '' )
    except IOError:
        result = 142
    try:
        exit( int( result ) )
    except ValueError:
        print( str( result ), file=sys.stderr )
        exit( 1 )
    except TypeError:
        if result == None:
            exit( 0 )
        exit( 255 )

# EOF

Output:
lt1/forums /home/forums 6> py3 chartest.py|wc -l
Traceback (most recent call last):
  File "chartest.py", line 38, in <module>
    result = main( sys.argv )
  File "chartest.py", line 29, in main
    print( word )
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 6: ordinal not in range(128)
622
lt1/forums /home/forums 7>

a funny thing about this error message is that the actual trouble word has bytes c3b3 for the utf-8 encoded character that is triggering this problem. the whole word in hex is 4173756e6369c3b36e. i don't know what '\xf3' the error message is referring to or what it could mean.
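A short worked example of where '\xf3' comes from. The string holds code points, and ó is code point U+00F3; the bytes c3 b3 are only its utf-8 encoding. The error message reports the code point that the ascii codec could not encode, at character position 6.

```python
word = "Asunci\u00f3n"          # the word from the dictionary file

code_point = ord(word[6])       # the character at position 6
encoded = word.encode("utf-8")  # the bytes as stored in the file

print(hex(code_point))          # the \xf3 from the error message
print(encoded.hex())            # the hex dump of the whole word
print(len(word), len(encoded))  # characters versus bytes

# Encoding to ascii fails at the same position the traceback reported.
try:
    word.encode("ascii")
    error_position = None
except UnicodeEncodeError as exc:
    error_position = exc.start
print(error_position)
```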

my terminal console displays that whole word correctly.

Output:
lt1/forums /home/forums 7> fgrep -n Asunci /usr/share/dict/american-english
1053:Asunción
1054:Asunción's
lt1/forums /home/forums 8>

it turns out print() will work if i do print( word.encode() ) in this case. does anyone know why strings (as opposed to bytes) and utf-8 do not play well together in python 3?

***snippsat*** · (This post was last modified: Jan-23-2017, 05:43 AM by snippsat.)

# -*- coding: utf-8 -*-
from __future__ import division, print_function, unicode_literals

Remove these lines; they do nothing in Python 3, where utf-8 is the default source encoding.

Quote:UnicodeEncodeError: 'ascii' codec

It's strange that Python's print() gives this error.
You are running it through the shell with sys.argv, so maybe that causes it.

There is some not-so-good stuff in your code.
word_file_name = '/usr/share/dict/american-english'
This should not be in the global namespace, but given as an argument to the function.
def main( args ):
The argument args is not used in the function?

Quote:it turns out print() will work if i do print( word.encode() )

It makes no sense that you should have to convert back to bytes to print it.
Test it online at repl.it and see if it works.

Shell test: I don't have Linux with Python 3 available right now, but the commands should be the same.

C:\
λ which python
/c/python36/python

C:\
λ python -c "import sys; print(sys.stdout.encoding)"
utf-8

C:\
λ python -c "print('1053:Asunción')"
1053:Asunción

C:\
λ python -c "print('Spicy jalapeño ☂')"
Spicy jalapeño ☂

Quote:does anyone know why strings (as opposed to bytes) and utf-8 do not play well together in python 3?

They made sure (a very important design decision) not to mix bytes and strings;
it would have been terrible (back to Python 2) if they worked together.
Now a string is a sequence of Unicode characters.
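A tiny illustration of that separation, using the jalapeño word from the shell test: str and bytes refuse to combine, and the conversion happens explicitly with encode() and decode().

```python
text = "jalape\u00f1o"       # str: a sequence of Unicode characters
data = text.encode("utf-8")  # bytes: one particular encoding of that text

# Mixing the two types is an error in Python 3 instead of a silent guess.
try:
    text + data
    mixed = True
except TypeError:
    mixed = False

roundtrip = data.decode("utf-8")  # back to str, explicitly

print(mixed, len(text), len(data))
```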

Skaperen · Jan-23-2017, 08:20 AM

i still don't know, for this website, how to quote a post and break it into pieces to do split replies. so i will just answer by "quoting" text by typing it or doing copy/paste. errors can easily happen.

this is a script just for testing stuff. i do not commit time for good code in this case. that's one reason the dictionary file name is hardcoded in the source.

this is going off from the original program. the original has a failure that was not obvious to me. so i do the "make a minimal example" thing. usually that means i copy the original to a new name (forked if i am keeping it under my revision control system, which is rare) and reduce it in steps, keeping the original error. this can be a messy process and many things remain just because i didn't get to them or was not sure. that's one reason why those two lines you suggested i remove were there. another is that i was testing python 2, too. but, no longer. so, out they go.

as for errors i think str.encode() is a big culprit. it may end up being argued that this kind of thing should never be a string.

having args in main is just another leftover. i don't worry about it for now. it's not in the very minimal code (testutf.py)

i am going to have trouble doing a couple of those commands you gave me as my shell (bash 4.3.46(1)) does not seem to properly handle utf-8 input. it may be the gnu readline thing doing this. i just did a test and it appears to be a readline issue.

i am seeing reasons for separating a lot of this stuff. but i need a way to print the utf-8 stuff i have without it being modified. doing print() of bytes just gets the equivalent of repr() to run. so print('abc'.encode()) gets b'abc' (just tried it on repl.it).
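One way to get the utf-8 bytes out unmodified, as a sketch: write the encoded bytes to a binary stream instead of calling print() on them. On a real run the stream would be `sys.stdout.buffer`; the function name `write_words` is made up, and the demo writes into an `io.BytesIO` so the result can be inspected.

```python
import io
import sys


def write_words(words, stream=None):
    """Write each word as utf-8 bytes plus a newline to a binary stream."""
    if stream is None:
        stream = sys.stdout.buffer  # the byte-level side of stdout
    for word in words:
        stream.write(word.encode("utf-8") + b"\n")
    stream.flush()


# Demo against an in-memory buffer instead of the terminal.
buf = io.BytesIO()
write_words(["Asunci\u00f3n", "zebra"], buf)

print(buf.getvalue())
```

Another option that may be enough here, assuming the cause is a C/POSIX locale: run the script with the environment variable PYTHONIOENCODING set to utf-8, so that plain print( word ) encodes as utf-8.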
