Python Forum

Pages: 1 2

i'm reading each line of an open file in a loop. how can i specify that the file is stored as UTF-8 and that i want to read it into Unicode strings? this is in Python3.

open() takes an optional encoding parameter.

From the docs:

Quote:encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

Here is how to use what @buran linked to:

uni_hello = 'hello Χαίρετε добры дзень 여보세요'

with open('uni_hello.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(uni_hello)

with open('uni_hello.txt', encoding="utf-8") as f:
    print(f.read())

Output:
hello Χαίρετε добры дзень 여보세요
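Since the question was about reading line by line, here is a minimal sketch of the same idea with a loop. The file name uni_lines.txt and the temporary directory are only assumptions for the demonstration; each line comes back as a normal Python 3 str.

```python
import os
import tempfile

text = 'hello Χαίρετε\nдобры дзень\n여보세요\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this demonstration.
    path = os.path.join(tmp_dir, 'uni_lines.txt')

    # Write the sample text as UTF-8 bytes.
    with open(path, 'w', encoding='utf-8', newline='\n') as f_out:
        f_out.write(text)

    # Read it back one line at a time; every line is decoded from UTF-8 to str.
    lines = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            lines.append(line.rstrip('\n'))

for line in lines:
    print(line)
```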

A couple of tips for when all else fails:

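with open('some_file', encoding='utf-8', errors='ignore') as f:
    text = f.read()  # bytes that are not valid UTF-8 are silently dropped
with open('some_file', encoding='utf-8', errors='replace') as f:
    text = f.read()  # bytes that are not valid UTF-8 become U+FFFD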
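To see what the two error handlers actually do, here is a small sketch on a byte string instead of a file. The sample bytes are made up for the illustration: they are Latin-1 encoded, so the byte 0xe9 is not valid UTF-8.

```python
# 'café ok' encoded as Latin-1; the byte 0xe9 is invalid as UTF-8.
data = b'caf\xe9 ok'

# errors='ignore' drops the undecodable byte.
ignored = data.decode('utf-8', errors='ignore')

# errors='replace' substitutes U+FFFD (the replacement character).
replaced = data.decode('utf-8', errors='replace')

print(repr(ignored))   # 'caf ok'
print(repr(replaced))  # 'caf\ufffd ok'
```

Both handlers lose information, so they are a last resort for when the real encoding cannot be found.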

ftfy fixes Unicode that’s broken in various ways.

what if i am not the one doing the open? the file is already open. it might be stdin. it might be passed by a call to this code. it might be returned from a function call.

is there a change_encoding() call that i could use instead of open()?
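There is no change_encoding() in the standard library, but one possible approach is to detach the binary buffer from the open text stream and wrap it again with the encoding you want. This is only a sketch: reopen_as_utf8 is a hypothetical helper name, and it assumes nothing has been read from the stream yet, because data already decoded by the old wrapper would be lost.

```python
import io

def reopen_as_utf8(stream):
    """Re-wrap an already open text stream so it is decoded as UTF-8.

    Hypothetical helper. It assumes `stream` is an io.TextIOWrapper
    (like sys.stdin or a file from open()) that has not been read yet.
    After detach() the old wrapper must not be used any more.
    """
    return io.TextIOWrapper(stream.detach(), encoding='utf-8')

# Demonstration with an in-memory stream that was "opened" as ASCII.
raw = io.BytesIO('Copyright © 2016\n'.encode('utf-8'))
ascii_stream = io.TextIOWrapper(raw, encoding='ascii')

utf8_stream = reopen_as_utf8(ascii_stream)
lines = [line for line in utf8_stream]
print(lines)
```

For standard input this would be used as sys.stdin = reopen_as_utf8(sys.stdin). From Python 3.7 on, text streams also have a reconfigure() method that accepts an encoding argument, which avoids replacing the object.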

and it looks like Python3 still needs some more tweaks. the print() function complains about the "ascii" encoding if given a Unicode character to print on a platform that supports UTF-8, yet has no support for an encoding option to fix it.

Python3 is not ready for prime time (not that Python2 is even close).

i was moving over to Python3 fully, abandoning Python2. i think i need to cancel that plan.

Python3 needs to assume the world is Unicode/UTF-8 and deal with the few exceptions as special cases.

there should be a sys.encoding() function that sets a program-wide encoding for everything.

(Jun-27-2018, 10:02 PM)Skaperen Wrote: [ -> ]the print() function complains about the "ascii" encoding if given a Unicode character to print on a platform that supports UTF-8, yet has no support for an encoding option to fix it.

Can you give an example?

Quote:i was moving over to Python3 fully, abandoning Python2. i think i need to cancel that plan.

That makes no sense. Most people have moved over by now; I moved over fully about 4 years ago, with Python 3.4.

i ran the command py3 seeintent.py fd12.py, which printed lots of ASCII debugging output, then printed the file it read in, and ended with:

Output:
Traceback (most recent call last):
  File "seeintent.py", line 54, in <module>
    print(line)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 10: ordinal not in range(128)
lt1/pdh /home/pdh 73> lines 53 56 seeintent.py
for line in a:
    print(line)
stdout.flush()
lt1/pdh /home/pdh 74> lines 15 18 fd12.py
__license__ = """
Copyright © 2016, by Phil D. Howard - all other rights reserved

lt1/pdh /home/pdh 75>

the character at issue is '©' (code point U+00A9) which should be handled OK if stdout was open()'d with encoding='utf-8'. but it wasn't. and i couldn't

(Jun-28-2018, 02:01 AM)Skaperen Wrote: [ -> ]UnicodeEncodeError: 'ascii' codec can't encode character

This shows that your terminal/OS setup is the problem, or maybe you are calling Python 2; I think I mentioned this to you before.
With a UTF-8 locale you should not get UnicodeEncodeError: 'ascii' in Python 3, but UnicodeEncodeError: 'utf-8' if there is a problem with encoding.

Quote:The default encoding for Python 3 source code is UTF-8.
The default encoding for Python 2 source code is ASCII.

mint@mint ~ $ python3
Python 3.6.4 (default, Mar 15 2018, 15:35:10) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '\xa9'
>>> s
'©'
>>> print(s)
©
>>> b = b'\xa9'
>>> b
b'\xa9'
>>> print(b.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte
>>> print(b.decode('latin'))
©
>>>

Test:

python3 -c "import sys; print(sys.stdout.encoding)"
UTF-8

mint@mint ~ $ python3 -c "print('Spicy jalapeño ☂')"
Spicy jalapeño ☂

mint@mint ~ $ python3
Python 3.6.4 (default, Mar 15 2018, 15:35:10) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'
>>> exit()

locale:

mint@mint ~ $ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Read and write:

# uni.py
uni = '©'
with open('uni.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(uni)
 
with open('uni.txt', encoding="utf-8") as f:
    print(f.read())

Output:
mint@mint ~ $ python3 uni.py
©

If the encoding of a file is unknown, you can only guess.
The chardet module may help you.

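import chardet

with open('./test_seq.py', 'rb') as fd:
    # read bytes, not text, so chardet can inspect the raw data
    for line in fd:
        # detect() can return None for 'encoding'; fall back to UTF-8
        encoding = chardet.detect(line).get('encoding') or 'utf-8'
        print(line.decode(encoding, errors='replace'), end='')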

To fix wrongly encoded strings, you can use ftfy.
In that case the input has to be a str and not bytes. The documentation says that you should try
chardet to guess the encoding of bytes.
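To make clear what a "wrongly encoded string" looks like, here is a small standard-library sketch of the classic case: UTF-8 bytes decoded as Latin-1. The manual round trip below is only a stand-in for the illustration. It is not how ftfy works internally, and it only helps when you already know which wrong decoding was applied.

```python
original = '©'

# Simulate the mistake: UTF-8 bytes (0xC2 0xA9) decoded as Latin-1.
mojibake = original.encode('utf-8').decode('latin-1')

# Undo it by reversing the wrong decoding, then decoding correctly.
repaired = mojibake.encode('latin-1').decode('utf-8')

print(repr(mojibake))  # 'Â©'
print(repr(repaired))  # '©'
```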

the encoding of the file is UTF-8. some files may still be ASCII, but fd12.py is UTF-8. and i explicitly ran "py3" which is my own symlink to "python3".

so, how can i find the encoding that python3 has decided for my terminal? maybe that is the issue.

(Jun-28-2018, 08:53 PM)Skaperen Wrote: [ -> ]how can i find the encoding that python3 has decided for my terminal? maybe that is the issue.

sys.stdout.encoding should be set correctly;
that allows the terminal to work seamlessly with Python 3 strings (Unicode by default).

python3 -c "import sys; print(sys.stdout.encoding)"
UTF-8
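The same check can be done from inside a script, which is handy when the failing program is started in a different way than the one-liner. This is a minimal sketch; encoding_report is a hypothetical helper name, and getattr() is used because a replaced or missing stream may not have an encoding attribute.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings Python picked for this process (hypothetical helper)."""
    return {
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        'stderr': getattr(sys.stderr, 'encoding', None),
        # Encoding that open() uses when no encoding argument is given.
        'preferred': locale.getpreferredencoding(False),
        # Encoding used for file names and command-line arguments.
        'filesystem': sys.getfilesystemencoding(),
    }

report = encoding_report()
for name, value in report.items():
    print(name, value)
```

If stdout shows something like ANSI_X3.4-1968 or ascii, printing '©' will fail in the way shown in the traceback earlier in the thread.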

Test Unicode and look at locale.

mint@mint ~ $ python3 -c "print('Spicy 富强 ☂')"
Spicy 富强 ☂

mint@mint ~ $ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

After setting the locale to, for example, C, the terminal should no longer be able to handle that line.

mint@mint ~ $ export LC_ALL=C
mint@mint ~ $ export LANG=C

mint@mint ~ $ locale
LANG=C
LANGUAGE=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C

mint@mint ~ $ python3 -c "print('Spicy 富强 ☂')"
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 13-18: surrogates not allowed
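The "surrogates not allowed" message can be reproduced in plain Python. As far as I understand it, under the C locale the command-line bytes are decoded as ASCII with the surrogateescape error handler, so every non-ASCII byte turns into a lone surrogate, and those cannot be encoded as UTF-8 afterwards. A sketch of that mechanism:

```python
# The six UTF-8 bytes of the two Chinese characters from the command line.
raw = '富强'.encode('utf-8')

# What an ASCII decoder with surrogateescape makes of them:
# one lone surrogate per undecodable byte.
arg = raw.decode('ascii', errors='surrogateescape')

try:
    arg.encode('utf-8')
    reason = None
except UnicodeEncodeError as exc:
    reason = exc.reason

# The original bytes are still recoverable with the same error handler.
recovered = arg.encode('ascii', errors='surrogateescape')

print(len(arg), reason)
```

Six bytes give six surrogates, which matches "position 13-18" in the error message above.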

mint@mint ~ $ python3 -c "import sys; print(sys.stdout.encoding)"
ANSI_X3.4-1968
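If the locale cannot be fixed, one workaround is to force the output stream to UTF-8 from inside the program. This is only a sketch under assumptions: force_utf8 is a hypothetical helper name, reconfigure() needs Python 3.7 or newer, and the fallback assumes the stream is an io.TextIOWrapper. The demonstration uses an in-memory stream so it does not touch the real stdout.

```python
import io

def force_utf8(stream):
    """Return a text stream that encodes its output as UTF-8 (hypothetical helper)."""
    if hasattr(stream, 'reconfigure'):
        # Python 3.7+: change the encoding of the existing wrapper in place.
        stream.reconfigure(encoding='utf-8')
        return stream
    # Older versions: detach the binary buffer and wrap it again.
    return io.TextIOWrapper(stream.detach(), encoding='utf-8',
                            line_buffering=True)

# Demonstration on an in-memory stream that starts out as ASCII.
raw = io.BytesIO()
out = force_utf8(io.TextIOWrapper(raw, encoding='ascii', newline='\n'))
print('Copyright \xa9 2016', file=out)
out.flush()
written = raw.getvalue()
print(written)

# In a real script it would be applied to standard output instead:
#     import sys
#     sys.stdout = force_utf8(sys.stdout)
```

Without changing the code, setting the environment variable PYTHONIOENCODING=utf-8 before starting python3 has a similar effect on stdin, stdout and stderr.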

Skaperen

buran

snippsat

Skaperen

snippsat

Skaperen

snippsat

DeaD_EyE

Skaperen

snippsat