Jun-27-2018, 03:41 AM
i'm reading each line of an open file in a loop. how can i specify that the file is stored as utf-8 and that i want the lines read into Unicode strings? this is in Python3.
Quote:encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
uni_hello = 'hello Χαίρετε добры дзень 여보세요'
with open('uni_hello.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(uni_hello)
with open('uni_hello.txt', encoding="utf-8") as f:
    print(f.read())
Output:hello Χαίρετε добры дзень 여보세요
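Since the question was about reading line by line, here is a minimal sketch of the same idea with a loop. The file name lines_demo.txt and the sample text are made up for illustration; the only part that matters is opening in text mode with encoding='utf-8'.

```python
# Hypothetical file name and sample text, for illustration only.
path = 'lines_demo.txt'
sample = 'hello\nΧαίρετε\nдобры дзень\n여보세요\n'

# Write the sample to disk as UTF-8 bytes.
with open(path, 'w', encoding='utf-8') as f_out:
    f_out.write(sample)

# Text mode plus encoding='utf-8' makes every line a decoded str.
lines = []
with open(path, encoding='utf-8') as f:
    for line in f:
        lines.append(line.rstrip('\n'))

for line in lines:
    # ascii() escapes non-ASCII characters, so this prints on any terminal.
    print(type(line).__name__, len(line), ascii(line))
```

Each item in lines is an ordinary Python 3 str, so no separate "convert to Unicode" step is needed after the read.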
A couple of tips for when all else fails:
with open('some_file', encoding='utf-8', errors='ignore') as f:
with open('some_file', encoding='utf-8', errors='replace') as f:
ftfy fixes Unicode that’s broken in various ways.
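To make the difference between the two error handlers concrete, here is a small sketch. The file name broken_demo.txt and its contents are assumptions chosen so that the file contains one byte that is not valid UTF-8.

```python
# Hypothetical file: 0xE9 is 'é' in Latin-1 but is not valid UTF-8 here.
path = 'broken_demo.txt'
with open(path, 'wb') as f_out:
    f_out.write(b'caf\xe9 ok')

with open(path, encoding='utf-8', errors='ignore') as f:
    ignored = f.read()    # the bad byte is dropped silently

with open(path, encoding='utf-8', errors='replace') as f:
    replaced = f.read()   # the bad byte becomes U+FFFD

try:
    with open(path, encoding='utf-8') as f:   # default is errors='strict'
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ascii(ignored), ascii(replaced), strict_failed)
```

Both handlers lose information, so they are probably best kept as a last resort, as the post says.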
sys.encoding() function that sets a program-wide encoding for everything.
(Jun-27-2018, 10:02 PM) Skaperen wrote: the print() function complains about the "ascii" encoding if given a Unicode character to print on a platform that supports UTF-8, yet has no support for an encoding option to fix it.
Can you give an example?
Quote: i was moving over to Python3 fully, abandoning Python2. i think i need to cancel that plan.
That makes no sense; most people have moved over by now. I moved over fully about 4 years ago, with Python 3.4.
py3 seeintent.py fd12.py
which was printing lots of ASCII stuff for debugging then printed the file it read in and it ended with:
Output:
Traceback (most recent call last):
  File "seeintent.py", line 54, in <module>
    print(line)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 10: ordinal not in range(128)
lt1/pdh /home/pdh 73> lines 53 56 seeintent.py
for line in a:
print(line)
stdout.flush()
lt1/pdh /home/pdh 74> lines 15 18 fd12.py
__license__ = """
Copyright © 2016, by Phil D. Howard - all other rights reserved
lt1/pdh /home/pdh 75>
the character at issue is '©' (code point U+00A9) which should be handled OK if stdout was open()'d with encoding='utf-8'. but it wasn't. and i couldn't
(Jun-28-2018, 02:01 AM) Skaperen wrote: UnicodeEncodeError: 'ascii' codec can't encode character
This shows that your terminal/OS setup is the problem, or maybe you are calling Python 2. I think I mentioned this to you before.
UnicodeEncodeError: 'ascii' in Python 3, but UnicodeEncodeError: 'utf-8' if there is a problem with encoding.
Quote: The default encoding for Python 3 source code is UTF-8.
The default encoding for Python 2 source code is ASCII.
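To see where the message in the traceback comes from, it may help to reproduce it without a terminal. This sketch only encodes the character by hand; it assumes nothing about the actual seeintent.py script beyond the character U+00A9 named in the error.

```python
s = '\xa9'   # '©', U+00A9, the character from the traceback

# UTF-8 can represent it, as two bytes.
utf8_bytes = s.encode('utf-8')

# ASCII cannot; this raises the same kind of error that print() hit.
try:
    s.encode('ascii')
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

print(utf8_bytes)
print(message)
```

So an 'ascii' codec error from print() suggests that the stream being written to was set up with the ASCII codec, which points at the environment rather than at the string itself.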
mint@mint ~ $ python3
Python 3.6.4 (default, Mar 15 2018, 15:35:10)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '\xa9'
>>> s
'©'
>>> print(s)
©
>>> b = b'\xa9'
>>> b
b'\xa9'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte
>>> print(b.decode('latin'))
©
>>>
Test:
python3 -c "import sys; print(sys.stdout.encoding)"
UTF-8
mint@mint ~ $ python3 -c "print('Spicy jalapeño ☂')"
Spicy jalapeño ☂
mint@mint ~ $ python3
Python 3.6.4 (default, Mar 15 2018, 15:35:10)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'
>>> exit()
locale:
mint@mint ~ $ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Read write:
# uni.py
uni = '©'
with open('uni.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(uni)
with open('uni.txt', encoding="utf-8") as f:
    print(f.read())
Output:mint@mint ~ $ python3 uni.py
©
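import chardet

with open('./test_seq.py', 'rb') as fd:  # reading bytes
    for line in fd:
        # detect() can report None as the encoding; fall back to UTF-8
        encoding = chardet.detect(line).get('encoding') or 'utf-8'
        print(line.decode(encoding, errors='replace'), end='')
To fix wrongly encoded strings, you can use ftfy.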
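Guessing an encoding from a single short line can be unreliable. A different, stdlib-only approach (not what the post above uses) is to try a short list of candidate encodings in order. The helper name decode_with_fallback and the candidate list are assumptions made for this sketch.

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp1252')):
    """Hypothetical helper: return (text, encoding) for the first candidate that decodes."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails.
    return data.decode('latin-1'), 'latin-1'


# b'\xc2\xa9' is '©' in UTF-8; b'\xa9' alone is '©' in cp1252 and latin-1.
print(ascii(decode_with_fallback(b'\xc2\xa9')))
print(ascii(decode_with_fallback(b'\xa9 2016')))
```

This never raises, but it can still pick the wrong encoding; it only guarantees that some text comes back.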
chardet can be used to guess the encoding of bytes.
(Jun-28-2018, 08:53 PM) Skaperen wrote: how can i find the encoding that python3 has decided for my terminal? maybe that is the issue.
It should be that sys.stdout.encoding is set correctly:
python3 -c "import sys; print(sys.stdout.encoding)"
UTF-8
Test Unicode and look at locale.
mint@mint ~ $ python3 -c "print('Spicy 富强 ☂')"
Spicy 富强 ☂
mint@mint ~ $ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
After setting the locale to, for example, C, the terminal should no longer be able to handle that line.
mint@mint ~ $ export LC_ALL=C
mint@mint ~ $ export LANG=C
mint@mint ~ $ locale
LANG=C
LANGUAGE=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C
mint@mint ~ $ python3 -c "print('Spicy 富强 ☂')"
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 13-18: surrogates not allowed
mint@mint ~ $ python3 -c "import sys; print(sys.stdout.encoding)"
ANSI_X3.4-1968
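If the locale cannot be changed, one possible workaround for the print() problem is to wrap the binary stream underneath stdout in a text wrapper that always encodes as UTF-8. This is only a sketch: utf8_writer is a hypothetical helper name, and io.BytesIO stands in for sys.stdout.buffer so that the bytes written can be inspected.

```python
import io


def utf8_writer(binary_stream):
    """Hypothetical helper: wrap a binary stream so text is written as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8', line_buffering=True)


# io.BytesIO stands in for sys.stdout.buffer in this demonstration.
fake_stdout_buffer = io.BytesIO()
out = utf8_writer(fake_stdout_buffer)
print('Copyright \xa9 2016', file=out)   # '\xa9' is '©'
out.flush()
written = fake_stdout_buffer.getvalue()
print(written)

# In a real script the same idea would be (an assumption about your setup):
#   sys.stdout = utf8_writer(sys.stdout.buffer)
# Python 3.7+ also offers sys.stdout.reconfigure(encoding='utf-8'), and the
# PYTHONIOENCODING=utf-8 environment variable works without code changes.
```

Forcing UTF-8 this way only helps if the terminal actually displays UTF-8; fixing the locale, as shown in the session above, is usually the cleaner route.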