Python Forum

Right way to write string into UTF-8 file?
Hello,

I need to append a string to a text file that's encoded in UTF-8.

It appears that, by default, Python 3 tries to write in ANSI (Latin-1, ISO 8859-1, cp1252, or whatever the correct name is). As a result, I end up with a file that cannot be correctly displayed, since it uses two encoding methods in the same file.

[Image: screenshot of the resulting file]
(In ANSI, "è" is indeed 0xE8).
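For reference, the same character maps to different bytes in the two encodings, which is why a file mixing them cannot be displayed consistently. A quick illustration:

# 'è' is a single byte in Latin-1/cp1252 but two bytes in UTF-8.
print("è".encode("latin-1"))  # b'\xe8'
print("è".encode("utf-8"))    # b'\xc3\xa8'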

I tried the following but it doesn't work:

file = open("test.latin1.utf8.txt", "w")
file.write("Crème")

stringutf8 = "Crème".encode('utf-8')
print(stringutf8)

# BAD Error: TypeError: write() argument must be str, not bytes
file.write(stringutf8)
file.close()
Any idea how to do this?

Thank you.

---
Edit: Python won't let me open the file in UTF-8 since it detects an ANSI character ("É" = 0xC9) that was wrongly added by another script, and it won't let me replace that faulty string either:

import codecs

f = codecs.open(inputfile, "r", "utf-8")
# Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 327: invalid continuation byte
content = f.read()
f.close()
…
filename = "Crème"
# Error: AttributeError: 'str' object has no attribute 'decode'
filename = filename.decode('utf-8')
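One possible way around the stray 0xC9 byte (a sketch, assuming the rest of the file is valid UTF-8): open the file with an error handler, repair the text, and write it back as clean UTF-8.

# Decode the mostly-UTF-8 file, replacing undecodable bytes with U+FFFD instead of raising.
with open(inputfile, "r", encoding="utf-8", errors="replace") as f:
    content = f.read()

# Assumption: the faulty byte was meant to be 'É'; swap the replacement marker back.
content = content.replace("\ufffd", "É")

with open(inputfile, "w", encoding="utf-8") as f:
    f.write(content)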
Python 3 has full Unicode support and uses UTF-8 as its default source encoding.
Always use UTF-8 for file output and input,
and do not encode/decode anything manually unless necessary; text (Unicode by default) goes in and out of Python 3 cleanly that way.
s = 'Crème and Spicy jalapeño ☂'
with open('unicode.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)
Output on disk.
Output:
Crème and Spicy jalapeño ☂
Read in:
with open('unicode.txt', encoding='utf-8') as f:
    print(f.read())
Output:
Crème and Spicy jalapeño ☂
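Since the original question was about appending, the same pattern works with mode 'a' (a minimal sketch, assuming the existing file is already UTF-8):

# Append to the existing UTF-8 file; the encoding still has to be given explicitly.
with open('unicode.txt', 'a', encoding='utf-8') as f_out:
    f_out.write('\nCrème')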
Thank you.

The code works fine as-is, but for some reason, it messes up the original file when I use this code:

with open ("input.gpx" , 'r' ) as f:
    content = f.read()

content += 'Crème and Spicy jalapeño ☂'
with open(output.gpx", 'w', encoding='utf-8') as f_out:
    f_out.write(content)
[Image: screenshot of the messed-up output file]

If you'd like to give it a quick shot: https://we.tl/t-neH7vye8wd
You don't have encoding='utf-8' when you read the file, as I showed.
Test with your file:
with open('input.gpx') as f:
    print(f.read())
Output:
<?xml version="1.0" encoding="UTF-8"?> <desc>DÃ©part entre 7 et 8h</desc> </gpx>
Fix:
with open('input.gpx', encoding='utf-8') as f:
    print(f.read())
Output:
<?xml version="1.0" encoding="UTF-8"?> <desc>Départ entre 7 et 8h</desc> </gpx>
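For context, open() without encoding= falls back to the locale's preferred encoding (often cp1252 on Windows), which is what mangles the accents. You can check the default like this:

import locale

# The encoding open() uses when no encoding= argument is given.
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Windows systems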
As it's an .xml file, here is a quick test with the BeautifulSoup XML parser:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('input.gpx', encoding='utf-8'), 'xml')
print(soup.find('desc').text)
Output:
Départ entre 7 et 8h
@Winfried Make sure you are using Python 3. One of the major achievements of Python 3 over Python 2 is the correct handling of Unicode data.
Thanks much for the tip on open(…, encoding='utf-8')!

I've also learned that UltraEdit (famous Windows editor) encodes files in Latin1 while PyScripter (IDE) uses UTF-8, so the latter is a much better alternative when working with accented strings.

stuff = "Crème"
with open("cp1252.txt", 'w') as outFile:
	outFile.write(stuff)
with open("utf8.txt", mode='w',encoding='utf-8') as outFile:
	outFile.write(stuff)
Using open() without any additional option meant that I ended up with a mix of Latin1 and UTF-8, which prevented me from using GpsBabel to merge GPX files.
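To see the mix concretely, reading the two files back as raw bytes shows the different byte sequences (illustrative; the cp1252.txt result assumes a Windows locale where open() defaults to cp1252):

# Compare the raw bytes each open() call actually wrote.
with open("cp1252.txt", "rb") as f:
    print(f.read())  # b'Cr\xe8me'     -> 'è' stored as the single byte 0xE8
with open("utf8.txt", "rb") as f:
    print(f.read())  # b'Cr\xc3\xa8me' -> 'è' stored as the two UTF-8 bytes 0xC3 0xA8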

I am using Python 3 (3.7.0).

Thanks again.

--
Edit: UltraEdit seems to have been rewritten to use Unicode instead.
(Aug-29-2018, 05:22 AM)Winfried Wrote: UltraEdit (famous Windows editor)
Never heard of it, so maybe not that famous, but anyway, look at http://forums.ultraedit.com/set-default-...17446.html

EDIT: I see you added that there is a new version with default UTF-8 support.
It's a Windows editor that's been around for about 25 years.

I have yet another encoding issue, this time with geojson :-/

If I use its dump(), UTF-8 data is turned into \u escapes (UTF-16, apparently), e.g. "Fran\u00e7ois" instead of "François":

import geojson
from geojson import dump

with open('input.geojson', encoding='utf-8') as f:
    gj = geojson.load(f)

for track in gj['features']:
    # NO DIFF: with open(track['properties']['name'][0] + '.geojson', 'a+', encoding='utf-8') as f:
    with open(track['properties']['name'][0] + '.geojson', 'a+') as f:
        dump(track, f, indent=2)

        # UnicodeEncodeError: 'charmap' codec can't encode character '\u2194' in position 7: character maps to <undefined>
        # dump(track, f, indent=2, ensure_ascii=False)

        # NOT DEFINED
        # dumps(track, f, indent=2)

        # AttributeError: encode
        # dump(track.encode("utf-8"), f, indent=2, ensure_ascii=False)
As shown, I found and tried several things, all to no avail.

Should I use another method than "dump"?
If there's no way around it, I can live with the accents being human-unreadable, but I wonder why JSON turns them into e.g. \u00e7.
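Presumably the escaping comes from json's ensure_ascii option, which defaults to True; combining ensure_ascii=False with a file opened as UTF-8 should keep the accents readable (a sketch, assuming geojson.dump forwards keyword arguments to json.dump):

import geojson

with open('input.geojson', encoding='utf-8') as f:
    gj = geojson.load(f)

# Open the output as UTF-8 *and* disable ASCII escaping, so 'François'
# is written literally instead of as Fran\u00e7ois.
with open('output.geojson', 'w', encoding='utf-8') as f_out:
    geojson.dump(gj, f_out, indent=2, ensure_ascii=False)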

BTW, I learned that you can't simply dump tracks into a file like I did above: to get a clean file, you must first build a list of features, wrap it in a FeatureCollection, and dump the collection:

import geojson
from geojson import FeatureCollection, dump

with open(INPUTFILE, encoding='utf-8') as f:
    gj = geojson.load(f)

features = []
for track in gj['features']:
    features.append(track)

feature_collection = FeatureCollection(features)
with open('myfile.geojson', 'w') as f:
    dump(feature_collection, f, indent=2)
(Aug-29-2018, 07:06 AM)Winfried Wrote: Should I use another method than "dump"?
Not sure, because I don't know where it goes wrong; I don't have the file you use.
You should not use f as the name in both the read and the dump.

Here's an example where I use a GeoJSON file and put François in it.
import json
from pprint import pprint

with open('map.geojson' , encoding='utf-8') as f:
    data = json.load(f)
    pprint(data)
Output:
{'features': [{'geometry': {'coordinates': [[8.9208984375, 61.05828537037916], [9.84375, 61.4597705702975], [10.7666015625, 60.930432202923335]], 'type': 'François'}, 'properties': {}, 'type': 'Feature'}], 'type': 'FeatureCollection'}
Dump data.
import json

with open("data_file.json", "w", encoding='utf-8') as f_out:
    json.dump(data, f_out, ensure_ascii=False)
Content of data_file.json.
Output:
{ "type": "FeatureCollection", "features": [ { "type": "Feature", "properties": {}, "geometry": { "type": "François", "coordinates": [ [ 8.9208984375, 61.05828537037916 ], [ 9.84375, 61.4597705702975 ], [ 10.7666015625, 60.930432202923335 ] ] } } ] }