Posts: 212
Threads: 94
Joined: Aug 2018
Aug-28-2018, 02:05 PM
(This post was last modified: Aug-28-2018, 02:05 PM by Winfried.)
Hello,
I need to append a string to a text file that's encoded in UTF-8.
It appears that, by default, Python 3 tries to write in ANSI (Latin-1, ISO 8859-1, cp1252, or whatever the correct name is). As a result, I end up with a file that cannot be correctly displayed, since it uses two encoding methods in the same file.
![[Image: A961_E3_BF-_F782-46_E2-839_B-129370_CDF8_B8.png]](https://s22.postimg.cc/x06u6nild/A961_E3_BF-_F782-46_E2-839_B-129370_CDF8_B8.png)
(In ANSI, "è" is indeed 0xE8).
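A quick sketch of the byte-level difference for the "è" example above:

```python
# The same character occupies different bytes in each encoding,
# which is why a file mixing the two displays wrongly.
print("è".encode("latin-1"))   # one byte: 0xE8
print("è".encode("utf-8"))     # two bytes: 0xC3 0xA8
```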
I tried the following but it doesn't work:
file = open("test.latin1.utf8.txt", "w")
file.write("Crème")
stringutf8 = "Crème".encode('utf-8')
print(stringutf8)
#BAD Error: TypeError: write() argument must be str, not bytes
file.write(stringutf8)
file.close()
Any idea how to do this?
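A minimal sketch, assuming the goal is to append text to an existing UTF-8 file (the file name is just an example): open the file with an explicit encoding and pass str to write(), no .encode() needed.

```python
path = "test.latin1.utf8.txt"

with open(path, "w", encoding="utf-8") as f:
    f.write("Crème")        # create the file as UTF-8

with open(path, "a", encoding="utf-8") as f:
    f.write(" brûlée")      # append; write() takes str, not bytes

with open(path, encoding="utf-8") as f:
    content = f.read()
print(content)              # Crème brûlée
```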
Thank you.
---
Edit: Python won't let me open the file in UTF-8, since it detects an ANSI byte ("É" = 0xC9) wrongly added by another script:
f = codecs.open(inputfile, "r", "utf-8")
content = f.read()
f.close()
#Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 327: invalid continuation byte
…
And it won't let me replace that faulty string either:
filename = "Crème"
filename = filename.decode('utf-8')
#Error: AttributeError: 'str' object has no attribute 'decode'
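A sketch of one way around the UnicodeDecodeError, assuming the file is valid UTF-8 except for a stray Latin-1 byte: read with errors='replace' to mask the bad byte, or work on bytes to actually repair it (in Python 3, str has no .decode(); only bytes does).

```python
# Simulate the corrupted file: valid UTF-8 plus a stray Latin-1 0xC9 ("É").
data = "Crème ".encode("utf-8") + b"\xc9"
with open("mixed.txt", "wb") as f:
    f.write(data)

# Option 1: mask the bad byte with the replacement character U+FFFD.
with open("mixed.txt", encoding="utf-8", errors="replace") as f:
    content = f.read()
print(content)   # Crème �

# Option 2: repair at the byte level, then decode.
fixed = data.replace(b"\xc9", "É".encode("utf-8")).decode("utf-8")
print(fixed)     # Crème É
```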
Posts: 7,318
Threads: 123
Joined: Sep 2016
Aug-28-2018, 02:27 PM
(This post was last modified: Aug-28-2018, 02:27 PM by snippsat.)
Python 3 has full Unicode support, and its default source encoding is UTF-8.
Always use UTF-8 for file output and input,
and don't encode or decode anything unless necessary when moving text (Unicode by default) in and out of Python 3.
s = 'Crème and Spicy jalapeño ☂'
with open('unicode.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)
Output on disk:
Output: Crème and Spicy jalapeño ☂
Read in:
with open('unicode.txt', encoding='utf-8') as f:
    print(f.read())
Output: Crème and Spicy jalapeño ☂
Posts: 212
Threads: 94
Joined: Aug 2018
Aug-28-2018, 04:04 PM
(This post was last modified: Aug-28-2018, 04:04 PM by Winfried.)
Thank you.
The code works fine as-is, but for some reason it messes up the original file when I use this code:
with open("input.gpx", 'r') as f:
    content = f.read()
content += 'Crème and Spicy jalapeño ☂'
with open("output.gpx", 'w', encoding='utf-8') as f_out:
    f_out.write(content)
If you'd like to give it a quick shot: https://we.tl/t-neH7vye8wd
Posts: 7,318
Threads: 123
Joined: Sep 2016
Aug-28-2018, 05:31 PM
(This post was last modified: Aug-28-2018, 05:31 PM by snippsat.)
You don't have encoding='utf-8' when you read the file, as I showed.
Test with your file.
with open('input.gpx') as f:
    print(f.read())
Output: <?xml version="1.0" encoding="UTF-8"?>
<desc>Départ entre 7 et 8h</desc>
</gpx>
Fix:
with open('input.gpx', encoding='utf-8') as f:
    print(f.read())
Output: <?xml version="1.0" encoding="UTF-8"?>
<desc>Départ entre 7 et 8h</desc>
</gpx>
As it's an XML file, here's a test with the BeautifulSoup XML parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('input.gpx', encoding='utf-8'), 'xml')
print(soup.find('desc').text)
Output: Départ entre 7 et 8h
Posts: 4,787
Threads: 76
Joined: Jan 2018
@Winfried Make sure you are using Python 3. One of the major achievements of Python 3 over Python 2 is the correct handling of Unicode data.
Posts: 212
Threads: 94
Joined: Aug 2018
Aug-29-2018, 05:22 AM
(This post was last modified: Aug-29-2018, 05:22 AM by Winfried.)
Thanks much for the tip on open(…, encoding='utf-8') !
I've also learned that UltraEdit (famous Windows editor) encodes files in Latin1 while PyScripter (IDE) uses UTF-8, so the latter is a much better alternative when working with accented strings.
stuff = "Crème"
with open("cp1252.txt", 'w') as outFile:
    outFile.write(stuff)
with open("utf8.txt", mode='w', encoding='utf-8') as outFile:
    outFile.write(stuff)
Using open() without any additional option meant that I ended up with a mix of Latin-1 and UTF-8, which prevented me from using GpsBabel to merge GPX files.
I am using Python 3(.7.0).
Thanks again.
--
Edit: UltraEdit seems to have been rewritten to use Unicode instead.
Posts: 8,160
Threads: 160
Joined: Sep 2016
Aug-29-2018, 05:41 AM
(This post was last modified: Aug-29-2018, 05:41 AM by buran.)
(Aug-29-2018, 05:22 AM)Winfried Wrote: UltraEdit (famous Windows editor) Never heard of it, so maybe not that famous; but anyway, look at http://forums.ultraedit.com/set-default-...17446.html
EDIT: I see you added that there is new version with default UTF support
Posts: 212
Threads: 94
Joined: Aug 2018
It's a Windows editor that's been around for about 25 years.
I have yet another encoding issue, this time with geojson :-/
If I use its dump(), UTF-8 data apparently gets escaped, e.g. "Fran\u00e7ois" instead of "François":
with open('input.geojson', encoding='utf-8') as f:
    gj = geojson.load(f)
for track in gj['features']:
    #NO DIFF with open(track['properties']['name'][0] + '.geojson', 'a+', encoding='utf-8') as f:
    with open(track['properties']['name'][0] + '.geojson', 'a+') as f:
        dump(track, f, indent=2)
        #dump(track, f, indent=2, ensure_ascii=False)
        #UnicodeEncodeError: 'charmap' codec can't encode character '\u2194' in position 7: character maps to <undefined>
        #dumps(track, f, indent=2)
        #NOT DEFINED
        #dump(track.encode("utf-8"), f, indent=2, ensure_ascii=False)
        #AttributeError: encode
As shown, I found and tried several things, all to no avail.
Should I use another method than "dump"?
Posts: 212
Threads: 94
Joined: Aug 2018
If there's no way around it, I can live with accents being human-unreadable, but I wonder why JSON turns them into e.g. \u00e7.
BTW, I learned that you can't simply dump tracks into a file like I did above: to get a clean file, you must first build a list of features, wrap it in a FeatureCollection, and dump the collection:
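For what it's worth, the \uXXXX escapes come from json's ensure_ascii option, which defaults to True; a quick sketch:

```python
import json

# By default, json escapes non-ASCII characters to \uXXXX so the output
# is pure ASCII; it is still valid JSON, not UTF-16.
print(json.dumps("François"))                      # "Fran\u00e7ois"
print(json.dumps("François", ensure_ascii=False))  # "François"
```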
with open(INPUTFILE, encoding='utf-8') as f:
    gj = geojson.load(f)
features = []
for track in gj['features']:
    features.append(track)
feature_collection = FeatureCollection(features)
with open('myfile.geojson', 'w') as f:
    dump(feature_collection, f, indent=2)
Posts: 7,318
Threads: 123
Joined: Sep 2016
(Aug-29-2018, 07:06 AM)Winfried Wrote: Should I use another method than "dump"? Not sure, because I don't know where it goes wrong; I don't have the file you use.
You should not reuse the name f for both reading and dumping.
Here's an example where I use a GeoJSON file and put in François.
import json
from pprint import pprint
with open('map.geojson', encoding='utf-8') as f:
    data = json.load(f)
pprint(data)
Output: {'features': [{'geometry': {'coordinates': [[8.9208984375, 61.05828537037916],
[9.84375, 61.4597705702975],
[10.7666015625,
60.930432202923335]],
'type': 'François'},
'properties': {},
'type': 'Feature'}],
'type': 'FeatureCollection'}
Dump data:
import json

with open("data_file.json", "w", encoding='utf-8') as f_out:
    json.dump(data, f_out, ensure_ascii=False, indent=2)
Content of data_file.json:
Output: {
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {},
"geometry": {
"type": "François",
"coordinates": [
[
8.9208984375,
61.05828537037916
],
[
9.84375,
61.4597705702975
],
[
10.7666015625,
60.930432202923335
]
]
}
}
]
}