Posts: 212
Threads: 94
Joined: Aug 2018
Aug-28-2018, 02:05 PM
(This post was last modified: Aug-28-2018, 02:05 PM by Winfried.)
Hello,
I need to append a string to a text file that's encoded in UTF-8.
It appears that, by default, Python 3 tries to write in ANSI (Latin-1, ISO 8859-1, cp1252, or whatever the correct name is). As a result, I end up with a file that cannot be correctly displayed, since it uses two encoding methods in the same file.
![[Image: A961_E3_BF-_F782-46_E2-839_B-129370_CDF8_B8.png]](https://s22.postimg.cc/x06u6nild/A961_E3_BF-_F782-46_E2-839_B-129370_CDF8_B8.png)
(In ANSI, "è" is indeed 0xE8).
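A quick sketch of the byte-level difference for the "è" example above:

```python
# The same character occupies different bytes in each encoding,
# which is why a file mixing the two displays wrongly.
print("è".encode("latin-1"))   # one byte: 0xE8
print("è".encode("utf-8"))     # two bytes: 0xC3 0xA8
```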
I tried the following but it doesn't work:
file = open("test.latin1.utf8.txt", "w")
file.write("Crème")
stringutf8 = "Crème".encode('utf-8')
print(stringutf8)
#BAD Error: TypeError: write() argument must be str, not bytes
file.write(stringutf8)
file.close()
Any idea how to do this?
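A minimal sketch, assuming the goal is to append text to an existing UTF-8 file (the file name is just an example): open the file with an explicit encoding and pass str to write(), no .encode() needed.

```python
path = "test.latin1.utf8.txt"

with open(path, "w", encoding="utf-8") as f:
    f.write("Crème")        # create the file as UTF-8

with open(path, "a", encoding="utf-8") as f:
    f.write(" brûlée")      # append; write() takes str, not bytes

with open(path, encoding="utf-8") as f:
    content = f.read()
print(content)              # Crème brûlée
```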
Thank you.
---
Edit: Python won't let me open the file in UTF-8, since it detects an ANSI byte ("É" = 0xC9) wrongly added by another script:
f = codecs.open(inputfile, "r", "utf-8")
content = f.read()
f.close()
#Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 327: invalid continuation byte
…
And it won't let me replace that faulty string either:
filename = "Crème"
filename = filename.decode('utf-8')
#Error: AttributeError: 'str' object has no attribute 'decode'
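A sketch of one way around the UnicodeDecodeError, assuming the file is valid UTF-8 except for a stray Latin-1 byte: read with errors='replace' to mask the bad byte, or work on bytes to actually repair it (in Python 3, str has no .decode(); only bytes does).

```python
# Simulate the corrupted file: valid UTF-8 plus a stray Latin-1 0xC9 ("É").
data = "Crème ".encode("utf-8") + b"\xc9"
with open("mixed.txt", "wb") as f:
    f.write(data)

# Option 1: mask the bad byte with the replacement character U+FFFD.
with open("mixed.txt", encoding="utf-8", errors="replace") as f:
    content = f.read()
print(content)   # Crème �

# Option 2: repair at the byte level, then decode.
fixed = data.replace(b"\xc9", "É".encode("utf-8")).decode("utf-8")
print(fixed)     # Crème É
```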
Posts: 7,318
Threads: 123
Joined: Sep 2016
Aug-28-2018, 02:27 PM
(This post was last modified: Aug-28-2018, 02:27 PM by snippsat.)
Python 3 has full Unicode support, and its default source encoding is UTF-8.
Always use UTF-8 for file output and input,
and don't encode or decode anything unless necessary when moving text (Unicode by default) in and out of Python 3.
s = 'Crème and Spicy jalapeño ☂'
with open('unicode.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(s)
Output on disk:
Output: Crème and Spicy jalapeño ☂
Read in:
with open('unicode.txt', encoding='utf-8') as f:
    print(f.read())
Output: Crème and Spicy jalapeño ☂
Posts: 212
Threads: 94
Joined: Aug 2018
Aug-28-2018, 04:04 PM
(This post was last modified: Aug-28-2018, 04:04 PM by Winfried.)
Thank you.
The code works fine as-is, but for some reason it messes up the original file when I use this code:
with open("input.gpx", 'r') as f:
    content = f.read()
content += 'Crème and Spicy jalapeño ☂'
with open("output.gpx", 'w', encoding='utf-8') as f_out:
    f_out.write(content)
If you'd like to give it a quick shot: https://we.tl/t-neH7vye8wd
Posts: 7,318
Threads: 123
Joined: Sep 2016
Aug-28-2018, 05:31 PM
(This post was last modified: Aug-28-2018, 05:31 PM by snippsat.)
You don't have encoding='utf-8' when you read the file, as I showed.
Test with your file.
with open('input.gpx') as f:
    print(f.read())
Output: <?xml version="1.0" encoding="UTF-8"?>
<desc>Départ entre 7 et 8h</desc>
</gpx>
Fix:
with open('input.gpx', encoding='utf-8') as f:
    print(f.read())
Output: <?xml version="1.0" encoding="UTF-8"?>
<desc>Départ entre 7 et 8h</desc>
</gpx>
As it's an XML file, here's a test with the BeautifulSoup XML parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('input.gpx', encoding='utf-8'), 'xml')
print(soup.find('desc').text)
Output: Départ entre 7 et 8h
Posts: 4,787
Threads: 76
Joined: Jan 2018
@Winfried Make sure you are using Python 3. One of the major achievements of Python 3 over Python 2 is the correct handling of Unicode data.
Posts: 212
Threads: 94
Joined: Aug 2018
Aug-29-2018, 05:22 AM
(This post was last modified: Aug-29-2018, 05:22 AM by Winfried.)
Thanks much for the tip on open(…, encoding='utf-8') !
I've also learned that UltraEdit (famous Windows editor) encodes files in Latin1 while PyScripter (IDE) uses UTF-8, so the latter is a much better alternative when working with accented strings.
stuff = "Crème"
with open("cp1252.txt", 'w') as outFile:
    outFile.write(stuff)
with open("utf8.txt", mode='w', encoding='utf-8') as outFile:
    outFile.write(stuff)
Using open() without any additional option meant that I ended up with a mix of Latin-1 and UTF-8, which prevented me from using GpsBabel to merge GPX files.
I am using Python 3(.7.0).
Thanks again.
--
Edit: UltraEdit seems to have been rewritten to use Unicode instead.
Posts: 8,160
Threads: 160
Joined: Sep 2016
Aug-29-2018, 05:41 AM
(This post was last modified: Aug-29-2018, 05:41 AM by buran.)
(Aug-29-2018, 05:22 AM)Winfried Wrote: UltraEdit (famous Windows editor) Never heard of it, so maybe not that famous; but anyway, look at http://forums.ultraedit.com/set-default-...17446.html
EDIT: I see you added that there is new version with default UTF support
Posts: 212
Threads: 94
Joined: Aug 2018
It's a Windows editor that's been around for about 25 years.
I have yet another encoding issue, this time with geojson :-/
If I use its dump(), UTF-8 data apparently gets escaped, e.g. "Fran\u00e7ois" instead of "François":
with open('input.geojson', encoding='utf-8') as f:
    gj = geojson.load(f)
for track in gj['features']:
    #NO DIFF with open(track['properties']['name'][0] + '.geojson', 'a+', encoding='utf-8') as f:
    with open(track['properties']['name'][0] + '.geojson', 'a+') as f:
        dump(track, f, indent=2)
        #dump(track, f, indent=2, ensure_ascii=False)
        #UnicodeEncodeError: 'charmap' codec can't encode character '\u2194' in position 7: character maps to <undefined>
        #dumps(track, f, indent=2)
        #NOT DEFINED
        #dump(track.encode("utf-8"), f, indent=2, ensure_ascii=False)
        #AttributeError: encode
As shown, I found and tried several things, all to no avail.
Should I use another method than "dump"?
Posts: 212
Threads: 94
Joined: Aug 2018
If there's no way around it, I can live with accents being human-unreadable, but I wonder why JSON turns them into e.g. \u00e7.
BTW, I learned that you can't simply dump tracks into a file like I did above: to get a clean file, you must first build a list of features, wrap it in a FeatureCollection, and dump the collection:
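For what it's worth, the \uXXXX escapes come from json's ensure_ascii option, which defaults to True; a quick sketch:

```python
import json

# By default, json escapes non-ASCII characters to \uXXXX so the output
# is pure ASCII; it is still valid JSON, not UTF-16.
print(json.dumps("François"))                      # "Fran\u00e7ois"
print(json.dumps("François", ensure_ascii=False))  # "François"
```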
with open(INPUTFILE, encoding='utf-8') as f:
    gj = geojson.load(f)
features = []
for track in gj['features']:
    features.append(track)
feature_collection = FeatureCollection(features)
with open('myfile.geojson', 'w') as f:
    dump(feature_collection, f, indent=2)
Posts: 7,318
Threads: 123
Joined: Sep 2016
(Aug-29-2018, 07:06 AM)Winfried Wrote: Should I use another method than "dump"? Not sure, because I don't know where it goes wrong; I don't have the file you use.
You should not reuse the name f for both reading and dumping.
Here's an example where I use a GeoJSON file and put in François.
import json
from pprint import pprint
with open('map.geojson', encoding='utf-8') as f:
    data = json.load(f)
pprint(data)
Output: {'features': [{'geometry': {'coordinates': [[8.9208984375, 61.05828537037916],
[9.84375, 61.4597705702975],
[10.7666015625,
60.930432202923335]],
'type': 'François'},
'properties': {},
'type': 'Feature'}],
'type': 'FeatureCollection'}
Dump data:
import json

with open("data_file.json", "w", encoding='utf-8') as f_out:
    json.dump(data, f_out, ensure_ascii=False, indent=2)
Content of data_file.json:
Output: {
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {},
"geometry": {
"type": "François",
"coordinates": [
[
8.9208984375,
61.05828537037916
],
[
9.84375,
61.4597705702975
],
[
10.7666015625,
60.930432202923335
]
]
}
}
]
}