Hello,
On Windows, I need to loop through a list of filenames and use them in UTF-8 documents.
Problem is, some might contain non-ASCII characters, and I get garbage (because of cp1252?).
This doesn't work:
import pathlib
PATH = pathlib.Path(item).parent
BASENAME = pathlib.Path(item).stem
#NO CHANGE BASENAME.encode('UTF-8')
print("BASENAME is", BASENAME)
soup = BeautifulSoup(open(item, 'r'), 'xml')
name = soup.select_one("kml > Document > name")
if name:
    name.string = BASENAME
else:
    name = soup.new_tag("name")
    name.string = BASENAME
    doc = soup.select_one("kml > Document")
    doc.insert(0, name)
with open(OUTPUTFILE, "w") as file:
    file.write(soup.prettify(formatter=None))
How can I convert Windows filenames into UTF8?
Thank you.
Keep it UTF-8 all the way, and make sure your editor doesn't mess it up when you save.
You can test files with chardetect:
G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99
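chardetect makes a statistical guess, so a stricter check can be handy alongside it. The sketch below simply tries to decode the raw bytes as UTF-8. The helper name `is_valid_utf8` is made up for illustration and is not part of chardet or bs4. Note that a pure-ASCII file also passes, since ASCII is a subset of UTF-8.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8.

    Hypothetical helper for illustration only.
    """
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")  # strict by default: raises on invalid bytes
    except UnicodeDecodeError:
        return False
    return True
```

Calling `is_valid_utf8("pla.kml")` should return True for a file like the one tested above.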
Example:
from bs4 import BeautifulSoup
"""
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
Μῆνιν ἄειδε
</Placemark>
<Placemark>
異體字字
</Placemark>
</Document>
</kml>
"""
soup = BeautifulSoup(open('pla.kml', encoding='utf-8'), 'xml')
mark = soup.find_all('Placemark')
print(mark)
with open('pla_out.kml', "w", encoding='utf-8') as fp:
    fp.write(soup.prettify(formatter=None))
Output:
[<Placemark>Μῆνιν ἄειδε</Placemark>, <Placemark>異體字字</Placemark>]
pla_out.kml
Output:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
Μῆνιν ἄειδε
</Placemark>
<Placemark>
異體字字
</Placemark>
</Document>
</kml>
Thanks.
Turns out Python on Windows writes files in the locale's default encoding (cp1252 here, which is close to Latin-1) unless told to use another one. It's now displayed OK in an editor.
For some reason, chardet doesn't detect it as UTF-8, though:
C:\Python38-32\Scripts\chardetect.exe output.kml: ISO-8859-1 with confidence 0.683404255319149
from bs4 import BeautifulSoup
import pathlib
import os
…
PATH = pathlib.Path(item).parent
EXTENSION = pathlib.Path(item).suffix
BASENAME = pathlib.Path(item).stem
#Type is <class 'str'>
print("Type is ", type(BASENAME))
OUTPUTFILE = f"{BASENAME}.EDITED{EXTENSION}"
os.chdir(PATH)
soup = BeautifulSoup(open(item, 'r'), 'xml')
name = soup.select_one("kml > Document > name")
if name:
    print("Name found")
    name.string = BASENAME
else:
    print("No name")
    name = soup.new_tag("name")
    name.string = BASENAME
    #get parent, and insert
    doc = soup.select_one("kml > Document")
    doc.insert(0, name)
#IMPORTANT!
with open(OUTPUTFILE, "w", encoding='utf-8') as file:
    file.write(soup.prettify(formatter=None))
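To see why the `encoding='utf-8'` argument on the output file matters, here is a small sketch of what the same text becomes under the two encodings. The sample strings are made up, and cp1252 is only assumed as the Windows default; the real value depends on the machine's locale.

```python
import locale

# What open() falls back to when no encoding is given (often 'cp1252' on
# Western-European Windows setups; the value depends on the machine).
default_encoding = locale.getpreferredencoding(False)

sample = "Café"  # made-up stand-in for a file stem; a str is already Unicode

as_cp1252 = sample.encode("cp1252")  # b'Caf\xe9'     (1 byte for é)
as_utf8 = sample.encode("utf-8")     # b'Caf\xc3\xa9' (2 bytes for é)

# The cp1252 bytes are not valid UTF-8, so a file written that way does
# not match an XML declaration that says encoding="utf-8".
try:
    as_cp1252.decode("utf-8")
    cp1252_is_valid_utf8 = True
except UnicodeDecodeError:
    cp1252_is_valid_utf8 = False

# Text outside cp1252 (Greek, Chinese, ...) cannot be written at all
# with that encoding.
try:
    "異體字".encode("cp1252")
    chinese_fits_cp1252 = True
except UnicodeEncodeError:
    chinese_fits_cp1252 = False
```

Both flags end up False, which is why passing the encoding explicitly is the safer habit.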
(Sep-06-2022, 06:54 PM) Winfried wrote: For some reason, chardet doesn't detect it as UTF8, though:
Check the file you take in, and try to make sure UTF-8 is used as the default.
The file I tested, pla.kml, is my input file from the OS.
Also, where you open the input file, specify the encoding as I showed:
soup = BeautifulSoup(open(item, 'r', encoding='utf-8'), 'xml')
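The reading side can be sketched the same way. The snippet below writes a made-up sample string as UTF-8 to a temporary file, then reads it back twice: once as UTF-8 and once as cp1252 (assumed here as the Windows default). The second read produces the typical garbage characters.

```python
import os
import tempfile

text = "Genève"  # made-up sample containing one non-ASCII character

# Write the text as UTF-8 to a temporary file (demo file only).
fd, tmp_path = tempfile.mkstemp(suffix=".kml")
os.close(fd)
with open(tmp_path, "w", encoding="utf-8") as fp:
    fp.write(text)

# Reading with the matching encoding gives the text back unchanged.
with open(tmp_path, encoding="utf-8") as fp:
    read_ok = fp.read()

# Reading the same bytes as cp1252 turns the two UTF-8 bytes of "è"
# into two separate characters.
with open(tmp_path, encoding="cp1252") as fp:
    read_wrong = fp.read()

os.remove(tmp_path)
```

Here `read_ok` equals the original text, while `read_wrong` comes out as "GenÃ¨ve".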
On Windows you can check this with a simple editor, e.g. Notepad++.
![[Image: yCo00s.png]](https://imagizer.imageshack.com/v2/xq90/922/yCo00s.png)
It's displayed fine in Notepad++.
I can live with chardetect misdetecting the encoding.
(Sep-06-2022, 10:30 PM) Winfried wrote: It's displayed fine in Notepad++.
I can live with chardetect misdetecting the encoding.
Sure, if it works there is no problem.
A tip: in Notepad++ you must remember to save the file, not only display it.
Remember to define the encoding (utf-8) in Python both when you take the file in and when you save it out from Python. If not, Windows can mess it up and guess the wrong encoding.
So in the code I posted in #2, the file coming in from the OS and the file going out after Python are both utf-8.
# In from OS
G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99
# Output from Python
G:\div_code\answer
λ chardetect pla_out.kml
pla_out.kml: utf-8 with confidence 0.99
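Putting the advice together: a file name obtained through pathlib is already a Unicode str, so nothing needs converting; only the read and the write need an explicit encoding. Here is a minimal sketch under those assumptions. To stay self-contained it uses plain string replacement instead of BeautifulSoup, and the function name, the `{name}` placeholder and the sample path are all made up.

```python
from pathlib import Path, PureWindowsPath
import tempfile


def copy_with_name(src, dst, name):
    """Read src as UTF-8, fill in the name, write dst as UTF-8.

    Hypothetical helper: the "{name}" placeholder stands in for the
    BeautifulSoup edit from the thread.
    """
    text = Path(src).read_text(encoding="utf-8")  # in: explicit UTF-8
    text = text.replace("{name}", name)
    Path(dst).write_text(text, encoding="utf-8")  # out: explicit UTF-8


# The stem of a Windows path is a normal str; no .encode() is needed.
stem = PureWindowsPath(r"C:\maps\Genève.kml").stem

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.kml"
    dst = Path(tmp) / "out.kml"
    src.write_text("<name>{name}</name>", encoding="utf-8")
    copy_with_name(src, dst, stem)
    out_bytes = dst.read_bytes()  # raw bytes as they sit on disk
```

The bytes on disk are then valid UTF-8 whatever the Windows locale is, because neither the read nor the write relies on the default encoding.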