Posts: 217
Threads: 96
Joined: Aug 2018
Sep-06-2022, 03:53 PM
(This post was last modified: Sep-06-2022, 07:09 PM by Winfried.)
Hello,
On Windows, I need to loop through a list of filenames and write them into UTF-8 documents.
Problem is, some might contain non-ASCII characters, and I get garbage (because of code page 1252?).
This doesn't work:
import pathlib
PATH=pathlib.Path(item).parent
BASENAME = pathlib.Path(item).stem
#NO CHANGE BASENAME.encode('UTF-8')
print("BASENAME is",BASENAME)
soup = BeautifulSoup(open(item, 'r'), 'xml')
name = soup.select_one("kml > Document > name")
if name:
    name.string = BASENAME
else:
    name = soup.new_tag("name")
    name.string = BASENAME
    doc = soup.select_one("kml > Document")
    doc.insert(0,name)
with open(OUTPUTFILE, "w") as file:
    file.write(soup.prettify(formatter=None))
How can I convert Windows filenames into UTF8?
Thank you.
Posts: 7,324
Threads: 123
Joined: Sep 2016
Sep-06-2022, 05:05 PM
(This post was last modified: Sep-06-2022, 05:05 PM by snippsat.)
Keep it UTF-8 all the way, and make sure your editor doesn't mess it up when you save.
You can test files with chardetect.
G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99
Example:
from bs4 import BeautifulSoup
"""
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
Μῆνιν ἄειδε
</Placemark>
<Placemark>
異體字字
</Placemark>
</Document>
</kml>
"""
soup = BeautifulSoup(open('pla.kml', encoding='utf-8'), 'xml')
mark = soup.find_all('Placemark')
print(mark)
with open('pla_out.kml', "w", encoding='utf-8') as fp:
    fp.write(soup.prettify(formatter=None))
Output:
[<Placemark>Μῆνιν ἄειδε</Placemark>, <Placemark>異體字字</Placemark>]
pla_out.kml
Output:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
Μῆνιν ἄειδε
</Placemark>
<Placemark>
異體字字
</Placemark>
</Document>
</kml>
Posts: 217
Threads: 96
Joined: Aug 2018
Thanks.
Turns out Python on Windows writes files in the locale's default encoding (cp1252 here, which is close to Latin1) unless told to use another one. It's now displayed OK in an editor.
For some reason, chardet doesn't detect it as UTF8, though:
C:\Python38-32\Scripts\chardetect.exe output.kml: ISO-8859-1 with confidence 0.683404255319149
from bs4 import BeautifulSoup
import pathlib
import os
…
PATH=pathlib.Path(item).parent
EXTENSION = pathlib.Path(item).suffix
BASENAME = pathlib.Path(item).stem
#Type is <class 'str'>
print("Type is ", type(BASENAME))
OUTPUTFILE = f"{BASENAME}.EDITED{EXTENSION}"
os.chdir(PATH)
soup = BeautifulSoup(open(item, 'r'), 'xml')
name = soup.select_one("kml > Document > name")
if name:
    print("Name found")
    name.string = BASENAME
else:
    print("No name")
    name = soup.new_tag("name")
    name.string = BASENAME
    #get parent, and insert
    doc = soup.select_one("kml > Document")
    doc.insert(0,name)
#IMPORTANT!
with open(OUTPUTFILE, "w",encoding='utf-8') as file:
    file.write(soup.prettify(formatter=None))
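The garbage described at the start of the thread can be reproduced without any files. This is only a minimal sketch, assuming the text was stored as UTF-8 bytes and then decoded as cp1252; the sample name is made up for illustration and does not come from the thread.

```python
# Hypothetical file stem with accented characters (not from the thread).
stem = "Montée à vélo"

utf8_bytes = stem.encode("utf-8")      # bytes a UTF-8 file would contain
cp1252_bytes = stem.encode("cp1252")   # bytes a cp1252 file would contain

# Decoding UTF-8 bytes with the wrong codec produces the familiar garbage.
garbled = utf8_bytes.decode("cp1252")

print(len(cp1252_bytes), len(utf8_bytes))  # accented letters take 2 bytes in UTF-8
print(repr(garbled))

# The damage is reversible as long as nothing was lost along the way.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == stem)
```

The same mismatch happens in the other direction when a file is written with one encoding and opened with another, which is why the encoding is best stated explicitly on both sides.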
Sep-06-2022, 08:05 PM
(This post was last modified: Sep-06-2022, 08:05 PM by snippsat.)
(Sep-06-2022, 06:54 PM)Winfried Wrote: For some reason, chardet doesn't detect it as UTF8, though:
Try checking the file you read in, and make sure UTF-8 is used as the default.
The file I tested, pla.kml, is my input file from the OS.
Also, on your line 18, specify the encoding as I showed:
soup = BeautifulSoup(open(item, 'r', encoding='utf-8'), 'xml')
On Windows you can check this in a simple editor, e.g. Notepad++.
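To see why the explicit encoding on the read side matters, here is a small stdlib-only sketch. The helper name `read_utf8` and the sample file are assumptions made for illustration; `locale.getpreferredencoding(False)` reports what `open()` normally falls back to when no encoding is given.

```python
import locale
import os
import tempfile

# What open() typically uses when no encoding is passed (often cp1252 on Windows).
print("default:", locale.getpreferredencoding(False))

def read_utf8(path):
    """Read a text file as UTF-8 regardless of the platform default."""
    with open(path, "r", encoding="utf-8") as fp:
        return fp.read()

# Round-trip a small sample through a temporary file.
sample = "<name>Μῆνιν ἄειδε</name>"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.kml")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(sample)
    text = read_utf8(path)

print(text == sample)
```

Without `encoding='utf-8'` on the reading `open()`, the same bytes would be decoded with the platform default, so the result could differ from one machine to the next.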
Posts: 217
Threads: 96
Joined: Aug 2018
It's displayed fine in Notepad++.
I can live with chardetect misdetecting the encoding.
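Since chardetect only reports a statistical guess, a stricter double-check is to see whether the bytes decode as UTF-8 at all. The following is a stdlib sketch, not part of chardet; the function name `is_valid_utf8` is an assumption.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")  # strict by default: raises on invalid sequences
    except UnicodeDecodeError:
        return False
    return True

good = "異體字字".encode("utf-8")
bad = "é".encode("cp1252")  # a lone 0xE9 byte is not valid UTF-8

print(is_valid_utf8(good), is_valid_utf8(bad))
```

For a file on disk, the bytes could be obtained with something like `pathlib.Path("output.kml").read_bytes()`. Note that a pure-ASCII file passes this check too, because ASCII is valid in both UTF-8 and Latin-1, so the check only rules UTF-8 out, never in with certainty.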
Sep-06-2022, 10:47 PM
(This post was last modified: Sep-06-2022, 10:48 PM by snippsat.)
(Sep-06-2022, 10:30 PM)Winfried Wrote: It's displayed fine in Notepad++.
I can live with chardetect misdetecting the encoding.
Sure, if it's working there is no problem.
A tip: in Notepad++ you must remember to save the file, not only display it.
Remember to define the encoding (utf-8) in Python both when you read the file in and when you save it out. If you don't, Windows can mess it up and guess the wrong encoding.
So with the code I posted in #2, both the file in from the OS and the file out from Python are utf-8.
# In from OS
G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99
# Output from Python
G:\div_code\answer
λ chardetect pla_out.kml
pla_out.kml: utf-8 with confidence 0.99
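Putting the advice together (UTF-8 in, UTF-8 out), the filename-to-`<name>` step from the original question can be sketched end to end. This version swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so that it runs without extra packages; the helper names and the sample filename are assumptions, not code from the thread.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

KML_NS = "http://www.opengis.net/kml/2.2"
ET.register_namespace("", KML_NS)  # write the KML namespace without a prefix

def set_document_name(src: Path, dst: Path) -> None:
    """Copy src to dst, setting <Document><name> to the stem of src."""
    tree = ET.parse(str(src))  # the XML parser reads bytes and honours the declaration
    doc = tree.getroot().find(f"{{{KML_NS}}}Document")
    name = doc.find(f"{{{KML_NS}}}name")
    if name is None:
        # No <name> yet: create one and put it first in <Document>.
        name = ET.Element(f"{{{KML_NS}}}name")
        doc.insert(0, name)
    name.text = src.stem  # already a str; no .encode() needed
    tree.write(str(dst), encoding="utf-8", xml_declaration=True)

def demo() -> str:
    """Run the helper on a made-up file and return the output as text."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "Montée à vélo.kml"  # hypothetical filename
        dst = Path(tmp) / "out.kml"
        src.write_text(
            '<?xml version="1.0" encoding="utf-8"?>\n'
            f'<kml xmlns="{KML_NS}"><Document><Placemark/></Document></kml>',
            encoding="utf-8",
        )
        set_document_name(src, dst)
        return dst.read_bytes().decode("utf-8")

result = demo()
print("<name>Montée à vélo</name>" in result)
```

The point of the sketch is that the filename itself never needs converting: `Path.stem` is a Python `str`, and the only places an encoding matters are where bytes are read from or written to disk.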