Python Forum

Hello,

On Windows, I need to loop through a list of filenames and use them in UTF-8 documents.

Problem is, some might contain non-ASCII characters, and I get garbage (because of cp1252?).

This doesn't work:

import pathlib

PATH=pathlib.Path(item).parent

BASENAME = pathlib.Path(item).stem
#NO CHANGE BASENAME.encode('UTF-8')
print("BASENAME is",BASENAME)

soup = BeautifulSoup(open(item, 'r'), 'xml')

name = soup.select_one("kml > Document > name")
if name:
	name.string = BASENAME
else:
	name = soup.new_tag("name")
	name.string = BASENAME
	doc = soup.select_one("kml > Document")
	doc.insert(0,name)

with open(OUTPUTFILE, "w") as file:
    file.write(soup.prettify(formatter=None))

How can I convert Windows filenames into UTF8?

Thank you.

Keep it UTF-8 all the way, and make sure your editor doesn't mess it up when you save.
You can test files with chardetect.

G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99
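
If chardetect is not at hand, a stricter check is to try decoding the bytes as UTF-8. This is only a rough sketch; `is_valid_utf8` is a hypothetical helper name, not something from chardet or BeautifulSoup.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")  # strict by default: raises on invalid bytes
    except UnicodeDecodeError:
        return False
    return True
```

Note that a pure-ASCII file also passes, and passing only means the bytes are valid UTF-8, not that the text in it is what you intended.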

Example:

from bs4 import BeautifulSoup

"""
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
 <Document>
    <Placemark>
     Μῆνιν ἄειδε
    </Placemark>
    <Placemark>
     異體字字
    </Placemark>
 </Document>
</kml>
"""

soup = BeautifulSoup(open('pla.kml', encoding='utf-8'), 'xml')
mark = soup.find_all('Placemark')
print(mark)

with open('pla_out.kml', "w", encoding='utf-8') as fp:
    fp.write(soup.prettify(formatter=None))

Output:
[<Placemark>Μῆνιν ἄειδε</Placemark>, <Placemark>異體字字</Placemark>]

pla_out.kml

Output:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
 <Document>
    <Placemark>
     Μῆνιν ἄειδε
    </Placemark>
    <Placemark>
     異體字字
    </Placemark>
 </Document>
</kml>

Thanks.

Turns out that on Windows, Python writes text files in the locale's default encoding (typically cp1252, not UTF-8) unless told to use another encoding. It's now displayed OK in an editor.
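
For what it's worth, here is a small sketch to see which encoding `open()` falls back to on a given machine, and to confirm that passing `encoding='utf-8'` really produces UTF-8 bytes. The file name `encoding_demo.txt` is made up for the illustration.

```python
import locale
import tempfile
from pathlib import Path

# The encoding open() uses when none is given (locale dependent;
# typically cp1252 on a Western European Windows install).
default_encoding = locale.getpreferredencoding(False)
print("Default text encoding:", default_encoding)

text = "Μῆνιν ἄειδε"

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "encoding_demo.txt"  # hypothetical file name
    # Passing encoding explicitly gives the same bytes on every platform.
    with open(out_path, "w", encoding="utf-8") as fp:
        fp.write(text)
    raw = out_path.read_bytes()

wrote_utf8 = raw == text.encode("utf-8")
print("Wrote UTF-8 bytes:", wrote_utf8)
```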

For some reason, chardet doesn't detect it as UTF8, though:
C:\Python38-32\Scripts\chardetect.exe output.kml: ISO-8859-1 with confidence 0.683404255319149

from bs4 import BeautifulSoup
import pathlib
import os

…
PATH=pathlib.Path(item).parent
EXTENSION = pathlib.Path(item).suffix

BASENAME = pathlib.Path(item).stem

#Type is  <class 'str'>
print("Type is ", type(BASENAME))

OUTPUTFILE = f"{BASENAME}.EDITED{EXTENSION}"

os.chdir(PATH)

soup = BeautifulSoup(open(item, 'r'), 'xml')

name = soup.select_one("kml > Document > name")
if name:
	print("Name found")
	name.string = BASENAME
else:
	print("No name")
	name = soup.new_tag("name")
	name.string = BASENAME
	#get parent, and insert
	doc = soup.select_one("kml > Document")
	doc.insert(0,name)

#IMPORTANT!
with open(OUTPUTFILE, "w",encoding='utf-8') as file:
    file.write(soup.prettify(formatter=None))

(Sep-06-2022, 06:54 PM) Winfried wrote: For some reason, chardet doesn't detect it as UTF8, though:

Check the file you read in and make sure it is really saved as UTF-8.
The pla.kml file I tested with is my input file from the OS.
Also, on your line 18, specify the encoding when you open the file, as I showed:

soup = BeautifulSoup(open(item, 'r', encoding='utf-8'), 'xml')

On Windows you can check and set a file's encoding with a simple editor such as Notepad++.
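
To illustrate why the encoding matters on the reading side too, here is a tiny sketch of what happens when UTF-8 bytes are decoded as cp1252 and then written back out as UTF-8. It uses a single character and only built-in string methods; the variable names are just for this example.

```python
original = "é"

utf8_bytes = original.encode("utf-8")        # b'\xc3\xa9'
misread = utf8_bytes.decode("cp1252")        # 'Ã©' (two characters instead of one)
double_encoded = misread.encode("utf-8")     # b'\xc3\x83\xc2\xa9'

# The damage is reversible here because every byte involved exists in cp1252.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
```

The output file is still valid UTF-8 after such a round trip, but the text in it is garbage, which is why the encoding should be given on `open()` for the input as well.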

It's displayed fine in Notepad++.

I can live with chardetect misdetecting the encoding.

(Sep-06-2022, 10:30 PM) Winfried wrote: It's displayed fine in Notepad++.

I can live with chardetect misdetecting the encoding.

Sure, if it works there is no problem.
A tip: in Notepad++, remember to actually save the file in the right encoding, not only display it.
Remember to specify the encoding (utf-8) in Python both when you read the file in and when you write it out. If you don't, Windows can mess it up by guessing the wrong encoding.
So with the code I posted in #2, the file is UTF-8 both coming in from the OS and going out from Python.

# In from OS
G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99

# Output from Python
G:\div_code\answer
λ chardetect pla_out.kml
pla_out.kml: utf-8 with confidence 0.99
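
The same "explicit encoding in, explicit encoding out" habit can be sketched with the standard library alone. This is not the BeautifulSoup code from the thread; `rewrite_utf8` and its `transform` callback are hypothetical names, and the Windows path is invented for the example.

```python
from pathlib import Path, PureWindowsPath


def rewrite_utf8(src, dst, transform):
    """Read src as UTF-8, apply transform(text, stem), write dst as UTF-8."""
    src = Path(src)
    text = src.read_text(encoding="utf-8")            # explicit on the way in
    new_text = transform(text, src.stem)              # stem is already a str
    Path(dst).write_text(new_text, encoding="utf-8")  # explicit on the way out
    return new_text


# File names are Unicode str objects in Python 3, so no .encode() call is
# needed on the name itself; the encoding only matters when bytes are
# read from or written to disk.
stem = PureWindowsPath(r"C:\maps\Μῆνιν ἄειδε.kml").stem  # made-up path
```

The `transform` argument is where an edit such as setting the KML name from the file stem would go.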

Winfried

snippsat
