Python Forum
[SOLVED] [Windows] Converting filename to UTF8?



[SOLVED] [Windows] Converting filename to UTF8? - Winfried - Sep-06-2022

Hello,

On Windows, I need to loop through a list of filenames and use them in UTF-8 documents.

Problem is, some might contain non-ASCII characters, and I get garbage (because of cp1252?).

This doesn't work:

import pathlib

PATH=pathlib.Path(item).parent

BASENAME = pathlib.Path(item).stem
#NO CHANGE BASENAME.encode('UTF-8')
print("BASENAME is",BASENAME)

soup = BeautifulSoup(open(item, 'r'), 'xml')

name = soup.select_one("kml > Document > name")
if name:
	name.string = BASENAME
else:
	name = soup.new_tag("name")
	name.string = BASENAME
	doc = soup.select_one("kml > Document")
	doc.insert(0,name)

with open(OUTPUTFILE, "w") as file:
    file.write(soup.prettify(formatter=None))
How can I convert Windows filenames into UTF8?

Thank you.



RE: [Windows] Converting filename to UTF8? - snippsat - Sep-06-2022

Keep it utf-8 all the way, and make sure your editor doesn't change the encoding when you save files.
You can test files with chardetect.
G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99
Example:
from bs4 import BeautifulSoup

"""
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
 <Document>
    <Placemark>
     Μῆνιν ἄειδε
    </Placemark>
    <Placemark>
     異體字字
    </Placemark>
 </Document>
</kml>
"""

soup = BeautifulSoup(open('pla.kml', encoding='utf-8'), 'xml')
mark = soup.find_all('Placemark')
print(mark)

with open('pla_out.kml', "w", encoding='utf-8') as fp:
    fp.write(soup.prettify(formatter=None))
Output:
[<Placemark>Μῆνιν ἄειδε</Placemark>, <Placemark>異體字字</Placemark>]
pla_out.kml
Output:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
 <Document>
  <Placemark>
   Μῆνιν ἄειδε
  </Placemark>
  <Placemark>
   異體字字
  </Placemark>
 </Document>
</kml>



RE: [Windows] Converting filename to UTF8? - Winfried - Sep-06-2022

Thanks.

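Turns out Python on Windows writes files in the system's default encoding (a Latin-1-like code page such as cp1252, not UTF-8) unless told to use another encoding. With the encoding specified, the output is now displayed OK in an editor.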
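
A minimal sketch of how to check this, assuming a standard CPython 3 install. The filename stem used here is hypothetical. It prints the encoding that open() falls back to when no encoding= is given, and shows that the same str yields different bytes depending on the encoding chosen at write time:

```python
import locale

# Encoding that open() uses when encoding= is omitted
# (unless Python's UTF-8 mode is enabled).
default_encoding = locale.getpreferredencoding(False)
print("Default text encoding:", default_encoding)

# A str is already Unicode; bytes only exist once it is encoded.
basename = "caf\u00e9"  # hypothetical filename stem, "café"
as_utf8 = basename.encode("utf-8")
as_cp1252 = basename.encode("cp1252")
print(as_utf8)    # b'caf\xc3\xa9'
print(as_cp1252)  # b'caf\xe9'
```

So the filename itself does not need converting. What matters is the encoding passed to open() when the document is read and written.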

For some reason, chardet doesn't detect it as UTF8, though:
C:\Python38-32\Scripts\chardetect.exe output.kml: ISO-8859-1 with confidence 0.683404255319149

from bs4 import BeautifulSoup
import pathlib
import os

…
PATH=pathlib.Path(item).parent
EXTENSION = pathlib.Path(item).suffix

BASENAME = pathlib.Path(item).stem

#Type is  <class 'str'>
print("Type is ", type(BASENAME))

OUTPUTFILE = f"{BASENAME}.EDITED{EXTENSION}"

os.chdir(PATH)

soup = BeautifulSoup(open(item, 'r'), 'xml')

name = soup.select_one("kml > Document > name")
if name:
	print("Name found")
	name.string = BASENAME
else:
	print("No name")
	name = soup.new_tag("name")
	name.string = BASENAME
	#get parent, and insert
	doc = soup.select_one("kml > Document")
	doc.insert(0,name)

#IMPORTANT!
with open(OUTPUTFILE, "w",encoding='utf-8') as file:
    file.write(soup.prettify(formatter=None))


RE: [SOLVED] [Windows] Converting filename to UTF8? - snippsat - Sep-06-2022

(Sep-06-2022, 06:54 PM)Winfried Wrote: For some reason, chardet doesn't detect it as UTF8, though:
Check the file you read in and make sure it uses utf-8.
The file I tested, pla.kml, is my input file from the OS.
Also, on your line 18, specify the encoding as I showed:
soup = BeautifulSoup(open(item, 'r', encoding='utf-8'), 'xml')
On Windows you can check a file's encoding with a simple editor, e.g. Notepad++.


RE: [SOLVED] [Windows] Converting filename to UTF8? - Winfried - Sep-06-2022

It's displayed fine in Notepad++.

I can live with chardetect misdetecting the encoding.
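
For what it's worth, a stricter check than a statistical guess is to decode the file's bytes as UTF-8 and see whether that fails. This is only a sketch. The helper name is_valid_utf8 and the sample files are made up for illustration:

```python
from pathlib import Path
import tempfile


def is_valid_utf8(path):
    """Return True if the file's bytes decode as UTF-8 without errors."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.kml"
    bad = Path(tmp) / "bad.kml"
    # Same text, written once as UTF-8 and once as cp1252.
    good.write_text("<name>caf\u00e9</name>", encoding="utf-8")
    bad.write_text("<name>caf\u00e9</name>", encoding="cp1252")
    good_ok = is_valid_utf8(good)
    bad_ok = is_valid_utf8(bad)
    print(good_ok, bad_ok)
```

Note that a pure-ASCII file passes this check too, since ASCII is a subset of UTF-8.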



RE: [SOLVED] [Windows] Converting filename to UTF8? - snippsat - Sep-06-2022

(Sep-06-2022, 10:30 PM)Winfried Wrote: It's displayed fine in Notepad++.

I can live with chardetect misdetecting the encoding.
Sure, if it's working there is no problem.
One tip: in Notepad++ you must remember to save the file, not only display it.
Remember to define the encoding (utf-8) in Python both when reading the file in and when saving it out; otherwise Windows can mess it up by guessing the wrong encoding.
So with the code I posted in #2, both the file coming in from the OS and the file written out by Python are utf-8.
# In from OS
G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99

# Output from Python
G:\div_code\answer
λ chardetect pla_out.kml
pla_out.kml: utf-8 with confidence 0.99
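
The same round trip can be sketched with only the standard library. This is an illustration under assumptions. The file name and placeholder text are hypothetical, and a plain string replace stands in for the BeautifulSoup edit used earlier in the thread:

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical input file with a non-ASCII name ("café.kml").
    item = Path(tmp) / "caf\u00e9.kml"
    item.write_text("<name>placeholder</name>", encoding="utf-8")

    basename = item.stem  # already a Unicode str, nothing to convert

    # Read with an explicit encoding...
    text = item.read_text(encoding="utf-8")
    edited = text.replace("placeholder", basename)

    # ...and write with an explicit encoding as well.
    output = item.with_name(f"{basename}.EDITED{item.suffix}")
    output.write_text(edited, encoding="utf-8")

    raw = output.read_bytes()
    print(raw)
```

The bytes on disk contain the filename as UTF-8 regardless of the Windows code page, because the encoding is stated on both the read and the write.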