[SOLVED] [Windows] Converting filename to UTF8?
Source: Python Forum (https://python-forum.io), General Coding Help, /thread-38122.html

[SOLVED] [Windows] Converting filename to UTF8? - Winfried - Sep-06-2022

Hello,

On Windows, I need to loop through a list of filenames and insert them into UTF-8 documents. The problem is that some may contain non-ASCII characters, and I get garbage (because of cp1252?).

This doesn't work:

    import pathlib

    PATH = pathlib.Path(item).parent
    BASENAME = pathlib.Path(item).stem

    #NO CHANGE
    BASENAME.encode('UTF-8')
    print("BASENAME is",BASENAME)

    soup = BeautifulSoup(open(item, 'r'), 'xml')
    name = soup.select_one("kml > Document > name")
    if name:
        name.string = BASENAME
    else:
        name = soup.new_tag("name")
        name.string = BASENAME
        doc = soup.select_one("kml > Document")
        doc.insert(0,name)

    with open(OUTPUTFILE, "w") as file:
        file.write(soup.prettify(formatter=None))

How can I convert Windows filenames into UTF-8?

Thank you.

RE: [Windows] Converting filename to UTF8? - snippsat - Sep-06-2022

Keep it UTF-8 all the way, and make sure your editor doesn't mess it up when saving. You can test files with chardetect.

    G:\div_code\answer
    λ chardetect pla.kml
    pla.kml: utf-8 with confidence 0.99

Example:

    from bs4 import BeautifulSoup

    # Contents of the input file pla.kml, shown for reference:
    """
    <?xml version="1.0" encoding="utf-8"?>
    <kml xmlns="http://www.opengis.net/kml/2.2">
      <Document>
        <Placemark>
          Μῆνιν ἄειδε
        </Placemark>
        <Placemark>
          異體字字
        </Placemark>
      </Document>
    </kml>
    """

    # Read with an explicit encoding
    soup = BeautifulSoup(open('pla.kml', encoding='utf-8'), 'xml')
    mark = soup.find_all('Placemark')
    print(mark)

    # Write with an explicit encoding
    with open('pla_out.kml', "w", encoding='utf-8') as fp:
        fp.write(soup.prettify(formatter=None))

pla_out.kml
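
A note on why `BASENAME.encode('UTF-8')` appears to do nothing in the first post: a Python 3 `str` is already Unicode, and `encode()` returns a new `bytes` object rather than changing the string. The encoding only matters at the point where the text is written to disk. The following is a minimal sketch using only the standard library; the filename `Été à Zürich.kml` and the helper `write_name` are made up for illustration and do not come from the thread.

```python
import pathlib
import tempfile

# Hypothetical filename with non-ASCII characters (not from the thread).
item = r"C:\data\Été à Zürich.kml"

# PureWindowsPath parses Windows-style paths on any OS; stem is a Unicode str.
basename = pathlib.PureWindowsPath(item).stem

# encode() returns new bytes objects; basename itself is left untouched.
as_utf8 = basename.encode("utf-8")
as_cp1252 = basename.encode("cp1252")


def write_name(path, text, encoding="utf-8"):
    """Write text to path with an explicit encoding and return the raw bytes."""
    with open(path, "w", encoding=encoding) as fp:
        fp.write(text)
    return pathlib.Path(path).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    raw = write_name(pathlib.Path(tmp) / "out.kml", basename)

print(type(basename).__name__, len(as_utf8), len(as_cp1252))
print(raw == as_utf8)
```

The same string yields different bytes under UTF-8 and cp1252, which is why the output file needs `encoding='utf-8'` on `open()` while the filename string itself needs no conversion.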

RE: [Windows] Converting filename to UTF8? - Winfried - Sep-06-2022

Thanks. Turns out Python outputs as Latin1 unless told to use another encoding.

It's now displayed OK in an editor. For some reason, chardet doesn't detect it as UTF-8, though:

    C:\Python38-32\Scripts\chardetect.exe output.kml: ISO-8859-1 with confidence 0.683404255319149

    from bs4 import BeautifulSoup
    import pathlib
    import os

    …

    PATH = pathlib.Path(item).parent
    EXTENSION = pathlib.Path(item).suffix
    BASENAME = pathlib.Path(item).stem
    #Type is <class 'str'>
    print("Type is ", type(BASENAME))

    OUTPUTFILE = f"{BASENAME}.EDITED{EXTENSION}"

    os.chdir(PATH)

    soup = BeautifulSoup(open(item, 'r'), 'xml')
    name = soup.select_one("kml > Document > name")
    if name:
        print("Name found")
        name.string = BASENAME
    else:
        print("No name")
        name = soup.new_tag("name")
        name.string = BASENAME
        #get parent, and insert
        doc = soup.select_one("kml > Document")
        doc.insert(0,name)

    #IMPORTANT!
    with open(OUTPUTFILE, "w",encoding='utf-8') as file:
        file.write(soup.prettify(formatter=None))

RE: [SOLVED] [Windows] Converting filename to UTF8? - snippsat - Sep-06-2022

(Sep-06-2022, 06:54 PM)Winfried Wrote: For some reason, chardet doesn't detect it as UTF8, though:

Check the file you read in, and make sure UTF-8 is used as the default. The file I tested, pla.kml, is my input file from the OS. Also, on your line 18, specify the encoding as I showed:

    soup = BeautifulSoup(open(item, 'r', encoding='utf-8'), 'xml')

On Windows, a simple editor such as Notepad++ can be used for this.

RE: [SOLVED] [Windows] Converting filename to UTF8? - Winfried - Sep-06-2022

It's displayed fine in Notepad++. I can live with chardetect misdetecting the encoding.

RE: [SOLVED] [Windows] Converting filename to UTF8? - snippsat - Sep-06-2022

(Sep-06-2022, 10:30 PM)Winfried Wrote: It's displayed fine in Notepad++.

Sure, if it's working there is no problem. A tip: in Notepad++ you must remember to save the file, not only display it.
Remember to define the encoding (utf-8) in Python both when reading the file in and when saving it out; if you don't, Windows can mess it up and guess the wrong encoding.

In the code I posted in #2, both the file coming in from the OS and the file written out by Python are utf-8.

    # In from OS
    G:\div_code\answer
    λ chardetect pla.kml
    pla.kml: utf-8 with confidence 0.99

    # Output from Python
    G:\div_code\answer
    λ chardetect pla_out.kml
    pla_out.kml: utf-8 with confidence 0.99
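
Two points from this thread can be checked with the standard library alone. When `open()` is called in text mode without `encoding=`, Python normally falls back to the locale's preferred encoding, which on many Western European Windows installations is cp1252 rather than UTF-8. And since chardet guesses the encoding statistically, a strict decode is a more deterministic way to confirm that a file is valid UTF-8. The sketch below illustrates both under those assumptions; the helper name `is_valid_utf8` and the sample file names are hypothetical.

```python
import locale
import pathlib
import tempfile


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        pathlib.Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The encoding open() normally falls back to when encoding= is omitted.
print("Default text encoding:", locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as tmp:
    good = pathlib.Path(tmp) / "good.kml"
    bad = pathlib.Path(tmp) / "bad.kml"
    # Same text, written once as UTF-8 and once as cp1252.
    good.write_text("<name>Été</name>", encoding="utf-8")
    bad.write_text("<name>Été</name>", encoding="cp1252")
    results = (is_valid_utf8(good), is_valid_utf8(bad))

print(results)
```

A pure-ASCII file also passes this check, since ASCII is a subset of UTF-8, so the test only says the bytes are compatible with UTF-8, not that the file contains non-ASCII text.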