Python Forum
[SOLVED] [Windows] Converting filename to UTF8?
#1
Hello,

On Windows, I need to loop through a list of filenames and use them in UTF-8 documents.

Problem is, some might contain non-ASCII characters, and I get garbage (because of cp1252?).

This doesn't work:

import pathlib
from bs4 import BeautifulSoup

PATH=pathlib.Path(item).parent

BASENAME = pathlib.Path(item).stem
#NO CHANGE BASENAME.encode('UTF-8')
print("BASENAME is",BASENAME)

soup = BeautifulSoup(open(item, 'r'), 'xml')

name = soup.select_one("kml > Document > name")
if name:
	name.string = BASENAME
else:
	name = soup.new_tag("name")
	name.string = BASENAME
	doc = soup.select_one("kml > Document")
	doc.insert(0,name)

with open(OUTPUTFILE, "w") as file:
    file.write(soup.prettify(formatter=None))
How can I convert Windows filenames into UTF8?

Thank you.

   
#2
Keep it UTF-8 all the way, and make sure your editor doesn't change the encoding when you save.
You can check a file's encoding with chardetect.
G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99
Example:
from bs4 import BeautifulSoup

"""
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
 <Document>
    <Placemark>
     Μῆνιν ἄειδε
    </Placemark>
    <Placemark>
     異體字字
    </Placemark>
 </Document>
</kml>
"""

soup = BeautifulSoup(open('pla.kml', encoding='utf-8'), 'xml')
mark = soup.find_all('Placemark')
print(mark)

with open('pla_out.kml', "w", encoding='utf-8') as fp:
    fp.write(soup.prettify(formatter=None))
Output:
[<Placemark>Μῆνιν ἄειδε</Placemark>, <Placemark>異體字字</Placemark>]
pla_out.kml
Output:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
 <Document>
  <Placemark>
   Μῆνιν ἄειδε
  </Placemark>
  <Placemark>
   異體字字
  </Placemark>
 </Document>
</kml>
#3
Thanks.

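Turns out Python on Windows writes text files in the system's default encoding (typically cp1252 on Western European setups, which is close to Latin-1) unless told to use another encoding. The output now displays correctly in an editor.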
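
To illustrate that default, here is a minimal sketch that prints the encoding `open()` falls back to when no `encoding=` is given (on the Python 3.8 used here, and assuming UTF-8 mode is not enabled), then shows how the same text turns into different bytes under cp1252 and UTF-8. The sample string is made up for the example.

```python
import locale

# Encoding that open() uses in text mode when no encoding= is passed.
# On a Western European Windows install this is typically cp1252.
print(locale.getpreferredencoding(False))

sample = "Été"  # hypothetical filename stem with accents

as_cp1252 = sample.encode("cp1252")
as_utf8 = sample.encode("utf-8")

print(as_cp1252)  # one byte per accented character
print(as_utf8)    # two bytes per accented character
```

The `str` itself is the same in both cases; only the bytes written to disk differ, which is why the fix belongs in the `open()` call and not on the filename.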

For some reason, chardet doesn't detect it as UTF8, though:
C:\Python38-32\Scripts\chardetect.exe output.kml: ISO-8859-1 with confidence 0.683404255319149
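
Since chardet only guesses, one way to cross-check is a strict decode: if the bytes do not decode as UTF-8, the file is definitely not UTF-8. The sketch below uses only the standard library; `is_valid_utf8` and the two throwaway files are hypothetical names for the demo, not part of chardet.

```python
import tempfile
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")  # strict error handling by default
        return True
    except UnicodeDecodeError:
        return False


# Demo: the same text saved once as UTF-8 and once as cp1252.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.kml"
    bad = Path(tmp) / "bad.kml"
    good.write_text("<name>Été</name>", encoding="utf-8")
    bad.write_text("<name>Été</name>", encoding="cp1252")
    result_good = is_valid_utf8(good)
    result_bad = is_valid_utf8(bad)

print(result_good, result_bad)
```

Note that a pure-ASCII file also passes this check, and passing does not prove the text was decoded correctly earlier in the pipeline; it only rules out files that cannot be UTF-8.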

from bs4 import BeautifulSoup
import pathlib
import os

…
PATH=pathlib.Path(item).parent
EXTENSION = pathlib.Path(item).suffix

BASENAME = pathlib.Path(item).stem

#Type is  <class 'str'>
print("Type is ", type(BASENAME))

OUTPUTFILE = f"{BASENAME}.EDITED{EXTENSION}"

os.chdir(PATH)

soup = BeautifulSoup(open(item, 'r'), 'xml')

name = soup.select_one("kml > Document > name")
if name:
	print("Name found")
	name.string = BASENAME
else:
	print("No name")
	name = soup.new_tag("name")
	name.string = BASENAME
	#get parent, and insert
	doc = soup.select_one("kml > Document")
	doc.insert(0,name)

#IMPORTANT!
with open(OUTPUTFILE, "w",encoding='utf-8') as file:
    file.write(soup.prettify(formatter=None))
   
#4
(Sep-06-2022, 06:54 PM)Winfried Wrote: For some reason, chardet doesn't detect it as UTF8, though:
Check the file you read in, and make sure it is read as UTF-8.
The file I tested, pla.kml, is my input file as saved on disk.
Also, on line 18 of your code, specify the encoding as I showed:
soup = BeautifulSoup(open(item, 'r', encoding='utf-8'), 'xml') 
On Windows, a simple editor such as Notepad++ can show you the file's encoding.
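
Here is a small sketch of what goes wrong when the encoding is left out on the read side. It assumes the Windows default is cp1252 and passes that explicitly so the effect is the same on any system; the sample text and file name are invented for the demo.

```python
import tempfile
from pathlib import Path

original = "Été"  # hypothetical sample text

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.kml"
    path.write_text(original, encoding="utf-8")

    # Reading UTF-8 bytes as cp1252 garbles every accented character.
    with open(path, "r", encoding="cp1252") as fh:
        wrong = fh.read()

    # Reading with the encoding the file was written in round-trips cleanly.
    with open(path, "r", encoding="utf-8") as fh:
        right = fh.read()

print(repr(wrong))  # mojibake: 'Ã‰tÃ©'
print(repr(right))  # 'Été'
```

Once the text has been misread like this, writing it back out as UTF-8 does not repair it; the garbled characters are simply stored as valid UTF-8.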
#5
It's displayed fine in Notepad++.

I can live with chardetect misdetecting the encoding.

   
#6
(Sep-06-2022, 10:30 PM)Winfried Wrote: It's displayed fine in Notepad++.

I can live with chardetect misdetecting the encoding.
Sure, if it's working there is no problem.
One tip: in Notepad++, you must remember to save the file, not only display it.
Remember to specify the encoding (utf-8) in Python both when reading the file in and when writing it out; otherwise Python on Windows falls back to the system default and can pick the wrong encoding.
With the code I posted in #2, both the input file from disk and the output file written by Python are UTF-8:
# Input file from disk
G:\div_code\answer
λ chardetect pla.kml
pla.kml: utf-8 with confidence 0.99

# Output from Python
G:\div_code\answer
λ chardetect pla_out.kml
pla_out.kml: utf-8 with confidence 0.99
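
Coming back to the original question, a rough end-to-end sketch using only the standard library (no BeautifulSoup) is below. It assumes the goal is simply to place a filename stem inside a UTF-8 document; the input path, the output name `out.kml`, and the one-element document are all hypothetical.

```python
import tempfile
from pathlib import Path
from xml.sax.saxutils import escape

# Hypothetical input path; the file does not need to exist for this demo.
item = Path("Balade à Orléans.kml")
basename = item.stem  # already a Unicode str, no conversion needed

doc = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    f"<name>{escape(basename)}</name>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out.kml"
    # The explicit encoding here is what makes the file UTF-8 on disk.
    with open(out, "w", encoding="utf-8") as fh:
        fh.write(doc)
    raw = out.read_bytes()
    round_trip = out.read_text(encoding="utf-8")

print(round_trip)
```

The filename never needs to be "converted": it is text already, and the encoding only comes into play when that text is written to or read from a file.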