Feb-26-2024, 02:28 PM
Hello,
In a directory, I have a bunch of HTML files that were encoded in cp-1252 (Windows-1252, which is close to but not identical to Latin-1) and that I need to convert to UTF-8.
The following doesn't seem to work: after running the loop once, a second run still reports the files as cp-1252. What's the right way to proceed?
Thank you.
import os
import glob
import chardet
from bs4 import BeautifulSoup
from datetime import datetime

os.chdir(r".\input_test")
files = glob.glob("*.html")
for file in files:
    #detect encoding
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']
    if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
        print("File still not in utf-8",file)
        continue
        print("Converting ",file)
        #get original access and modification times
        atime = os.stat(file).st_atime
        mtime = os.stat(file).st_mtime
        tup = (atime, mtime)
        #convert to utf8
        data = open(file, "r").read()
        data.encode(encoding = 'UTF-8', errors = 'strict')
        with open(file, 'w', encoding='utf-8') as outp:
            outp.write(data)
        #set creation/modification back to original date
        os.utime(file, tup)
    elif encoding == "utf-8":
        #print("File in utf-8", file)
        pass
    else:
        print("Encoding error:", file, encoding)
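
One possible way to approach this, as a minimal sketch rather than a definitive fix: in the loop above, `continue` comes before the conversion steps, so they are never reached, and any file that chardet reports as "ascii" will keep landing in the first branch on every run. The sketch below swaps chardet for a strict UTF-8 decode test, which is a different technique from the one in the post. It assumes that every file which is not valid UTF-8 really is cp-1252. The names `convert_to_utf8` and `convert_directory` are hypothetical helpers made up for illustration.

```python
import os
from pathlib import Path


def convert_to_utf8(path, source_encoding="cp1252"):
    """Re-encode one file to UTF-8 in place; return True if it was rewritten."""
    path = Path(path)
    raw = path.read_bytes()

    # Bytes that already decode as strict UTF-8 (this includes pure ASCII)
    # need no conversion, so a second run leaves them alone.
    try:
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        pass

    # Assumption: anything that is not valid UTF-8 is in source_encoding.
    # Python's cp1252 codec raises UnicodeDecodeError on the few bytes
    # that Windows-1252 leaves undefined (e.g. 0x81), so bad input is not
    # silently rewritten.
    text = raw.decode(source_encoding)

    # Remember the original access/modification times before rewriting.
    st = path.stat()
    # Writing bytes avoids any newline translation by text-mode I/O.
    path.write_bytes(text.encode("utf-8"))
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
    return True


def convert_directory(directory, pattern="*.html"):
    """Convert every matching file; return the names that were rewritten."""
    return [
        p.name
        for p in sorted(Path(directory).glob(pattern))
        if convert_to_utf8(p)
    ]
```

Calling `convert_directory(r".\input_test")` twice should return the converted file names the first time and an empty list the second time. If the HTML files declare their charset in a `<meta>` tag, that declaration is not touched by this sketch and would probably need updating separately.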