Posts: 212
Threads: 94
Joined: Aug 2018
Feb-26-2024, 02:28 PM
(This post was last modified: Feb-26-2024, 06:19 PM by Winfried.)
Hello,
In a directory, I have a bunch of HTML files that were written in cp-1252 (i.e. Latin-1) and that I need to convert to utf-8.
The following doesn't seem to work: after running the loop once, the second run still reports the files as being in cp-1252. What's the right way to proceed?
Thank you.
import os
import glob
import chardet
from bs4 import BeautifulSoup
from datetime import datetime

os.chdir(r".\input_test")
files = glob.glob("*.html")

for file in files:
    #detect encoding
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']

    if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
        print("File still not in utf-8",file)
        continue

        print("Converting ",file)
        #get original access and modification times
        atime = os.stat(file).st_atime
        mtime = os.stat(file).st_mtime
        tup = (atime, mtime)
        #convert to utf8
        data = open(file, "r").read()
        data.encode(encoding = 'UTF-8', errors = 'strict')
        with open(file, 'w', encoding='utf-8') as outp:
            outp.write(data)
        #set creation/modification back to original date
        os.utime(file, tup)

    elif encoding == "utf-8":
        #print("File in utf-8", file)
        pass
    else:
        print("Encoding error:", file, encoding)
Posts: 4,786
Threads: 76
Joined: Jan 2018
Feb-26-2024, 03:10 PM
(This post was last modified: Feb-26-2024, 03:10 PM by Gribouillis.)
You could replace the line
data = open(file, "r").read()
with
data = rawdata.decode(encoding=encoding)
Then, in the with open(...) statement, open the file in binary mode and without an encoding, because data is normally a byte string after the data.encode() call just before it.
« We can solve any problem by introducing an extra level of indirection »
Posts: 212
Threads: 94
Joined: Aug 2018
Feb-26-2024, 03:21 PM
(This post was last modified: Feb-26-2024, 03:21 PM by Winfried.)
Are these the two lines you mean?
#data = open(file, "r").read()
#data.encode(encoding = 'UTF-8', errors = 'strict')
data = rawdata.decode(encoding=encoding)

#with open(file, 'w', encoding='utf-8') as outp:
with open(file, 'wb') as outp:
    outp.write(data) #TypeError: a bytes-like object is required, not 'str' CHECK LATER

It still doesn't work: files that were supposedly converted to utf-8 in the first run are still detected as Windows files:

if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
    print("File still not in utf-8",file)
    continue

Also, the code above adds new carriage returns in the output :-/
<head>
<title>my title</title>
<meta name="description" content="my title">
<meta name="keywords" content="my title">
<meta name="classification" content="windows">
</head>
Posts: 4,786
Threads: 76
Joined: Jan 2018
Feb-26-2024, 03:28 PM
(This post was last modified: Feb-26-2024, 03:29 PM by Gribouillis.)
After decoding, you need to add
data = data.encode(encoding = 'UTF-8', errors = 'strict')
Bytes and unicode strings are different types in Python. Try to understand in detail what the code does.
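To illustrate the difference, here is a minimal round trip (my own sketch, not from the thread; the sample text is made up):

raw = "café".encode("cp1252")        # bytes, as read from the file on disk
text = raw.decode("cp1252")          # decode the bytes into a unicode str
utf8_bytes = text.encode("utf-8")    # re-encode the str as utf-8 bytes for writing
print(type(raw), type(text), type(utf8_bytes))   # <class 'bytes'> <class 'str'> <class 'bytes'>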
« We can solve any problem by introducing an extra level of indirection »
Posts: 212
Threads: 94
Joined: Aug 2018
Using this code, the second run still says the files are not in utf-8. It doesn't look like this is the right way to convert Windows files to utf-8:
for file in files:
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']

    if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
        print("File still not in utf-8",file)
        continue

        print("Converting ",file)
        atime = os.stat(file).st_atime
        mtime = os.stat(file).st_mtime
        tup = (atime, mtime)
        #convert to utf8
        data = rawdata.decode(encoding=encoding)
        data = data.encode(encoding = 'UTF-8', errors = 'strict')
        with open(file, 'wb') as outp:
            outp.write(data)
        #set creation/modification date
        os.utime(file, tup)

    elif encoding == "utf-8":
        #print(encoding)
        #print("File in utf-8", file)
        pass
    else:
        #ISO-8859-1
        #ascii
        print("Encoding error:", file, encoding)
        #exit()
Posts: 4,786
Threads: 76
Joined: Jan 2018
If a unicode string is encoded in utf-8 and written to a file, the file is encoded in utf-8, no matter what chardet detects.
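A quick way to see this (a minimal sketch; the file name is made up):

text = "café"
with open("check.html", "w", encoding="utf-8") as f:
    f.write(text)
with open("check.html", "rb") as f:
    print(f.read())   # b'caf\xc3\xa9' -- the 'é' was written as a two-byte utf-8 sequence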
« We can solve any problem by introducing an extra level of indirection »
Posts: 212
Threads: 94
Joined: Aug 2018
Looks like it.
After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0", while both Notepad++ and Notepad2 say it's utf-8.
Bottom line: chardet doesn't seem reliable for checking how a file is encoded :-/
Thanks for the help.
Posts: 4,786
Threads: 76
Joined: Jan 2018
Feb-26-2024, 06:23 PM
(This post was last modified: Feb-26-2024, 06:24 PM by Gribouillis.)
(Feb-26-2024, 06:19 PM)Winfried Wrote: After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0", while both Notepad++ and Notepad2 say it's utf-8.

If a file contains only ASCII characters, there is no difference at all between the ASCII and the UTF-8 encodings.
>>> s = 'hello world'
>>> s.encode('utf8')
b'hello world'
>>> s.encode('ascii')
b'hello world'
>>>
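For instance, chardet reports purely ASCII bytes as ascii even if the file was written out as utf-8 (a small sketch, not output from the thread; exact confidence values may vary with the chardet version):

import chardet

# An ASCII-only byte string is valid ascii, latin-1 and utf-8 at the same time,
# so chardet has no reason to report "utf-8" here.
print(chardet.detect(b"hello world"))
# typically: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}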
« We can solve any problem by introducing an extra level of indirection »
Posts: 212
Threads: 94
Joined: Aug 2018
Turns out there's a much easier solution: just open the file and feed it to Beautiful Soup, which will take care of 1) converting the data to utf-8 if needed, and 2) adding or editing the relevant meta line in the header.
file = r"c:\temp\input.html"
with open(file, 'r') as f:
content_text = f.read()
soup = BeautifulSoup(content_text, 'html.parser')
print(soup.head)
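To actually save the result, a possible follow-up is to write the parsed document back out in utf-8 (my own sketch; the output path and the use of str(soup) are assumptions, not part of the original post):

# Assumed continuation: persist the parsed document as utf-8.
with open(r"c:\temp\output.html", "w", encoding="utf-8") as out:
    out.write(str(soup))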