Python Forum

Hello,

In a directory, I have a bunch of HTML files that were written in cp1252 (Windows-1252, which is close to but not identical to Latin-1) and that I need to convert to utf-8.

The following doesn't seem to work: After running the loop once, the second run shows files still considered to be in cp-1252. What's the right way to proceed?

Thank you.

import os
import glob
import chardet
from bs4 import BeautifulSoup
from datetime import datetime

os.chdir(r".\input_test")

files = glob.glob("*.html")
for file in files:
  #detect encoding
  rawdata = open(file, "rb").read()
  encoding = chardet.detect(rawdata)['encoding']
  if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
    print("File still not in utf-8",file)
    continue
    
    print("Converting ",file)
    #get original access and modification times
    atime = os.stat(file).st_atime
    mtime = os.stat(file).st_mtime
    tup = (atime, mtime)

    #convert to utf8
    data = open(file, "r").read()
    data.encode(encoding = 'UTF-8', errors = 'strict')
    with open(file, 'w', encoding='utf-8') as outp:
      outp.write(data)
    #set creation/modification back to original date
    os.utime(file, tup)
  elif encoding == "utf-8":
    #print("File in utf-8", file)
    pass
  else:
    print("Encoding error:", file, encoding)

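You could replace line 25 of your script (data = open(file, "r").read()) with

data = rawdata.decode(encoding=encoding)

Then at line 27 (the with open(...) line), open the file in binary mode, without an encoding argument, because data should be a byte string after the encode() on line 26 (provided its result is assigned back to data).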

The two lines you mean?

#data = open(file, "r").read()
#data.encode(encoding = 'UTF-8', errors = 'strict')
data = rawdata.decode(encoding=encoding)
#with open(file, 'w', encoding='utf-8') as outp:
with open(file, 'wb') as outp:
        outp.write(data) #TypeError: a bytes-like object is required, not 'str'

It still doesn't work: files that were supposedly converted to utf-8 in the first run are still detected as Windows-1252 files:

if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
  print("File still not in utf-8",file)
  continue

Also, the code above adds new carriage returns in the output :-/

<head>

<title>my title</title>

<meta name="description" content="my title">

<meta name="keywords" content="my title">

<meta name="classification" content="windows">

</head>
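The extra blank lines are consistent with newline translation, assuming the decoded text was written back through a file opened in text mode ('w'): rawdata.decode() keeps the original \r\n line endings, and on Windows text mode writes every \n as \r\n, so each \r\n becomes \r\r\n. A minimal sketch of that effect follows; newline="\r\n" is used only to reproduce the Windows default on any platform, and demo.html is a made-up file name.

```python
import os
import tempfile

# Text as it comes out of rawdata.decode(): the Windows line endings are still there.
text = "<head>\r\n<title>my title</title>\r\n</head>\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")  # hypothetical file name

    # newline="\r\n" mimics the default text-mode behaviour on Windows:
    # every "\n" is written as "\r\n", so "\r\n" turns into "\r\r\n".
    with open(path, "w", encoding="utf-8", newline="\r\n") as outp:
        outp.write(text)
    with open(path, "rb") as inp:
        doubled = inp.read()

    # newline="" switches the translation off; the text is written unchanged.
    with open(path, "w", encoding="utf-8", newline="") as outp:
        outp.write(text)
    with open(path, "rb") as inp:
        unchanged = inp.read()

print(doubled)
print(unchanged)
```

Writing bytes to a file opened with 'wb' avoids the problem as well, since binary mode never touches line endings.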

After decoding, you need to add

data = data.encode(encoding = 'UTF-8', errors = 'strict')

bytes and unicode strings are different types in Python. Try to understand what the code does in detail.
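The TypeError above comes from handing a str to a file opened in binary mode. As a small illustration of the two types and the two conversions between them (the sample text is made up):

```python
# Bytes as they would be read from a cp1252 file opened with "rb".
# The sample text is made up for illustration.
rawdata = "café à la crème".encode("cp1252")

# decode(): bytes -> str (this is where the source encoding matters)
text = rawdata.decode("cp1252")

# encode(): str -> bytes (this is what a file opened with "wb" expects)
utf8_bytes = text.encode("utf-8", errors="strict")

print(type(rawdata).__name__, len(rawdata))
print(type(text).__name__, len(text))
print(type(utf8_bytes).__name__, len(utf8_bytes))
```

The UTF-8 version is longer than the cp1252 one because each accented character takes two bytes instead of one.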

Using this code, the second run still says files are not in utf-8. It doesn't look like it's the right way to convert Windows files to utf-8:

for file in files:
  rawdata = open(file, "rb").read()
  encoding = chardet.detect(rawdata)['encoding']
  if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
    print("File still not in utf-8",file)
    continue
    
    print("Converting ",file)
    atime = os.stat(file).st_atime
    mtime = os.stat(file).st_mtime
    tup = (atime, mtime)

    #convert to utf8
    data = rawdata.decode(encoding=encoding)
    data = data.encode(encoding = 'UTF-8', errors = 'strict')
    with open(file, 'wb') as outp:
      outp.write(data) 

    #set creation/modification date
    os.utime(file, tup)
  elif encoding == "utf-8":
    #print(encoding)
    #print("File in utf-8", file)
    pass
  else:
    #ISO-8859-1
    #ascii
    print("Encoding error:", file, encoding)
    #exit()

If a unicode string is encoded in utf-8 and written to a file, the file is encoded in utf-8, no matter what chardet detects.
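One way to make the loop safe to run twice without asking chardet at all is to try a strict UTF-8 decode first and convert only when that fails. This is a sketch, not the method used in the thread: convert_to_utf8 and its source_encoding parameter are hypothetical names, and it assumes every file that is not valid UTF-8 really is cp1252. Note also that in the loop as posted, the conversion lines sit after the continue inside the same if block, so they would never run; the sketch drops that continue.

```python
import os
import tempfile


def convert_to_utf8(path, source_encoding="cp1252"):
    """Rewrite path as UTF-8 in place; return True if the file was changed."""
    with open(path, "rb") as inp:
        rawdata = inp.read()

    try:
        # Strict decoding succeeds on pure ASCII and on real UTF-8
        # (nothing to convert then) and fails on most cp1252 text
        # that contains accented characters.
        rawdata.decode("utf-8")
        return False
    except UnicodeDecodeError:
        pass

    # Remember the original access and modification times.
    stat = os.stat(path)

    # bytes -> str with the assumed source encoding, then str -> UTF-8 bytes.
    data = rawdata.decode(source_encoding).encode("utf-8")
    with open(path, "wb") as outp:  # binary mode: no newline translation
        outp.write(data)

    # Put the original timestamps back.
    os.utime(path, (stat.st_atime, stat.st_mtime))
    return True


# Small self-contained check on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.html")
    with open(sample, "wb") as f:
        f.write("<title>déjà vu</title>\r\n".encode("cp1252"))

    first_run = convert_to_utf8(sample)   # converts the file
    second_run = convert_to_utf8(sample)  # finds valid UTF-8, does nothing
    with open(sample, "rb") as f:
        converted = f.read()

print(first_run, second_run)
```

In the original script this would be called once per name returned by glob.glob("*.html").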

Looks like it.

After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0" while both Notepad++ and Notepad2 say it's utf-8.

Bottom line: chardet doesn't seem reliable to check how a file is encoded :-/

Thanks for the help.

(Feb-26-2024, 06:19 PM) Winfried wrote: After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0" while both Notepad++ and Notepad2 say it's utf-8.

If a file contains only ASCII characters, there is no difference at all between the ASCII and the UTF8 encodings.

>>> s = 'hello world'
>>> s.encode('utf8')
b'hello world'
>>> s.encode('ascii')
b'hello world'
>>>
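The encodings only start to differ once a non-ASCII character appears. A short illustration with a made-up sample string:

```python
ascii_only = "hello world"
accented = "hello wörld"  # one non-ASCII character

# Same bytes in all three encodings: a detector cannot tell them apart.
same = (
    ascii_only.encode("ascii")
    == ascii_only.encode("utf-8")
    == ascii_only.encode("cp1252")
)

# With a non-ASCII character the encodings diverge.
as_cp1252 = accented.encode("cp1252")  # ö is the single byte 0xF6
as_utf8 = accented.encode("utf-8")     # ö is the two bytes 0xC3 0xB6

# Reading UTF-8 bytes as if they were cp1252 gives the familiar garbled text.
mojibake = as_utf8.decode("cp1252")

print(same)
print(as_cp1252)
print(as_utf8)
print(mojibake)
```

So a report of "ascii" for a converted file simply means the file contains no characters outside ASCII; it is valid UTF-8 as it stands.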

Turns out there's a much easier solution: just open the file and feed it to Beautiful Soup, which will take care of 1) converting the data to utf-8 if needed and 2) adding or editing the relevant meta line in the header.

file = r"c:\temp\input.html" 
with open(file, 'r') as f:
  content_text = f.read()

soup = BeautifulSoup(content_text, 'html.parser')
print(soup.head)

Posters, in thread order: Winfried, Gribouillis, Winfried, Gribouillis, Winfried, Gribouillis, Winfried, Gribouillis, Winfried