Posts: 212
Threads: 94
Joined: Aug 2018
Feb-26-2024, 02:28 PM
(This post was last modified: Feb-26-2024, 06:19 PM by Winfried.)
Hello,
In a directory, I have a bunch of HTML files that were written in cp-1252 (i.e. Latin-1) and that I need to convert to utf-8.
The following doesn't seem to work: after running the loop once, the second run still reports the files as being in cp-1252. What's the right way to proceed?
Thank you.
import os
import glob
import chardet
from bs4 import BeautifulSoup
from datetime import datetime

os.chdir(r".\input_test")
files = glob.glob("*.html")

for file in files:
    #detect encoding
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']

    if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
        print("File still not in utf-8",file)
        continue

        print("Converting ",file)
        #get original access and modification times
        atime = os.stat(file).st_atime
        mtime = os.stat(file).st_mtime
        tup = (atime, mtime)
        #convert to utf8
        data = open(file, "r").read()
        data.encode(encoding = 'UTF-8', errors = 'strict')
        with open(file, 'w', encoding='utf-8') as outp:
            outp.write(data)
        #set creation/modification back to original date
        os.utime(file, tup)

    elif encoding == "utf-8":
        #print("File in utf-8", file)
        pass
    else:
        print("Encoding error:", file, encoding)
Posts: 4,786
Threads: 76
Joined: Jan 2018
Feb-26-2024, 03:10 PM
(This post was last modified: Feb-26-2024, 03:10 PM by Gribouillis.)
You could replace the line
data = open(file, "r").read()
with
data = rawdata.decode(encoding=encoding)
Then, in the with open(...) statement, open the file in binary mode and without an encoding, because data is normally a byte string after the data.encode() call just before it.
« We can solve any problem by introducing an extra level of indirection »
Posts: 212
Threads: 94
Joined: Aug 2018
Feb-26-2024, 03:21 PM
(This post was last modified: Feb-26-2024, 03:21 PM by Winfried.)
Are these the two lines you mean?
#data = open(file, "r").read()
#data.encode(encoding = 'UTF-8', errors = 'strict')
data = rawdata.decode(encoding=encoding)

#with open(file, 'w', encoding='utf-8') as outp:
with open(file, 'wb') as outp:
    outp.write(data) #TypeError: a bytes-like object is required, not 'str' CHECK LATER

It still doesn't work: files that were supposedly converted to utf-8 in the first run are still detected as Windows files:

if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
    print("File still not in utf-8",file)
    continue

Also, the code above adds new carriage returns in the output :-/
<head>
<title>my title</title>
<meta name="description" content="my title">
<meta name="keywords" content="my title">
<meta name="classification" content="windows">
</head>
Posts: 4,786
Threads: 76
Joined: Jan 2018
Feb-26-2024, 03:28 PM
(This post was last modified: Feb-26-2024, 03:29 PM by Gribouillis.)
After decoding, you need to add
data = data.encode(encoding = 'UTF-8', errors = 'strict')
Bytes and unicode strings are different types in Python. Try to understand in detail what the code does.
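To illustrate the difference, here is a minimal round trip (my own sketch, not from the thread; the sample text is made up):

raw = "café".encode("cp1252")        # bytes, as read from the file on disk
text = raw.decode("cp1252")          # decode the bytes into a unicode str
utf8_bytes = text.encode("utf-8")    # re-encode the str as utf-8 bytes for writing
print(type(raw), type(text), type(utf8_bytes))   # <class 'bytes'> <class 'str'> <class 'bytes'>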
« We can solve any problem by introducing an extra level of indirection »
Posts: 212
Threads: 94
Joined: Aug 2018
Using this code, the second run still says the files are not in utf-8. It doesn't look like this is the right way to convert Windows files to utf-8:
for file in files:
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']

    if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
        print("File still not in utf-8",file)
        continue

        print("Converting ",file)
        atime = os.stat(file).st_atime
        mtime = os.stat(file).st_mtime
        tup = (atime, mtime)
        #convert to utf8
        data = rawdata.decode(encoding=encoding)
        data = data.encode(encoding = 'UTF-8', errors = 'strict')
        with open(file, 'wb') as outp:
            outp.write(data)
        #set creation/modification date
        os.utime(file, tup)

    elif encoding == "utf-8":
        #print(encoding)
        #print("File in utf-8", file)
        pass
    else:
        #ISO-8859-1
        #ascii
        print("Encoding error:", file, encoding)
        #exit()
Posts: 4,786
Threads: 76
Joined: Jan 2018
If a unicode string is encoded in utf-8 and written to a file, the file is encoded in utf-8, no matter what chardet detects.
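A quick way to see this (a minimal sketch; the file name is made up):

text = "café"
with open("check.html", "w", encoding="utf-8") as f:
    f.write(text)
with open("check.html", "rb") as f:
    print(f.read())   # b'caf\xc3\xa9' -- the 'é' was written as a two-byte utf-8 sequence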
« We can solve any problem by introducing an extra level of indirection »
Posts: 212
Threads: 94
Joined: Aug 2018
Looks like it.
After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0", while both Notepad++ and Notepad2 say it's utf-8.
Bottom line: chardet doesn't seem reliable for checking how a file is encoded :-/
Thanks for the help.
Posts: 4,786
Threads: 76
Joined: Jan 2018
Feb-26-2024, 06:23 PM
(This post was last modified: Feb-26-2024, 06:24 PM by Gribouillis.)
(Feb-26-2024, 06:19 PM)Winfried Wrote: After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0", while both Notepad++ and Notepad2 say it's utf-8.

If a file contains only ASCII characters, there is no difference at all between the ASCII and the UTF-8 encodings.
>>> s = 'hello world'
>>> s.encode('utf8')
b'hello world'
>>> s.encode('ascii')
b'hello world'
>>>
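For instance, chardet reports purely ASCII bytes as ascii even if the file was written out as utf-8 (a small sketch, not output from the thread; exact confidence values may vary with the chardet version):

import chardet

# An ASCII-only byte string is valid ascii, latin-1 and utf-8 at the same time,
# so chardet has no reason to report "utf-8" here.
print(chardet.detect(b"hello world"))
# typically: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}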
« We can solve any problem by introducing an extra level of indirection »
Posts: 212
Threads: 94
Joined: Aug 2018
Turns out there's a much easier solution: just open the file and feed it to Beautiful Soup, which will take care of 1) converting the data to utf-8 if needed, and 2) adding or editing the relevant meta line in the header.
file = r"c:\temp\input.html"
with open(file, 'r') as f:
content_text = f.read()
soup = BeautifulSoup(content_text, 'html.parser')
print(soup.head)
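To actually save the result, a possible follow-up is to write the parsed document back out in utf-8 (my own sketch; the output path and the use of str(soup) are assumptions, not part of the original post):

# Assumed continuation: persist the parsed document as utf-8.
with open(r"c:\temp\output.html", "w", encoding="utf-8") as out:
    out.write(str(soup))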