Python Forum
[SOLVED] Correct way to convert file from cp-1252 to utf-8?
#1
Hello,

In a directory, I have a bunch of HTML files that were written in cp-1252 (i.e. Windows-1252, a close relative of Latin-1) that I need to convert to utf-8.

The following doesn't seem to work: After running the loop once, the second run shows files still considered to be in cp-1252. What's the right way to proceed?

Thank you.

import os
import glob
import chardet
from bs4 import BeautifulSoup
from datetime import datetime

os.chdir(r".\input_test")

files = glob.glob("*.html")
for file in files:
  #detect encoding
  rawdata = open(file, "rb").read()
  encoding = chardet.detect(rawdata)['encoding']
  if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
    print("File still not in utf-8",file)
    continue
    
    print("Converting ",file)
    #get original access and modification times
    atime = os.stat(file).st_atime
    mtime = os.stat(file).st_mtime
    tup = (atime, mtime)

    #convert to utf8
    data = open(file, "r").read()
    data.encode(encoding = 'UTF-8', errors = 'strict')
    with open(file, 'w', encoding='utf-8') as outp:
      outp.write(data)
    #set creation/modification back to original date
    os.utime(file, tup)
  elif encoding == "utf-8":
    #print("File in utf-8", file)
    pass
  else:
    print("Encoding error:", file, encoding)
#2
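You could replace the line that re-reads the file in text mode, data = open(file, "r").read(), with
data = rawdata.decode(encoding=encoding)
Then, in the following with open(file, 'w', encoding='utf-8') line, open the file in binary mode ('wb') and drop the encoding argument, because data should be a byte string after the encode() call.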
« We can solve any problem by introducing an extra level of indirection »
#3
The two lines you mean?
#data = open(file, "r").read()
#data.encode(encoding = 'UTF-8', errors = 'strict')
data = rawdata.decode(encoding=encoding)
#with open(file, 'w', encoding='utf-8') as outp:
with open(file, 'wb') as outp:
  outp.write(data) #TypeError: a bytes-like object is required, not 'str'
It still doesn't work: files that were supposedly converted to utf-8 in the first run are still reported as Windows-1252 files:
if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
  print("File still not in utf-8",file)
  continue
Also, the code above adds new carriage returns in the output :-/

<head>

<title>my title</title>

<meta name="description" content="my title">

<meta name="keywords" content="my title">

<meta name="classification" content="windows">

</head>
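A likely cause of the extra carriage returns, assuming the files use Windows (CRLF) line endings: rawdata.decode() keeps the \r\n pairs in the string, and writing that string in text mode on Windows translates each \n to \r\n again, which gives \r\r\n. Below is a minimal sketch of one way around it, passing newline="" so the characters are written through untouched. The sample content and file name are made up for the example.

```python
import os
import tempfile

# Bytes as they might sit on disk: cp1252 text with Windows line endings.
raw = "<title>caf\u00e9</title>\r\n<p>ok</p>\r\n".encode("cp1252")
text = raw.decode("cp1252")  # the \r\n pairs are still present in the str

# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.html")

# newline="" disables newline translation on write, so \n is never
# expanded to \r\n a second time.
with open(path, "w", encoding="utf-8", newline="") as outp:
    outp.write(text)

with open(path, "rb") as f:
    written = f.read()

# Count doubled line endings in the output file.
print(written.count(b"\r\r\n"))
```

Writing the encoded bytes in binary mode ('wb') avoids the translation in the same way.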
#4
After decoding, you need to add
data = data.encode(encoding = 'UTF-8', errors = 'strict')
bytes and unicode strings are different types in Python: decode() turns bytes into a str, and encode() turns a str back into bytes. Try to understand what the code does in detail.
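To make the difference concrete, here is a small sketch of the decode/encode round trip on a made-up byte string (the sample text is only an illustration, not taken from the files in question):

```python
# cp1252 bytes for "café €", as they would come back from open(file, "rb").read().
raw = b"caf\xe9 \x80"

# bytes -> str: interpret the bytes using the source encoding.
text = raw.decode("cp1252")

# str -> bytes: serialize the same characters as UTF-8.
utf8_bytes = text.encode("utf-8", errors="strict")

print(type(raw).__name__, type(text).__name__, type(utf8_bytes).__name__)
print(utf8_bytes)
```

The result of encode() has to be assigned to a name, because str objects are immutable and encode() returns a new bytes object.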
#5
Using this code, the second run still says files are not in utf-8. It doesn't look like it's the right way to convert Windows files to utf-8:

for file in files:
  rawdata = open(file, "rb").read()
  encoding = chardet.detect(rawdata)['encoding']
  if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
    print("File still not in utf-8",file)
    continue
    
    print("Converting ",file)
    atime = os.stat(file).st_atime
    mtime = os.stat(file).st_mtime
    tup = (atime, mtime)

    #convert to utf8
    data = rawdata.decode(encoding=encoding)
    data = data.encode(encoding = 'UTF-8', errors = 'strict')
    with open(file, 'wb') as outp:
      outp.write(data) 

    #set creation/modification date
    os.utime(file, tup)
  elif encoding == "utf-8":
    #print(encoding)
    #print("File in utf-8", file)
    pass
  else:
    #ISO-8859-1
    #ascii
    print("Encoding error:", file, encoding)
    #exit()
#6
If a unicode string is encoded in utf-8 and written to a file, the file is encoded in utf8. No matter what chardet detects.
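Note also that in the loop as posted, the continue right after the "File still not in utf-8" message makes the conversion lines below it unreachable, so nothing is ever rewritten. As a rough sketch that does not rely on chardet at all, one could first try a strict UTF-8 decode and only fall back to cp1252 when that fails. The function name convert_to_utf8 and the cp1252 default are assumptions for illustration.

```python
import os

def convert_to_utf8(path, source_encoding="cp1252"):
    """Rewrite path as UTF-8 in place; return True if the file was changed."""
    with open(path, "rb") as f:
        raw = f.read()

    try:
        # Pure ASCII and valid UTF-8 both pass this check.
        raw.decode("utf-8", errors="strict")
        return False
    except UnicodeDecodeError:
        pass

    # Assumption: anything that is not valid UTF-8 is in source_encoding.
    # This raises UnicodeDecodeError for bytes the codec does not define.
    text = raw.decode(source_encoding)

    # Remember the original access and modification times.
    st = os.stat(path)

    # Binary mode: no newline translation, bytes written exactly as encoded.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Restore the original timestamps.
    os.utime(path, (st.st_atime, st.st_mtime))
    return True
```

Called on every file from glob.glob("*.html"), this should return False for each file on a second run, since the rewritten files then decode cleanly as UTF-8.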
#7
Looks like it.

After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0" while both Notepad++ and Notepad2 say it's utf-8.

Bottom line: chardet doesn't seem reliable to check how a file is encoded :-/

Thanks for the help.
#8
(Feb-26-2024, 06:19 PM)Winfried Wrote: After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0" while both Notepad++ and Notepad2 say it's utf-8.
If a file contains only ASCII characters, there is no difference at all between the ASCII and the UTF8 encodings.
>>> s = 'hello world'
>>> s.encode('utf8')
b'hello world'
>>> s.encode('ascii')
b'hello world'
>>> 
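The encodings only diverge once a non-ASCII character shows up, which is also why a detector has no way to tell ascii from utf-8 for an ASCII-only file. A quick sketch with a made-up string:

```python
s = "caf\u00e9"  # "café"

as_cp1252 = s.encode("cp1252")  # one byte per character
as_utf8 = s.encode("utf-8")     # é takes two bytes in UTF-8

print(as_cp1252)
print(as_utf8)

# ASCII-only text gives identical bytes under ascii, cp1252 and utf-8.
plain = "hello world"
same = plain.encode("ascii") == plain.encode("cp1252") == plain.encode("utf-8")
print(same)
```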
#9
Turns out there's a much easier solution: just open the file and feed it to Beautiful Soup, which takes care of 1) converting the data to utf-8 if needed, and 2) adding or editing the relevant meta line in the header.

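from bs4 import BeautifulSoup

file = r"c:\temp\input.html"
# Name the source encoding explicitly instead of relying on the platform default
with open(file, 'r', encoding='cp1252') as f:
  content_text = f.read()

soup = BeautifulSoup(content_text, 'html.parser')
print(soup.head)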

