Dec-06-2022, 07:09 PM
I encountered an interesting problem: PowerShell under Windows 11 creates new files as UTF-16 rather than UTF-8. PowerShell can be configured to behave otherwise, but if your code is to be used more widely, everyone would have to make that change.
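To see what that looks like from Python's side (a minimal sketch; the file name is made up): a UTF-16 file written this way starts with a byte-order mark (BOM), which you can spot by reading the first two raw bytes.

```python
import codecs

# Simulate a file created by PowerShell redirection: UTF-16 little-endian
# with a byte-order mark (BOM) at the start. The file name is made up.
with open("pwsh_made.txt", "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "hydro 1.0 2.0\n".encode("utf-16-le"))

# Peek at the first two bytes without decoding anything
with open("pwsh_made.txt", "rb") as f:
    bom = f.read(2)

print(bom == codecs.BOM_UTF16_LE)  # True: the file announces itself as UTF-16-LE
```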
I am reading in several files, each with over 30,000 lines of data.
I open the file and then check each line for a keyword. If the keyword exists, I carry out some operation (what this is isn't relevant to this query).
My base code looks like this:

    keyword = "hydro"
    # Create an empty Numpy 2D array of 0 rows and 8 columns, all of the data type float
    file = open(datafile)
    for line in file:
        if keyword.casefold() in line.casefold():

All the solutions I have found read each line inside a try/except statement and, on an error, either convert the line to UTF-8 or re-read it as UTF-16. I do not want to wrap every line read like that; it would add a lot of time.
I could set the encoding to UTF-8 and ignore errors, but then Python does not read the UTF-16 files correctly: the "hydro" keyword is never matched and/or the data comes in wrong.
How would I identify the file's encoding and open accordingly?
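One way to do this (a sketch, not from the thread: `open_text` is a hypothetical helper, and the demo file name is made up) is to sniff the first two bytes for a UTF-16 byte-order mark and choose the codec from that, so the per-line loop itself stays untouched.

```python
import codecs

def open_text(path):
    # Sniff the first two bytes: a UTF-16 byte-order mark means PowerShell
    # (or similar) wrote the file as UTF-16; otherwise assume UTF-8.
    with open(path, "rb") as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return open(path, encoding="utf-16")    # the codec consumes the BOM
    return open(path, encoding="utf-8-sig")     # also strips a UTF-8 BOM, if any

# Demo: a UTF-16 file such as PowerShell redirection produces
datafile = "demo_utf16.txt"
with open(datafile, "w", encoding="utf-16") as f:
    f.write("Hydro 1.0 2.0\nother 3.0 4.0\n")

keyword = "hydro"
matches = []
with open_text(datafile) as file:
    for line in file:
        if keyword.casefold() in line.casefold():
            matches.append(line.strip())

print(matches)  # ['Hydro 1.0 2.0']
```

The BOM check happens once per file, not once per line, so the cost over 30,000 lines is negligible.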