'utf-8' codec can't decode byte 0xda in position 184: invalid continuation byte

karkas · (This post was last modified: Sep-02-2019, 09:45 PM by karkas.)

Hi everyone,

I'm getting this error. I've been searching online, but I can't work out how the explanations apply to my specific case or why this is happening.

This is the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 184: invalid continuation byte

I'm trying to read a text file with the following lines

        
inFile = open(fileName, 'r', encoding="utf8")
fileList = []
for line in inFile:
    fileList.append(line)

What I'm reading is a simple SRT file. I wrote a program that takes an SRT file and fixes the timestamps to eliminate overlaps, because the editor sometimes produces them. That function reads the original file without this problem; when it creates the new, corrected file, it just copies the old file and replaces the timestamp lines with the corrected ones. However, when I try to read the newly generated file to convert it to another format, I get this error. I've been working with functions that convert and manipulate these kinds of files for a while and had never seen this error, only a similar one that I can't remember now, which is why I used encoding="utf8".

I don't really know what "position 184" means, none of the lines is even longer than 33 characters, and line 184 of the file is an empty line with only an EOL character.

I'm thinking it's the new timestamps I'm writing that have this problem, but I have no clue which character it may be. When I look up the character 0xda, I find it's a Ú; however, that character is being read normally in other instances and I'm not even overwriting it.
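For reference, here is a small sketch of what the message seems to be describing. It assumes the file ended up in a single-byte encoding such as cp1252 (an assumption for illustration; the thread has not established the encoding at this point). The "position" in the message is a byte offset into the data being decoded, not a line number or a line length.

```python
# In cp1252 (and Latin-1) the single byte 0xDA is the letter Ú.
single_byte = 'Ú'.encode('cp1252')

# In UTF-8 the same letter is stored as two bytes.
utf8_bytes = 'Ú'.encode('utf-8')

# A lone 0xDA looks like the start of a two-byte UTF-8 sequence, so the
# decoder expects a continuation byte next and complains when it sees 'S'.
sample = '[MÚSICA]'.encode('cp1252')  # cp1252 is assumed for this demo
try:
    sample.decode('utf-8')
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason

print(single_byte, utf8_bytes, error_reason)
```

So the same Ú can read fine from one file and fail from another, depending on which bytes were written to disk.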

If some of you happen to have not seen an SRT file before, it looks like this:

9
00:00:15,377 --> 00:00:18,570
ESTAMOS HACIENDO
UN FASCINANTE EXPERIMENTO.

10
00:00:19,150 --> 00:00:20,280
AÚN LO ESCUCHO.

The lines where I do the replacement are the following:

        
inList[line] = hoursBegin + ':' + minutesBegin + ':' + secondsBegin + ',' + millisecondsBegin + ' --> ' +\
    hoursEnd + ':' + minutesEnd + ':' + secondsEnd + ',' + millisecondsEnd + '\n'

Thanks in advance.

P.S.: Please excuse me if I'm not being very clear about some things; just let me know and I'll clarify. I've been working long hours and I'm kind of stuck.

***snippsat*** · (This post was last modified: Sep-03-2019, 11:04 AM by snippsat.)

Try detecting the file's encoding with chardet.
If I save a test file from your example, it works, as I always save files in utf-8.

        
with open('test.srt', encoding='utf-8f') as f:
    file_list = [i.strip() for i in f]

Output:
>>> file_list
['9',
 '00:00:15,377 --> 00:00:18,570',
 'ESTAMOS HACIENDO',
 'UN FASCINANTE EXPERIMENTO.',
 '',
 '10',
 '00:00:19,150 --> 00:00:20,280',
 'AÚN LO ESCUCHO.']

You can also read it in with utf-8 and errors='ignore' or errors='replace'.

        
with open('test.srt', encoding='utf-8f', errors='replace') as f:
    file_list = [i.strip() for i in f]

Quote:The lines where I do the replacement are the following:

Use string formatting with an f-string; it looks much nicer than all the +.

        
hoursBegin = '12'
minutesBegin = '55'
print(f'{hoursBegin}:{minutesBegin}')

Output:
12:55
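Applied to the timestamp line from the first post, this could look roughly like the sketch below. The function names and the tuple layout are hypothetical, chosen only for illustration. Format specs such as `:02d` zero-pad integers, so the values do not need to be converted to strings first.

```python
def format_timestamp(hours, minutes, seconds, milliseconds):
    """Return an SRT-style timestamp built from four integers."""
    return f'{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}'


def format_cue_line(begin, end):
    """Join two (hours, minutes, seconds, milliseconds) tuples into one timing line."""
    return f'{format_timestamp(*begin)} --> {format_timestamp(*end)}\n'


line = format_cue_line((0, 0, 15, 377), (0, 0, 18, 570))
print(line, end='')  # 00:00:15,377 --> 00:00:18,570
```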

karkas · (This post was last modified: Sep-06-2019, 02:02 PM by karkas.)

Thank you very much for your very complete reply, snippsat.

Quote:Try detecting the file's encoding with chardet.

This has taken me some time and I haven't found a way to figure it out. I tried this:

        
import chardet
rawData=open('1_Consumer Reports 02.srt',"r").read()
rawDataBytes = bytes(rawData, 'utf-8')
chardet.detect(rawDataBytes)

with the result being

Output:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

But then, with other tests, I've realized that chardet.detect is actually detecting the encoding I pass to the bytes function.

How can I pass the raw data in binary format of a read file without "altering" the encoding, which is what this appears to be doing? I got several errors before I made this "work."
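One possible way to do this, sketched with the standard library only: opening the file in 'rb' mode returns the bytes exactly as they are stored, with no decoding or re-encoding step. The helper name and the candidate list are hypothetical, and the demo file is created in cp1252 purely as an assumption for illustration. The same `raw_data` bytes are what `chardet.detect` expects to receive.

```python
import os
import tempfile


def first_matching_encoding(raw_bytes, candidates=('utf-8', 'cp1252')):
    """Return the first candidate encoding that decodes raw_bytes without error."""
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Build a small demo file in cp1252 so the example is self-contained.
with tempfile.NamedTemporaryFile('wb', suffix='.srt', delete=False) as tmp:
    tmp.write('[MÚSICA]\n'.encode('cp1252'))
    demo_path = tmp.name

# 'rb' gives the bytes as stored on disk; nothing is decoded or altered.
with open(demo_path, 'rb') as f:
    raw_data = f.read()
os.remove(demo_path)

guess = first_matching_encoding(raw_data)
print(raw_data, guess)
```

Trying candidates in order is a crude check rather than real detection: a single-byte encoding such as cp1252 accepts almost any input, so it only makes sense as the last candidate.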

Quote:If I save a test file from your example, it works, as I always save files in utf-8.

I tried this and realized that the excerpt I pasted, which you used to run this test, doesn't raise this exception. By the way, the 'f' after the '8' in the encoding raises a different exception by itself; I guess it's just a typo.

        
with open('test.srt', encoding='utf-8f') as f:
    file_list = [i.strip() for i in f]

This excerpt does raise the exact same exception.

1
00:00:00,066 --> 00:00:01,888
HOLA, SOY JACK RICO,

2
00:00:01,888 --> 00:00:04,444
Y ESTO ES "TALLER
DEL CONSUMIDOR".

3
00:00:04,444 --> 00:00:05,530
[MÚSICA]

* The italics were just to try something with the converter I'm writing. I'm pasting the files as is.

Quote:You can also read it in with utf-8 and errors='ignore' or errors='replace'.

I tried this and realized the problem is with the Ú in 'MÚSICA' in the third subtitle. However, doing this alters the file and it won't be correctly converted.

What I still don't understand is why, if I'm not rewriting that character and it's read normally by the first function that uses that file, it gets messed up and then, even if I read with 'utf-8', I still get this error.
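As a side note, the offending byte can be located directly instead of being inferred from the errors='replace' output. The sketch below uses a hypothetical helper; the sample bytes stand in for the contents of a file read in 'rb' mode and are encoded as cp1252 only as an assumption for the demo. The exception's `start` attribute is the same byte offset that the error message reports as "position".

```python
def find_first_bad_byte(raw_bytes, encoding='utf-8'):
    """Return (offset, byte_value, context) for the first undecodable byte, or None."""
    try:
        raw_bytes.decode(encoding)
    except UnicodeDecodeError as exc:
        # A few bytes on either side help to recognise the spot in the file.
        context = raw_bytes[max(0, exc.start - 10):exc.start + 10]
        return exc.start, raw_bytes[exc.start], context
    return None


# Demo data standing in for the bytes of the problematic file.
raw = '3\n00:00:04,444 --> 00:00:05,530\n[MÚSICA]\n'.encode('cp1252')
result = find_first_bad_byte(raw)
print(result)
```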

Quote:Use string formatting with an f-string; it looks much nicer than all the +.

This is great. I've been reading a little about it and, from that and what I can understand from your example, I don't even need the quotes along with the +. I don't even need to convert ints to strings, right?

Thank you very much for your examples and explanations.

***snippsat*** · (This post was last modified: Sep-06-2019, 03:33 PM by snippsat.)

(Sep-06-2019, 02:02 PM)karkas Wrote: This has taken me some time and I haven't found a way to figure it out. I tried this:

Installing chardet also installs chardetect, which can be run from the command line (cmd) or from cmder, which is what I use.

        
C:\code
λ chardetect myfile.txt
myfile.txt: ascii with confidence 1.0

Quote:This excerpt does raise the exact same exception.

It should not if the file is saved as utf-8.

        
# copy of text in post and save as utf-8
λ chardetect test.srt
test.srt: utf-8 with confidence 0.505

Test.

λ python -V
Python 3.7.3
 
# Code used
E:\div_code
λ cat uni_music.py
with open('test.srt', encoding='utf-8') as f:
    file_list = [i.strip() for i in f]
 
# Run code interactively 
E:\div_code
λ ptpython -i uni_music.py
>>> from pprint import pprint
 
>>> pprint(file_list)
['1',
 '00:00:00,066 --> 00:00:01,888',
 'HOLA, SOY <i>JACK RICO,</i>',
 '',
 '2',
 '00:00:01,888 --> 00:00:04,444',
 '<i>Y ESTO ES </i>"TALLER',
 'DEL CONSUMIDOR".',
 '',
 '3',
 '00:00:04,444 --> 00:00:05,530',
 '<i>[MÚSICA]</i>']
 
# Last element, all is correct
>>> file_list[-1]
'<i>[MÚSICA]</i>'

Quote:How can I pass the raw data in binary format of a read file without "altering" the encoding

You use 'rb', but you then still need to decode, or the non-ASCII characters will look like this: M\xc3\x9aSICA.

        
with open('test.srt', 'rb') as f:
    file_list = [i.strip() for i in f]

Output:
>>> file_list
[b'1',
 b'00:00:00,066 --> 00:00:01,888',
 b'HOLA, SOY <i>JACK RICO,</i>',
 b'',
 b'2',
 b'00:00:01,888 --> 00:00:04,444',
 b'<i>Y ESTO ES </i>"TALLER',
 b'DEL CONSUMIDOR".',
 b'',
 b'3',
 b'00:00:04,444 --> 00:00:05,530',
 b'<i>[M\xc3\x9aSICA]</i>']

>>> file_list[-1]
b'<i>[M\xc3\x9aSICA]</i>'
 
>>> file_list[-1].decode() # Same as decode('utf-8') this is default
'<i>[MÚSICA]</i>'

karkas · Sep-07-2019, 02:46 AM

Okay, I found the source of the problem and the solution and, honestly, it was a bit dumb. I just didn't look very well into that.

When I was opening the new file for writing, I had done:

        
newFile = open(fileName, 'w+')

which used some default encoding scheme; I don't know which one. After opening as 'utf-8' for writing, the file works well with the subsequent converter.
For some reason I wrongly (and unconsciously) assumed that after reading the contents with that encoding, they would be written the same way to a new file.

Either way, you have taught me a lot here.

About your examples, they go beyond my understanding and knowledge at the time.

- C:\code would be a path to my code file?

- What does that λ mean?

- What is the difference between your first and second example? Just the same with different encoding depending on how you save the file?

I think I have too many questions about the test code.

Thank you very much, again.

***snippsat*** · (This post was last modified: Sep-07-2019, 11:05 AM by snippsat.)

(Sep-07-2019, 02:46 AM)karkas Wrote: After opening as 'utf-8' for writing, the file works well with the subsequent converter.

Yes, remember: utf-8 in and utf-8 out. Otherwise the OS and local system settings can mess things up and choose the wrong encoding.

        
newFile = open(fileName, 'w+', encoding='utf-8')
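A minimal sketch of why this matters: when `encoding` is omitted, `open()` falls back to a platform-dependent default, which on many Windows setups is a single-byte code page. The demo below passes cp1252 explicitly to imitate that situation (an assumption, since the thread does not say which default was in effect), then repeats the write with utf-8.

```python
import os
import tempfile

text = '[MÚSICA]\n'
path = os.path.join(tempfile.mkdtemp(), 'demo.srt')

# Writing with a single-byte encoding stores Ú as the lone byte 0xDA ...
with open(path, 'w', encoding='cp1252') as f:
    f.write(text)

# ... which a UTF-8 reader rejects.
try:
    with open(path, encoding='utf-8') as f:
        f.read()
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

# Using the same explicit encoding on both sides round-trips cleanly.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with open(path, encoding='utf-8') as f:
    round_trip = f.read()

print(mismatch_failed, round_trip == text)
```

Passing the encoding explicitly on every `open()` call, for reading and for writing, keeps the result independent of the machine's settings.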

Quote:- C:\code would be a path to my code file?

No, this is just where I placed the file; you can choose whatever location you like.
Remember that python and pip should work from any folder, the same on every OS; for Windows, look here.
So a test like this should work no matter which folder you are in.

        
E:\div_code\click
λ python -V
Python 3.7.3
 
E:\div_code\click
λ pip -V
pip 19.2.3 from c:\python37\lib\site-packages\pip (python 3.7)

Quote:- What does that λ mean?

That comes from cmder, which I use; it is a much better shell than cmd/PowerShell.

Quote:- What is the difference between your first and second example? Just the same with different encoding depending on how you save the file?

When I use rb, I read the file in binary mode (no decoding is applied).
The bytes still need to be decoded with an encoding to get Unicode text.

>>> s = b'<i>[M\xc3\x9aSICA]</i>'
>>> s
b'<i>[M\xc3\x9aSICA]</i>'
>>> type(s)
<class 'bytes'>
>>> 
>>> t = s.decode()
>>> t
'<i>[MÚSICA]</i>'
>>> type(t)
<class 'str'>

karkas · Sep-12-2019, 08:51 PM

Quote:No, this is just where I placed the file; you can choose whatever location you like.
Remember that python and pip should work from any folder, the same on every OS; for Windows, look here.
So a test like this should work no matter which folder you are in.

Oh yes, I actually learned this when following your advice in the first reply to use chardet. I had to add the python path to the environment variables and then install pip to be able to install chardet.

Quote:That comes from cmder, which I use; it is a much better shell than cmd/PowerShell.

That would be another thing for me to look at and learn.

Quote:When I use rb, I read the file in binary mode (no decoding is applied).
The bytes still need to be decoded with an encoding to get Unicode text.

I actually meant these two examples:

        
C:\code
λ chardetect myfile.txt
myfile.txt: ascii with confidence 1.0

        
# copy of text in post and save as utf-8
λ chardetect test.srt
test.srt: utf-8 with confidence 0.505

Thank you very much for your help and all your insights. This has been great because it's given me a lot more to study and learn.

Best regards.

newbieAuggie2019 · Sep-12-2019, 11:19 PM

(Sep-02-2019, 09:45 PM)karkas Wrote: 9
00:00:15,377 --> 00:00:18,570
ESTAMOS HACIENDO
UN FASCINANTE EXPERIMENTO.

10
00:00:19,150 --> 00:00:20,280
AÚN LO ESCUCHO.

(Sep-06-2019, 02:02 PM)karkas Wrote: 1
00:00:00,066 --> 00:00:01,888
HOLA, SOY JACK RICO,

2
00:00:01,888 --> 00:00:04,444
Y ESTO ES "TALLER
DEL CONSUMIDOR".

3
00:00:04,444 --> 00:00:05,530
[MÚSICA]

Hi!

I think you are dealing with subtitles, and although it's a matter of personal preference, I would like to comment on something unrelated to your question. It's about formatting (by the way, you also asked about 'f'; that is also related to formatting strings).

Personally (so feel free to disregard my suggestions), I find that UPPERCASE LETTERS in subtitles, as in a chat, read as if somebody is shouting rather than speaking (in fact, when lowercase letters are used, UPPERCASE LETTERS are sometimes used to indicate that somebody is shouting or emphasizing something).

Therefore, I would personally use something like the following instead of what you provided as an example (of course, you can completely ignore my advice):

Quote:9
00:00:15,377 --> 00:00:18,570
Estamos haciendo
un experimento fascinante.

10
00:00:19,150 --> 00:00:20,280
Todavía lo escucho.

Quote:1
00:00:00,066 --> 00:00:01,888
¡Hola! Soy Jack Rico,

2
00:00:01,888 --> 00:00:04,444
y esto es "Taller
del Consumidor".

3
00:00:04,444 --> 00:00:05,530
[Música]

All the best,

karkas · Feb-08-2020, 06:58 PM

Hello, newbieAuggie2019.

Thanks for your reply. I have been away and for some reason didn't get a notification for this message. I understand what you mean by this, it's very common among forums and other interaction environments to avoid uppercase writing. However, this is part of the style of the TV channel and their request, so this is how it has to be done for them. When working with subtitles and captions, it all depends on what the client needs and prefers. In fact, sometimes a client will make a request that doesn't help with the readability—which I don't think is the case here.

When dealing with subtitles, I personally don't consider this as shouting; when you want to convey that, you use a sound effects tag. Probably the client wants to make sure the captions are read by everyone irrespective of how far they are from the TV. I'm not sure, though.
