Python Forum
'utf-8' codec can't decode byte 0xda in position 184: invalid continuation byte
#1
Hi everyone,

I'm getting this error. I've been looking online, but I don't really understand the explanations for my specific case, and I don't know why this could be happening.

This is the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 184: invalid continuation byte

I'm trying to read a text file with the following lines

inFile = open(fileName, 'r', encoding="utf8")
fileList = []
for line in inFile:
    fileList.append(line)
What I'm reading is a simple SRT file. I wrote a program that takes an SRT file and fixes the timestamps to eliminate overlaps, because the editor sometimes produces them. That function works correctly and has no problem reading the file; when it creates the new, corrected file, it just copies the old one and replaces the timestamp lines with the corrected ones. However, when I try to read the newly generated file to convert it to another format, I get this error. I've been working with functions that convert and manipulate this kind of file for a while and had never gotten this error, only a similar one that I can't remember now, which is why I added encoding="utf8".

I don't really know what "position 184" means, none of the lines is even longer than 33 characters, and line 184 of the file is an empty line with only an EOL character.

I think the new timestamps I'm writing cause this problem, but I have no clue which character it may be. When I look up the byte 0xda, I find it's a Ú; however, that character is read normally in other instances and I'm not even overwriting it.

If some of you happen to have not seen an SRT file before, it looks like this:

9
00:00:15,377 --> 00:00:18,570
ESTAMOS HACIENDO
UN FASCINANTE EXPERIMENTO.

10
00:00:19,150 --> 00:00:20,280
AÚN LO ESCUCHO.


The lines where I do the replacement are the following:


inList[line] = hoursBegin + ':' + minutesBegin + ':' +  secondsBegin + ',' + millisecondsBegin + ' --> ' +\
            hoursEnd + ':' + minutesEnd + ':' + secondsEnd + ',' + millisecondsEnd + '\n'
Thanks in advance.

PS: Please excuse me if I'm not being very clear about some things; just let me know and I'll clarify. I've been working long hours and I'm kind of stuck.
#2
Try detecting the file's encoding with chardet.
If I save a test file from your example, it works, as I always save files in UTF-8.
with open('test.srt', encoding='utf-8f') as f:
    file_list = [i.strip() for i in f]
Output:
>>> file_list
['9', '00:00:15,377 --> 00:00:18,570', 'ESTAMOS HACIENDO', 'UN FASCINANTE EXPERIMENTO.', '', '10', '00:00:19,150 --> 00:00:20,280', 'AÚN LO ESCUCHO.']
You can also read it with utf-8 and errors='ignore' or errors='replace'.
with open('test.srt', encoding='utf-8f', errors='replace') as f:
    file_list = [i.strip() for i in f]
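As a rough illustration of what the error message reports, here is a minimal sketch. It assumes the offending text was stored in a single-byte encoding such as Latin-1 or cp1252, where Ú is the single byte 0xda; the sample line and variable names are made up for the example. The "position" in the message is a byte offset into the data being decoded, not a line number.

```python
# Hypothetical sample: one subtitle line as a Latin-1 writer would store it.
raw = "<i>[MÚSICA]</i>\n".encode("latin-1")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is a byte offset into `raw`, not a line or character number.
    bad_offset = exc.start
    bad_byte = raw[bad_offset]
    reason = exc.reason

print(bad_offset, hex(bad_byte), reason)

# errors='replace' swaps the undecodable byte for U+FFFD;
# errors='ignore' silently drops it. Both change the text.
replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")
print(repr(replaced))
print(repr(ignored))
```

Both error handlers let the read succeed, but neither recovers the original Ú, so they are better suited to diagnosis than to producing a converted file.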
Quote:The lines where I do the replacement are the following:
Use f-string formatting; it looks much nicer than all the + signs.
hoursBegin = '12'
minutesBegin = '55'
print(f'{hoursBegin}:{minutesBegin}')
Output:
12:55
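Applied to the timestamp line quoted above, the same idea could look like the sketch below. It assumes the time fields are kept as integers (the original code appears to hold them as strings); `format_timestamp` and `timestamp_line` are hypothetical helper names. Format specs such as `:02d` take care of zero padding, so no `str()` conversion is needed.

```python
def format_timestamp(hours: int, minutes: int, seconds: int, millis: int) -> str:
    """Format one SRT timestamp, e.g. 00:00:15,377."""
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def timestamp_line(begin: tuple, end: tuple) -> str:
    """Build the full 'begin --> end' line, including the trailing newline."""
    return f"{format_timestamp(*begin)} --> {format_timestamp(*end)}\n"


line = timestamp_line((0, 0, 15, 377), (0, 0, 18, 570))
print(line, end="")
```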
#3
Thank you very much for your very complete reply, snippsat.

Quote:Try detecting the file's encoding with chardet.

This has taken me some time and I haven't found a way to figure it out. I tried this:

import chardet
rawData=open('1_Consumer Reports 02.srt',"r").read()
rawDataBytes = bytes(rawData, 'utf-8')
chardet.detect(rawDataBytes)
with the result being

Output:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
But then, with other tests, I've realized that chardet.detect is actually detecting the encoding I pass to the bytes function.

How can I pass the raw data in binary format of a read file without "altering" the encoding, which is what this appears to be doing? I got several errors before I made this "work."



Quote:If I save a test file from your example, it works, as I always save files in UTF-8.

I tried this and realized that the excerpt I pasted, which you used to run this test, doesn't raise this exception. By the way, the 'f' after the '8' in the encoding raises a different exception of its own; I guess it's just a typo.

with open('test.srt', encoding='utf-8f') as f:
    file_list = [i.strip() for i in f]
This excerpt does raise the exact same exception.

1
00:00:00,066 --> 00:00:01,888
HOLA, SOY <i>JACK RICO,</i>

2
00:00:01,888 --> 00:00:04,444
<i>Y ESTO ES </i>"TALLER
DEL CONSUMIDOR".

3
00:00:04,444 --> 00:00:05,530
<i>[MÚSICA]</i>


* The italics were just to try something with the converter I'm writing. I'm pasting the files as is.

Quote:You can also read it with utf-8 and errors='ignore' or errors='replace'.

I tried this and realized the problem is with the Ú in 'MÚSICA' in the third subtitle. However, doing this alters the file and it won't be correctly converted.

What I still don't understand is why, if I'm not rewriting that character and it's read normally by the first function that uses that file, it gets messed up and then, even if I read with 'utf-8', I still get this error.


Quote:Use f-string formatting; it looks much nicer than all the + signs.

This is great. I've been reading a little about it and, from that and what I can understand from your example, I don't even need the quotes along with the +. I don't even need to convert ints to strings, right?

Thank you very much for your examples and explanations.
#4
(Sep-06-2019, 02:02 PM)karkas Wrote: This has taken me some time and I haven't found a way to figure it out. I tried this:
Installing chardet also installs the chardetect command, which you can run from the command line (cmd, or cmder as I use).
C:\code
λ chardetect myfile.txt
myfile.txt: ascii with confidence 1.0
Quote:This excerpt does raise the exact same exception.
It should not if the file is saved as utf-8.
# copy of text in post and save as utf-8
λ chardetect test.srt
test.srt: utf-8 with confidence 0.505
Test.
λ python -V
Python 3.7.3

# Code used
E:\div_code
λ cat uni_music.py
with open('test.srt', encoding='utf-8') as f:
    file_list = [i.strip() for i in f]

# Run code interactively 
E:\div_code
λ ptpython -i uni_music.py
>>> from pprint import pprint

>>> pprint(file_list)
['1',
 '00:00:00,066 --> 00:00:01,888',
 'HOLA, SOY <i>JACK RICO,</i>',
 '',
 '2',
 '00:00:01,888 --> 00:00:04,444',
 '<i>Y ESTO ES </i>"TALLER',
 'DEL CONSUMIDOR".',
 '',
 '3',
 '00:00:04,444 --> 00:00:05,530',
 '<i>[MÚSICA]</i>']

# Last element,all is correct
>>> file_list[-1]
'<i>[MÚSICA]</i>'
Quote:How can I pass the raw data in binary format of a read file without "altering" the encoding
You use 'rb', but then you still need to decode, or the non-ASCII characters will look like this: M\xc3\x9aSICA.
with open('test.srt', 'rb') as f:
    file_list = [i.strip() for i in f]
Output:
>>> file_list
[b'1', b'00:00:00,066 --> 00:00:01,888', b'HOLA, SOY <i>JACK RICO,</i>', b'', b'2', b'00:00:01,888 --> 00:00:04,444', b'<i>Y ESTO ES </i>"TALLER', b'DEL CONSUMIDOR".', b'', b'3', b'00:00:04,444 --> 00:00:05,530', b'<i>[M\xc3\x9aSICA]</i>']
>>> file_list[-1]
b'<i>[M\xc3\x9aSICA]</i>'

>>> file_list[-1].decode() # Same as decode('utf-8') this is default
'<i>[MÚSICA]</i>'
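Building on the 'rb' example, one way to check a file's encoding without altering it is to read the raw bytes and try candidate codecs in order. This is only a sketch using the standard library: `guess_encoding`, the candidate list, and the sample file are assumptions for illustration. A detector such as chardet makes a statistical guess instead, and its `detect` function likewise expects bytes read in binary mode rather than a str that has been re-encoded.

```python
import os
import tempfile


def guess_encoding(raw: bytes, candidates=("utf-8", "cp1252")):
    """Return the first candidate codec that decodes `raw` without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Write a small file in cp1252 to stand in for the problematic SRT file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.srt")
    with open(path, "w", encoding="cp1252") as f:
        f.write("3\n00:00:04,444 --> 00:00:05,530\n<i>[MÚSICA]</i>\n")
    # Binary mode returns the bytes exactly as stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

guessed = guess_encoding(raw)
print(guessed)
```

The order of the candidates matters: UTF-8 is strict and rejects most non-UTF-8 data, so it goes first, while single-byte codecs accept almost any byte sequence and only make sense as fallbacks.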
#5
Okay, I found the source of the problem and the solution and, honestly, it was a bit dumb. I just didn't look very well into that.

When I was opening the new file for writing, I had done:

newFile = open(fileName, 'w+')
which used whatever the default encoding was, since I hadn't specified one. After opening as 'utf-8' for writing, the file works well with the subsequent converter.
For some reason I wrongly (and quite unconsciously) assumed that after reading the contents with that encoding, they would be written the same way to a new file.

Either way, you have taught me a lot here.

About your examples, they go beyond my current understanding and knowledge.

- C:\code would be a path to my code file?

- What does that λ mean?

- What is the difference between your first and second example? Just the same with different encoding depending on how you save the file?

I think I have too many questions about the test code.


Thank you very much, again.
#6
(Sep-07-2019, 02:46 AM)karkas Wrote: After opening as 'utf-8' for writing, the file works well with the subsequent converter.
Yes, remember UTF-8 in and out; otherwise the OS and locale settings can interfere and choose the wrong encoding.
newFile = open(fileName, 'w+', encoding='utf-8')
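A small sketch of why the explicit argument matters. It assumes the original writer ended up with a single-byte default such as cp1252 (the thread never states which encoding was actually used), and the file names are made up. In Python 3.7, as used in this thread, `open()` without `encoding=` falls back to a platform-dependent default, which `locale.getpreferredencoding(False)` reports.

```python
import locale
import os
import tempfile

text = "3\n00:00:04,444 --> 00:00:05,530\n<i>[MÚSICA]</i>\n"

# What open() falls back to when no encoding is given (platform-dependent).
print(locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.srt")
    utf8_path = os.path.join(tmp, "utf8.srt")

    # Simulate a writer whose default was cp1252 (an assumption for this demo).
    with open(legacy_path, "w", encoding="cp1252") as f:
        f.write(text)
    # Write the same text with an explicit UTF-8 encoding.
    with open(utf8_path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading the cp1252 file as UTF-8 fails on the 0xda byte.
    try:
        with open(legacy_path, encoding="utf-8") as f:
            f.read()
        legacy_ok = True
    except UnicodeDecodeError:
        legacy_ok = False

    # The UTF-8 file round-trips unchanged.
    with open(utf8_path, encoding="utf-8") as f:
        round_trip = f.read()

print(legacy_ok, round_trip == text)
```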
Quote:- C:\code would be a path to my code file?
No, this is just where I placed the file; you can choose whatever place you like.
Remember that python and pip should work from any folder, the same on every OS; for Windows look here
So a test like this should work no matter which folder you are in.
E:\div_code\click
λ python -V
Python 3.7.3

E:\div_code\click
λ pip -V
pip 19.2.3 from c:\python37\lib\site-packages\pip (python 3.7)

Quote:- What does that λ mean?
That is from cmder, which I use, a much better shell than cmd/PowerShell.

Quote:- What is the difference between your first and second example? Just the same with different encoding depending on how you save the file?
When I use rb, I read the file in binary (no encoding).
To get Unicode text you still need to decode the bytes with an encoding.
>>> s = b'<i>[M\xc3\x9aSICA]</i>'
>>> s
b'<i>[M\xc3\x9aSICA]</i>'
>>> type(s)
<class 'bytes'>
>>> 
>>> t = s.decode()
>>> t
'<i>[MÚSICA]</i>'
>>> type(t)
<class 'str'>
#7
Quote:No, this is just where I placed the file; you can choose whatever place you like.
Remember that python and pip should work from any folder, the same on every OS; for Windows look here
So a test like this should work no matter which folder you are in.

Oh yes, I actually learned this when following your advice in the first reply to use chardet. I had to add the Python path to the environment variables and then install pip to be able to install chardet.

Quote:That is from cmder, which I use, a much better shell than cmd/PowerShell.

That would be another thing for me to look at and learn.


Quote:When I use rb, I read the file in binary (no encoding).
To get Unicode text you still need to decode the bytes with an encoding.

I actually meant these two examples:

C:\code
λ chardetect myfile.txt
myfile.txt: ascii with confidence 1.0
# copy of text in post and save as utf-8
λ chardetect test.srt
test.srt: utf-8 with confidence 0.505
Thank you very much for your help and all your insights. This has been great because it's given me a lot more to study and learn.

Best regards.
#8
(Sep-02-2019, 09:45 PM)karkas Wrote: 9
00:00:15,377 --> 00:00:18,570
ESTAMOS HACIENDO
UN FASCINANTE EXPERIMENTO.

10
00:00:19,150 --> 00:00:20,280
AÚN LO ESCUCHO.

(Sep-06-2019, 02:02 PM)karkas Wrote: 1
00:00:00,066 --> 00:00:01,888
HOLA, SOY <i>JACK RICO,</i>

2
00:00:01,888 --> 00:00:04,444
<i>Y ESTO ES </i>"TALLER
DEL CONSUMIDOR".

3
00:00:04,444 --> 00:00:05,530
<i>[MÚSICA]</i>

Hi!

I think you are dealing with subtitles, and although it's a matter of personal preference, I would like to comment on something unrelated to your question. It's about formatting (by the way, you also asked about 'f'; that is related to formatting strings too).

I personally (so feel free to ignore my suggestions) find that UPPERCASE LETTERS in subtitles, as in a chat, read as if somebody is shouting instead of speaking (in fact, when lowercase letters are used, UPPERCASE LETTERS are sometimes used to show that somebody is shouting or emphasizing something).

Therefore, I would personally use something like the following, instead of what you provide as an example (of course, you can completely ignore my advice):
Quote:9
00:00:15,377 --> 00:00:18,570
Estamos haciendo
un experimento fascinante.

10
00:00:19,150 --> 00:00:20,280
Todavía lo escucho.

Quote:1
00:00:00,066 --> 00:00:01,888
¡Hola! Soy <i>Jack Rico</i>,

2
00:00:01,888 --> 00:00:04,444
<i>y esto es </i>"Taller
del Consumidor".

3
00:00:04,444 --> 00:00:05,530
<i>[Música]</i>

All the best,
newbieAuggie2019

#9

Hello, newbieAuggie2019.

Thanks for your reply. I have been away and for some reason didn't get a notification for this message. I understand what you mean; it's very common in forums and other interactive environments to avoid writing in uppercase. However, this is part of the TV channel's style and their request, so this is how it has to be done for them. When working with subtitles and captions, it all depends on what the client needs and prefers. In fact, sometimes a client will make a request that doesn't help readability, although I don't think that is the case here.

When dealing with subtitles, I personally don't consider this as shouting; when you want to convey that, you use a sound effects tag. Probably the client wants to make sure the captions are read by everyone irrespective of how far they are from the TV. I'm not sure, though.