Python Forum
'utf-8' codec can't decode byte 0xda in position 184: invalid continuation byte
#3
Thank you very much for your very complete reply, snippsat.

Quote:Try detect encoding chardet of file.

This has taken me some time and I haven't found a way to figure it out. I tried this:

import chardet
rawData=open('1_Consumer Reports 02.srt',"r").read()
rawDataBytes = bytes(rawData, 'utf-8')
chardet.detect(rawDataBytes)
with the result being

Output:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
But then, with other tests, I've realized that chardet.detect is actually detecting the encoding I pass to the bytes function.

How can I pass the raw data in binary format of a read file without "altering" the encoding, which is what this appears to be doing? I got several errors before I made this "work."
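One way to get the bytes exactly as they are stored on disk is to open the file in binary mode ("rb") instead of text mode, so that no decoding or re-encoding happens before detection. The sketch below is only an illustration under stated assumptions: the sample file, its cp1252 encoding, and the helper first_working_encoding are hypothetical stand-ins, not the real subtitle file.

```python
import os
import tempfile

# Hypothetical stand-in for the subtitle file: the same kind of text saved
# as cp1252, where "Ú" is stored as the single byte 0xDA.
sample_text = "3\n00:00:04,444 --> 00:00:05,530\n<i>[MÚSICA]</i>\n"
path = os.path.join(tempfile.mkdtemp(), "sample.srt")
with open(path, "wb") as f:
    f.write(sample_text.encode("cp1252"))

# "rb" returns the bytes exactly as stored; nothing is decoded or re-encoded.
with open(path, "rb") as f:
    raw_bytes = f.read()


def first_working_encoding(data, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


guess = first_working_encoding(raw_bytes)
print(b"\xda" in raw_bytes)  # the byte named in the error message is present
print(guess)

# If chardet is installed, it accepts the same bytes object directly:
#     chardet.detect(raw_bytes)
```

Opening in text mode and then calling bytes(rawData, 'utf-8') hands chardet freshly encoded UTF-8, which would explain why it always reported utf-8.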



Quote:If i save test from your example,it will work as i always save files in utf-8.

I tried this and realized that the excerpt I pasted, which you used to run this test, doesn't raise this exception. By the way, the 'f' after the '8' in the encoding raises another exception by itself; I guess it's just a typo.

with open('test.srt', encoding='utf-8f') as f:
    file_list = [i.strip() for i in f]
This excerpt does raise the exact same exception.

1
00:00:00,066 --> 00:00:01,888
HOLA, SOY <i>JACK RICO,</i>

2
00:00:01,888 --> 00:00:04,444
<i>Y ESTO ES </i>"TALLER
DEL CONSUMIDOR".

3
00:00:04,444 --> 00:00:05,530
<i>[MÚSICA]</i>


* The italics were just to try something with the converter I'm writing. I'm pasting the files as is.

Quote:Can also read in with utf-8 and errors='ignore' or errors='replace'.

I tried this and realized the problem is with the Ú in 'MÚSICA' in the third subtitle. However, doing this alters the file and it won't be correctly converted.
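A small sketch of what the two error handlers do to that line, compared with decoding it using the encoding the bytes were actually written in. The choice of cp1252 is an assumption: byte 0xDA is "Ú" in both Latin-1 and cp1252, which matches the byte in the error message, but the real file's encoding is not confirmed in this thread.

```python
# Assumption: the file was saved as cp1252, where "Ú" is the single byte 0xDA.
raw = "<i>[MÚSICA]</i>".encode("cp1252")

ignored = raw.decode("utf-8", errors="ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD for it
correct = raw.decode("cp1252")                    # keeps the original character

print(ignored)
print(replaced)
print(correct)

# Once decoded correctly, the text can be re-encoded as UTF-8,
# where "Ú" becomes the two bytes 0xC3 0x9A.
utf8_bytes = correct.encode("utf-8")
print(utf8_bytes)
```

Both handlers lose the character, which is why the converted output looks altered; reading with the matching encoding and writing back as UTF-8 keeps the text intact.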

What I still don't understand is why, if I'm not rewriting that character and it's read normally by the first function that uses that file, it gets messed up and then, even if I read with 'utf-8', I still get this error.


Quote:Use string formatting f-string,then it look much nicer than all +.

This is great. I've been reading a little about it and, from that and what I can understand from your example, I don't even need the quotes along with the +. I don't even need to convert ints to strings, right?
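A minimal comparison of the two styles, using made-up variable names (index, start, end, text) rather than anything from the actual converter. Inside an f-string, each expression in braces is formatted to text automatically, so the int needs no str() call.

```python
# Hypothetical values standing in for one parsed subtitle entry.
index = 3
start = "00:00:04,444"
end = "00:00:05,530"
text = "[MÚSICA]"

# Concatenation: every piece must already be a str, hence str(index).
joined = str(index) + "\n" + start + " --> " + end + "\n" + text

# f-string: expressions go inside {} and are converted to text for you.
formatted = f"{index}\n{start} --> {end}\n{text}"

print(formatted)
print(joined == formatted)
```

The quotes around the whole f-string are still required; only the extra quotes and + signs between the pieces go away.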

Thank you very much for your examples and explanations.


Messages In This Thread
RE: 'utf-8' codec can't decode byte 0xda in position 184: invalid continuation byte - by karkas - Sep-06-2019, 02:02 PM
