'utf-8' codec can't decode byte 0xda in position 184: invalid continuation byte

karkas · (This post was last modified: Sep-06-2019, 02:02 PM by karkas.)

Thank you very much for your very complete reply, snippsat.

Quote:Try detect encoding chardet of file.

This has taken me some time and I haven't found a way to figure it out. I tried this:

import chardet
rawData=open('1_Consumer Reports 02.srt',"r").read()
rawDataBytes = bytes(rawData, 'utf-8')
chardet.detect(rawDataBytes)

with the result being

Output:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

But after further tests, I realized that chardet.detect is simply detecting the encoding I pass to the bytes function.

How can I pass a file's raw binary data to it without "altering" the encoding, which is what my code appears to be doing? I got several errors before I made this "work."
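
One possible approach, sketched under assumptions: opening the file in binary mode ("rb") returns the bytes exactly as they are stored, so nothing is decoded or re-encoded before detection. The helper name read_raw_bytes and the temporary Latin-1 demo file are hypothetical stand-ins for the real .srt file, and chardet is only called if it happens to be installed.

```python
import os
import tempfile


def read_raw_bytes(path):
    """Return the file's bytes exactly as stored on disk (no decoding)."""
    with open(path, "rb") as f:
        return f.read()


# Hypothetical demo file: one subtitle line saved as Latin-1, not UTF-8.
with tempfile.NamedTemporaryFile(suffix=".srt", delete=False) as tmp:
    tmp.write("[M\u00daSICA]\n".encode("latin-1"))
    demo_path = tmp.name

raw_data = read_raw_bytes(demo_path)
os.remove(demo_path)

# Re-encoding already-decoded text always yields valid UTF-8,
# which is why detection on such bytes reports 'utf-8'.
reencoded = bytes(raw_data.decode("latin-1"), "utf-8")

print(raw_data)   # original bytes, 0xDA still present
print(reencoded)  # UTF-8 bytes, where the same character is 0xC3 0x9A

try:
    import chardet  # third-party; skipped when not installed
    print(chardet.detect(raw_data))
except ImportError:
    pass
```

With the raw bytes in hand, the detector sees what is really in the file rather than the result of bytes(rawData, 'utf-8').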

Quote:If i save test from your example,it will work as i always save files in utf-8.

I tried this and realized that the excerpt I pasted, which you used to run this test, doesn't raise this exception. By the way, the 'f' after the '8' in the encoding name raises a different exception on its own; I guess it's just a typo.

with open('test.srt', encoding='utf-8f') as f:
    file_list = [i.strip() for i in f]

The following excerpt, on the other hand, does raise the exact same exception.

1
00:00:00,066 --> 00:00:01,888
HOLA, SOY JACK RICO,

2
00:00:01,888 --> 00:00:04,444
Y ESTO ES "TALLER
DEL CONSUMIDOR".

3
00:00:04,444 --> 00:00:05,530
[MÚSICA]

* The italics were just to try something with the converter I'm writing. I'm pasting the files as is.

Quote:Can also read in with utf-8 and errors='ignore' or errors='replace'.

I tried this and realized the problem is with the Ú in 'MÚSICA' in the third subtitle. However, doing this alters the file and it won't be correctly converted.
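
A small sketch of why these two options alter the text. It assumes the offending byte is 0xDA standing alone, as the error message suggests; raw_line is a hypothetical one-line sample, not the real file.

```python
# Bytes of "[MÚSICA]" as a single-byte encoding such as Latin-1 stores them.
raw_line = b"[M\xdaSICA]"

# errors="ignore" silently drops the byte that is not valid UTF-8.
ignored = raw_line.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD (the replacement character) for it.
replaced = raw_line.decode("utf-8", errors="replace")

print(ignored)
print(replaced)
```

In both cases the original Ú is gone, so anything written back out from that text no longer matches the source subtitle.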

What I still don't understand is this: I'm not rewriting that character, and the first function that uses the file reads it normally, so why does it get messed up, and why do I still get this error even when I read with 'utf-8'?
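
A possible explanation, offered as an assumption rather than a diagnosis: byte 0xDA is Ú in Latin-1 and in cp1252, while in UTF-8 it can only start a two-byte sequence, so the decoder rejects it when the next byte is a plain "S". A plain open() without an encoding argument uses the platform's default encoding, which may be why the first read appeared to work. The sketch below reproduces the error on a hypothetical sample line and then decodes it as Latin-1 instead.

```python
# Hypothetical sample: "[MÚSICA]" stored in a single-byte encoding.
raw_line = b"[M\xdaSICA]"

failure = None
try:
    raw_line.decode("utf-8")
except UnicodeDecodeError as exc:
    # Record where decoding failed and the reason Python gives.
    failure = (exc.start, exc.reason)

# Assumption: the file is really Latin-1 (cp1252 maps 0xDA to the same character).
text = raw_line.decode("latin-1")

# Once decoded correctly, the text can be saved as genuine UTF-8.
utf8_bytes = text.encode("utf-8")

print(failure)
print(text)
print(utf8_bytes)
```

If that assumption holds for the real file, reading it with open(path, encoding='latin-1') and writing the result with encoding='utf-8' would convert it without losing the Ú.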

Quote:Use string formatting f-string,then it look much nicer than all +.

This is great. I've been reading a little about it and, from that and what I can understand from your example, I don't even need the quotes along with the +. I don't even need to convert ints to strings, right?
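
A minimal sketch of the difference, using made-up variable names (index, start, end, text) and the third subtitle from the excerpt above: with + every int has to be wrapped in str(), whereas an f-string formats each value inside the braces itself.

```python
index = 3                 # an int, not a string
start = "00:00:04,444"
end = "00:00:05,530"
text = "[M\u00daSICA]"

# Concatenation: quotes, + signs, and an explicit str() for the int.
joined = str(index) + "\n" + start + " --> " + end + "\n" + text

# f-string: values are converted to text automatically.
formatted = f"{index}\n{start} --> {end}\n{text}"

print(formatted)
```

Both variables hold the same subtitle block; the f-string version just needs less punctuation.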

Thank you very much for your examples and explanations.
