Thank you very much for your very complete reply, snippsat.
Quote: Try detect encoding chardet of file.
This has taken me some time and I haven't found a way to figure it out. I tried this:
import chardet
rawData = open('1_Consumer Reports 02.srt', "r").read()
rawDataBytes = bytes(rawData, 'utf-8')
chardet.detect(rawDataBytes)
with the result being
Output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
But then, with other tests, I've realized that chardet.detect is actually detecting the encoding I pass to the bytes function.
How can I pass the raw data of a read file in binary format without "altering" the encoding, which is what this appears to be doing? I got several errors before I made this "work."
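For reference, a minimal sketch of one way around this: open the file in binary mode ("rb"), so the bytes come back exactly as stored on disk and nothing is decoded or re-encoded first. The helper names read_raw_bytes and guess_encoding are made up for this illustration, and the sample file is written as cp1252 purely as an assumption, since the real encoding of the .srt file is not known here. The chardet call is optional and only runs if the package is installed.

```python
import os
import tempfile


def read_raw_bytes(path):
    """Return the file's bytes exactly as stored on disk (no decoding)."""
    with open(path, "rb") as fh:
        return fh.read()


def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate codec that decodes raw without error.

    This is a crude stdlib-only check, not a replacement for chardet.
    The candidate list is an assumption made for this example.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


# Build a small sample file. Writing it as cp1252 is an assumption.
sample = "<i>[MÚSICA]</i>\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.srt")
    with open(path, "wb") as fh:
        fh.write(sample.encode("cp1252"))

    raw = read_raw_bytes(path)
    print(raw)                  # the untouched bytes from disk
    print(guess_encoding(raw))  # first candidate that decodes cleanly

    try:
        import chardet  # third-party, optional here
        print(chardet.detect(raw))
    except ImportError:
        pass
```

Since bytes(rawData, 'utf-8') always produces UTF-8 bytes from an already decoded string, a detector run on that result can only ever see UTF-8. Reading with "rb" skips that step.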
Quote: If i save test from your example, it will work as i always save files in utf-8.
I tried this and realized that the excerpt I pasted, which you used to run this test, doesn't raise this exception. By the way, the 'f' after the '8' in the encoding raises a different exception of its own; I guess it's just a typo.
with open('test.srt', encoding='utf-8f') as f:
    file_list = [i.strip() for i in f]
This excerpt does raise the exact same exception.
1
00:00:00,066 --> 00:00:01,888
HOLA, SOY <i>JACK RICO,</i>
2
00:00:01,888 --> 00:00:04,444
<i>Y ESTO ES </i>"TALLER
DEL CONSUMIDOR".
3
00:00:04,444 --> 00:00:05,530
<i>[MÚSICA]</i>
* The italics were just to try something with the converter I'm writing. I'm pasting the files as is.
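To tell the two exceptions mentioned above apart, here is a small sketch. The function classify_read_error is a hypothetical helper for this post, and encoding the sample lines as cp1252 is an assumption about how the original file might have been saved. An unknown codec name such as 'utf-8f' fails at lookup time, while a valid codec that does not match the bytes fails only when it reaches a byte it cannot decode.

```python
import codecs


def classify_read_error(raw, encoding):
    """Return 'unknown encoding', 'decode error', or 'ok' for raw bytes."""
    try:
        codecs.lookup(encoding)      # raises LookupError for bad codec names
    except LookupError:
        return "unknown encoding"
    try:
        raw.decode(encoding)         # raises UnicodeDecodeError on bad bytes
    except UnicodeDecodeError:
        return "decode error"
    return "ok"


# Assumption: the original file stores the accented letter as one cp1252 byte.
accented = "<i>[MÚSICA]</i>".encode("cp1252")
plain = "HOLA, SOY <i>JACK RICO,</i>".encode("cp1252")

print(classify_read_error(accented, "utf-8f"))  # unknown encoding
print(classify_read_error(accented, "utf-8"))   # decode error
print(classify_read_error(accented, "cp1252"))  # ok
print(classify_read_error(plain, "utf-8"))      # ok
```

Lines made only of ASCII characters decode the same way under both codecs, which would fit the error showing up only at the third subtitle.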
Quote: Can also read in with utf-8 and errors='ignore' or errors='replace'.
I tried this and realized the problem is with the Ú in 'MÚSICA' in the third subtitle. However, doing this alters the file and it won't be correctly converted.
What I still don't understand is why, if I'm not rewriting that character and it's read normally by the first function that uses that file, it gets messed up and then, even if I read with 'utf-8', I still get this error.
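A short sketch of why the error handlers change the text, next to an alternative that keeps it intact. It assumes, without confirmation, that the file was saved as cp1252; transcode_to_utf8 is a hypothetical helper name. One possible explanation for the first read working, also unconfirmed: open() without an encoding argument uses the platform's locale encoding, so that read may have gone through a different codec than 'utf-8'.

```python
def transcode_to_utf8(raw, source_encoding):
    """Decode raw with its real encoding, then re-encode it as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Assumption: the accented letter is stored as a single cp1252 byte.
raw = "MÚSICA".encode("cp1252")

ignored = raw.decode("utf-8", errors="ignore")    # undecodable byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # byte becomes U+FFFD
correct = raw.decode("cp1252")                    # byte maps back to the letter

print(repr(ignored))
print(repr(replaced))
print(repr(correct))

# After transcoding, the same text reads cleanly as UTF-8.
utf8_bytes = transcode_to_utf8(raw, "cp1252")
print(utf8_bytes.decode("utf-8"))
```

Both 'ignore' and 'replace' lose the original character, so decoding with the file's actual encoding (once it is known) and saving as UTF-8 is the variant that leaves the subtitle text unchanged.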
Quote: Use string formatting f-string, then it look much nicer than all +.
This is great. I've been reading a little about it and, from that and what I can understand from your example, I don't even need the quotes along with the +. I don't even need to convert ints to strings, right?
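As a quick check of that understanding, a sketch comparing the two styles. The functions format_timestamp and format_cue are hypothetical names, not part of the converter discussed in this thread. Inside the braces of an f-string, values are formatted automatically, so ints need no str() call, and a format spec such as 02d pads with zeros.

```python
def format_timestamp(hours, minutes, seconds, millis):
    """Build an SRT-style timestamp from ints, zero-padded via format specs."""
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def format_cue(index, start, end, text):
    """Build one subtitle cue; index is an int and is not converted by hand."""
    return f"{index}\n{start} --> {end}\n{text}\n"


index = 3
start = format_timestamp(0, 0, 4, 444)
end = format_timestamp(0, 0, 5, 530)

# Concatenation needs str() for the int and explicit quoted separators.
with_plus = str(index) + "\n" + start + " --> " + end + "\n" + "<i>[MÚSICA]</i>" + "\n"

# The f-string version takes the int as is.
with_fstring = format_cue(index, start, end, "<i>[MÚSICA]</i>")

print(with_fstring)
print(with_plus == with_fstring)
```

The literal parts of the text still sit inside the one pair of quotes that delimits the f-string; only the separate quoted pieces and the + signs go away.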
Thank you very much for your examples and explanations.