Posts: 38
Threads: 17
Joined: Jan 2018
Hi.
How do I handle this error?
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 16: invalid start byte.
thanks a lot.
Posts: 4,488
Threads: 69
Joined: Jan 2018
Apr-10-2018, 08:16 AM
(This post was last modified: Apr-10-2018, 08:17 AM by Gribouillis.)
You're perhaps trying to decode a string which was not encoded in the utf8 encoding. What do you know about this string? You can try other encodings such as iso8859-1 or cp1252. The chardet module can help you guess the string's encoding.
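As a rough illustration of trying other encodings, here is a minimal sketch using only the standard library. The helper name first_working_encoding and the sample bytes are made up for this example; they are not from the original file.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "iso8859-1")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample: byte 0xa0 is a non-breaking space in cp1252 and
# iso8859-1, but it is not a valid start byte in utf-8.
sample = b"Charger \xa0Black"
print(first_working_encoding(sample))  # prints: cp1252
```

Keep in mind that iso8859-1 maps every possible byte to a character, so it never raises. A successful decode only shows that the bytes are valid in that encoding, not that the guess is right.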
Posts: 38
Threads: 17
Joined: Jan 2018
Hi, I downloaded it from my SAS EG. It is in txt format. When I read the first 100 rows it works, but when I try the first 100000 it raises the error.
Rep_Date Item_Name Item_Catalog_name Warehouse Qty Rep_item
0 2016-02-01 ALO Pre A RSC 13 Sales
1 2016-02-01 ALO Pre B RSC 3 Sales
2 2016-02-01 ALo Pre C RSC 2 Sales
3 2016-02-01 ALO Pre D RSC 13 Sales
4 2016-02-01 ALO Pre F RSC 9 Sales
Posts: 4,488
Threads: 69
Joined: Jan 2018
Apr-10-2018, 10:00 AM
(This post was last modified: Apr-10-2018, 10:01 AM by Gribouillis.)
By bisection, it should be easy to find the first faulty line in the file. Try with the first 50000 lines etc until you can locate and print the faulty line.
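The bisection idea could be sketched roughly like this. The function names and the sample lines are hypothetical; for a real file the list would presumably come from something like open(path, 'rb').readlines().

```python
def decodes_ok(lines, encoding="utf-8"):
    """Return True if every raw line decodes cleanly."""
    try:
        for line in lines:
            line.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def first_bad_line(lines, encoding="utf-8"):
    """Bisect for the 1-based number of the first undecodable line, or None."""
    if decodes_ok(lines, encoding):
        return None
    # Invariant: the first `lo` lines decode, the first `hi` lines do not.
    lo, hi = 0, len(lines)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if decodes_ok(lines[:mid], encoding):
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical raw lines standing in for the file contents.
sample = [b"a,b\r\n", b"1,2\r\n", b"Charger \xa0Black\r\n", b"3,4\r\n"]
print(first_bad_line(sample))  # prints: 3
```

Reading the file in binary mode matters here: opening it in text mode would raise the same UnicodeDecodeError before any line could be inspected.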
Posts: 38
Threads: 17
Joined: Jan 2018
It says that it is in position 31:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 31: invalid start byte
Posts: 4,488
Threads: 69
Joined: Jan 2018
Apr-10-2018, 11:40 AM
(This post was last modified: Apr-10-2018, 11:40 AM by Gribouillis.)
(Apr-10-2018, 10:32 AM)garikhgh0 Wrote: it says that it is in position 31
Position 31 in which string? The whole text file or some current line during the reading of the file? Can you post the complete exception traceback?
Posts: 38
Threads: 17
Joined: Jan 2018
Traceback (most recent call last):
  File "pandas\_libs\parsers.pyx", line 1175, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1539, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 31: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\garhakobyan\Desktop\My Project_on_Python\reading_675.py", line 4, in <module>
    df = pd.read_csv('R675.csv', encoding = 'utf_8')
  File "C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 1001, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1130, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1182, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1539, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 31: invalid start byte
[Finished in 1.5s with exit code 1]
[shell_cmd: python -u "C:\Users\garhakobyan\Desktop\My Project_on_Python\reading_675.py"]
[dir: C:\Users\garhakobyan\Desktop\My Project_on_Python]
[path: C:\ProgramData\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\SAS\SharedFiles(32)\Formats;C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\Scripts\;C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\]
Posts: 4,488
Threads: 69
Joined: Jan 2018
Apr-10-2018, 12:22 PM
(This post was last modified: Apr-10-2018, 12:46 PM by Gribouillis.)
You can track the error by running this code (you may need to change the path to the csv file):
# foo.py
CSVFILE = 'R675.csv'

# Read raw bytes so that each line can be decoded (and can fail) on its own.
with open(CSVFILE, 'rb') as ifh:
    for i, line in enumerate(ifh, 1):
        try:
            line.decode('utf-8')
        except UnicodeDecodeError:
            # repr() shows the offending bytes as \x.. escapes.
            print(CSVFILE + ': ERROR AT LINE', i, repr(line))
            break
Posts: 7,080
Threads: 122
Joined: Sep 2016
You should mention that you use pandas.
Do you read it as utf-8?
Post your code with a sample of the CSV where the error is.
import pandas as pd
df = pd.read_csv('file_name.csv', encoding='utf-8')
The same goes for the code @Grib has posted; setting the encoding is an option.
with open(CSVFILE, encoding='utf-8') as ifh:
You can also ignore or replace the errors.
with open(CSVFILE, encoding='utf-8', errors='ignore') as ifh:
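To see what the two error handlers actually do, here is a small sketch on a made-up byte string. bytes.decode() accepts the same errors values as open(), so the effect should be the same when reading a file.

```python
raw = b"Charger \xa0Black"  # hypothetical bytes that are not valid utf-8

# errors="ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(ascii(replaced))
```

Both options lose the original character, so they hide the problem rather than fix it. Decoding with the correct encoding is usually preferable when it can be identified.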
Posts: 38
Threads: 17
Joined: Jan 2018
Apr-10-2018, 01:02 PM
(This post was last modified: Apr-10-2018, 01:13 PM by garikhgh0.)
It gave this:
R675.csv: ERROR AT LINE 10538 b'2018-03-26,HQ Service Center,Handset,Samsung Galaxy S9+ and Charger \xffBlack,1,Sales\r\n'
It worked: I removed the white space between "Charger" and "Black". But I could not understand how it could affect the reading process when there was nothing visible there.
Is there a method to show the hidden symbols that cause the problem?
thanks a lot :)
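One possible way to make such hidden characters visible, as a minimal sketch: ascii() escapes every non-ASCII character, and the unicodedata module can name it. The helper describe_hidden and the sample line are hypothetical. The sample assumes the 0xa0 byte from the error message and a cp1252 source encoding, neither of which is confirmed for the real file.

```python
import unicodedata


def describe_hidden(text):
    """List (index, code point, name) for each non-ASCII or non-printable character."""
    found = []
    for i, ch in enumerate(text):
        if ord(ch) > 127 or not ch.isprintable():
            name = unicodedata.name(ch, "UNKNOWN")
            found.append((i, "U+%04X" % ord(ch), name))
    return found


# Assumed sample: decode with cp1252 so that byte 0xa0 becomes U+00A0.
line = b"Charger \xa0Black".decode("cp1252")

print(ascii(line))            # prints: 'Charger \xa0Black'
print(describe_hidden(line))  # prints: [(8, 'U+00A0', 'NO-BREAK SPACE')]
```

A no-break space looks like an ordinary space in most editors, which would explain why nothing seemed to be there.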