Python Forum

UnicodeDecodeError: - garikhgh0 - Apr-10-2018

Hi.
How do I handle this?
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 16: invalid start byte.

Thanks a lot.


RE: UnicodeDecodeError: - Gribouillis - Apr-10-2018

You're perhaps trying to decode bytes that were not encoded as UTF-8. What do you know about this data? You can try other encodings such as iso8859-1 or cp1252. The chardet module can help you guess the encoding.
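
The "try other encodings" suggestion could be sketched roughly as follows, using only the standard library. The helper name `first_working_encoding` and the candidate list are assumptions made for illustration; they do not come from the thread. Note that iso8859-1 accepts every byte value, so it never fails and only makes sense as a last resort. A decode that succeeds does not prove the guess is right.

```python
def first_working_encoding(data, candidates=("utf-8", "cp1252", "iso8859-1")):
    """Return the first candidate encoding that decodes `data` without error.

    Hypothetical helper: the candidates and their order are assumptions.
    Returns None when every candidate fails.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next
        return name
    return None


# 0xa0 cannot start a UTF-8 sequence, so "utf-8" is rejected for this sample.
sample = b"Charger \xa0Black"
print(first_working_encoding(sample))
```

The result is only a candidate: check the decoded text by eye before trusting it.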


RE: UnicodeDecodeError: - garikhgh0 - Apr-10-2018

Hi, I downloaded it from my SAS EG in txt format. Reading the first 100 rows works, but when I try the first 100000 it raises the error.

   Rep_Date    Item_Name    Item_Catalog_name  Warehouse  Qty    Rep_item
0  2016-02-01  ALO           Pre                A RSC      13    Sales
1  2016-02-01  ALO           Pre                B RSC       3    Sales
2  2016-02-01  ALo           Pre                C RSC       2    Sales
3  2016-02-01  ALO           Pre                D RSC      13    Sales
4  2016-02-01  ALO           Pre                F RSC       9    Sales



RE: UnicodeDecodeError: - Gribouillis - Apr-10-2018

By bisection, it should be easy to find the first faulty line in the file. Try the first 50000 lines, and so on, until you can locate and print the faulty line.
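
A minimal sketch of that bisection, assuming the file fits in memory as a list of raw byte lines. The function names `decodes` and `first_faulty_line` are hypothetical. It relies on one property: if the first n lines decode as UTF-8, then any shorter prefix decodes too, so the first faulty line can be found by halving the range.

```python
def decodes(chunk, encoding="utf-8"):
    """Return True if `chunk` (bytes) decodes cleanly with `encoding`."""
    try:
        chunk.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def first_faulty_line(lines, encoding="utf-8"):
    """Return the 1-based number of the first undecodable line, or None."""
    if decodes(b"".join(lines), encoding):
        return None
    # Invariant: the first `lo` lines decode, the first `hi` lines do not.
    lo, hi = 0, len(lines)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if decodes(b"".join(lines[:mid]), encoding):
            lo = mid
        else:
            hi = mid
    return hi


# In-memory stand-in for a file read with open(path, 'rb').readlines()
demo = [b"a,1\n", b"b,2\n", b"c \xa0d,3\n", b"e,4\n"]
print(first_faulty_line(demo))
```

For a single pass over one file, a plain line-by-line loop is just as good; bisection mainly helps when each attempt is an expensive call, such as re-reading with a different row limit.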


RE: UnicodeDecodeError: - garikhgh0 - Apr-10-2018

It says that it is in position 31

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 31: invalid start byte


RE: UnicodeDecodeError: - Gribouillis - Apr-10-2018

(Apr-10-2018, 10:32 AM)garikhgh0 Wrote: it says that it is in position 31
Position 31 in which string? The whole text file or some current line during the reading of the file? Can you post the complete exception traceback?


RE: UnicodeDecodeError: - garikhgh0 - Apr-10-2018

Traceback (most recent call last):
  File "pandas\_libs\parsers.pyx", line 1175, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1539, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 31: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\garhakobyan\Desktop\My Project_on_Python\reading_675.py", line 4, in <module>
    df = pd.read_csv('R675.csv', encoding  = 'utf_8')
  File "C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 1001, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1130, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1182, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1539, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 31: invalid start byte
[Finished in 1.5s with exit code 1]
[shell_cmd: python -u "C:\Users\garhakobyan\Desktop\My Project_on_Python\reading_675.py"]
[dir: C:\Users\garhakobyan\Desktop\My Project_on_Python]
[path: C:\ProgramData\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\SAS\SharedFiles(32)\Formats;C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\Scripts\;C:\Users\garhakobyan\AppData\Local\Programs\Python\Python36-32\]



RE: UnicodeDecodeError: - Gribouillis - Apr-10-2018

You can track the error by running this code (you may need to change the path to the CSV file):
# foo.py
CSVFILE = 'R675.csv'
# Read raw bytes so that each line is decoded (and can fail) on its own.
with open(CSVFILE, 'rb') as ifh:
    for i, line in enumerate(ifh, 1):
        try:
            s = line.decode('utf-8')
        except UnicodeDecodeError as err:
            # repr() shows the undecodable byte as an escape such as \xa0.
            print(CSVFILE + ': ERROR AT LINE', i, repr(line))
            break
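
As a possible extension of the script above, the exception object itself says where the bad byte sits: `UnicodeDecodeError` has a `start` attribute holding the index of the first offending byte in the data being decoded. The sketch below is an illustration only; the function name `locate_bad_bytes` is hypothetical, and it reports only the first bad byte of each failing line.

```python
def locate_bad_bytes(lines, encoding="utf-8"):
    """Return (line_number, byte_offset, hex_value) for the first bad byte
    of every line that fails to decode.

    `lines` is any iterable of bytes objects, for example a file opened
    in 'rb' mode.
    """
    problems = []
    for line_number, line in enumerate(lines, 1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the index of the offending byte within `line`
            problems.append((line_number, err.start, hex(line[err.start])))
    return problems


demo = [b"ok\n", b"Charger \xffBlack\n"]
print(locate_bad_bytes(demo))
```

Unlike the original loop, this version does not stop at the first error, so it lists every affected line in one run.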



RE: UnicodeDecodeError: - snippsat - Apr-10-2018

You should mention that you use pandas.
Do you read it as utf-8?
Post your code with a sample of the CSV where the error is.
import pandas as pd

df = pd.read_csv('file_name.csv', encoding='utf-8')
The same applies to the code @Grib has posted: open() also has an option to set the encoding.
with open(CSVFILE, encoding='utf-8') as ifh:
    text = ifh.read()
You can also ignore or replace errors.
with open(CSVFILE, encoding='utf-8', errors='ignore') as ifh:
    text = ifh.read()
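
To make the difference between the error handlers concrete, here is a small sketch on an in-memory byte string modelled on the kind of line that fails in this thread. The variable names are illustrative. The same `errors` values can be passed to `open()`.

```python
raw = b"Charger \xffBlack"  # 0xff is never valid in UTF-8

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")

# 'backslashreplace' keeps the byte value visible as an escape sequence.
escaped = raw.decode("utf-8", errors="backslashreplace")

print(repr(ignored))
print(repr(replaced))
print(repr(escaped))
```

All three handlers hide the fact that the file is probably not UTF-8 at all, so they are best treated as a workaround rather than a fix.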



RE: UnicodeDecodeError: - garikhgh0 - Apr-10-2018

It gave this:

R675.csv: ERROR AT LINE 10538 b'2018-03-26,HQ Service Center,Handset,Samsung Galaxy S9+ and Charger \xffBlack,1,Sales\r\n'

It worked after I cleared the white space between "Charger" and "Black". But I could not understand how it could affect the reading process when there was nothing there.
Is there a method to show the hidden symbols that cause the problem?


Thanks a lot :)
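
One way to make such hidden characters visible, sketched under the assumption that the file is really cp1252 (the thread never establishes the actual encoding): decode the line with that encoding and list every non-ASCII character with its code point and Unicode name. The function name `describe_non_ascii` is hypothetical. Printing `repr(line)` on the raw bytes, as the script earlier in the thread does, is the simpler alternative.

```python
import unicodedata


def describe_non_ascii(raw, encoding="cp1252"):
    """List (index, code point, Unicode name) for each non-ASCII character.

    `encoding` is a guess, not a fact about the file; bytes it cannot map
    are shown as U+FFFD because of errors='replace'.
    """
    text = raw.decode(encoding, errors="replace")
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for index, ch in enumerate(text)
        if ord(ch) > 127
    ]


# 0xa0 looks like an ordinary space on screen but is a different character.
print(describe_non_ascii(b"Charger \xa0Black"))
```

Under this assumption, byte 0xa0 is a no-break space, which would explain why the field looked like it contained nothing unusual.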