Python Forum

utf-8 error with pandas read_csv
I'm trying to read several large data files (~600-700k rows each) into dataframes so I can clean and append them into one large panel dataset. On import, I get the following error:
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 4: invalid continuation byte
When I restrict the read to nrows=5000, it works, but somewhere between 5000 and 6000 rows the error comes back. There isn't anything wrong with the file: I've had no issues importing it, or the other files, into R. Here's the link to the publicly available .xlsx file that I converted to CSV before reading it into Python: https://www.foreignlaborcert.doleta.gov/..._FY17.xlsx. Thanks in advance for your help in getting this issue resolved!

import pandas as pd
df_17 = pd.read_csv("C:\\Users\\bryanlm\\Python Projects\\Immigration\\LCA Dataset\\Aggregation\\17_H-1B_Disclosure_Data_FY17.csv")
Output:
df_17 = pd.read_csv("C:\\Users\\bryanlm\\Python Projects\\Immigration\\LCA Dataset\\Aggregation\\17_H-1B_Disclosure_Data_FY17.csv", nrows = 5900)
Traceback (most recent call last):
  File "<ipython-input-4-c62aa366fb87>", line 1, in <module>
    df_17 = pd.read_csv("C:\\Users\\bryanlm\\Python Projects\\Immigration\\LCA Dataset\\Aggregation\\17_H-1B_Disclosure_Data_FY17.csv", nrows = 5900)
  File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 4: invalid continuation byte
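A possible way to narrow this down, offered as a minimal sketch and not a confirmed diagnosis: the byte 0xd6 is "Ö" in single-byte encodings such as cp1252 and latin-1, which would be consistent with the CSV having been saved in one of those encodings instead of UTF-8. The helper name first_invalid_utf8 and the sample row below are hypothetical stand-ins, not taken from the actual file; the snippet only shows how to find the first offending byte and inspect what surrounds it.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, context) for the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte within `data`
        lo = max(exc.start - 10, 0)
        return exc.start, data[lo:exc.start + 10]
    return None


# Hypothetical stand-in for a header plus one row of the CSV.
# "\u00d6" is "Ö"; encoding as cp1252 stores it as the single byte 0xd6.
sample = "EMPLOYER_NAME,CITY\nK\u00d6NIG LLC,AUSTIN\n".encode("cp1252")

offset, context = first_invalid_utf8(sample)
print(offset, hex(sample[offset]))   # where the bad byte is, and its value
print(context.decode("cp1252"))      # the surrounding text, decoded as cp1252
```

For the real file, the bytes could be obtained with open(path, "rb").read(). If the surrounding text reads sensibly when decoded as cp1252, passing encoding="cp1252" (or encoding="latin-1") to pd.read_csv is a common thing to try. Note that the "position 4" in the error message appears to be relative to the individual field being converted, not to the file as a whole, which would explain why the failure only shows up several thousand rows in.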