Python Forum

I'm trying to read several large data files (~600-700k rows each) into dataframes so I can clean and append them to create a large panel dataset. When I import them, I get the following error:

Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 4: invalid continuation byte

When I restrict the read to nrows=5000, it works, but somewhere between 5000 and 6000 rows the error happens again. As far as I can tell there isn't anything wrong with the file: I've had no issues importing it, or the other files, into R. Here's the link to the publicly available .xlsx file that I converted to CSV before reading it into Python: https://www.foreignlaborcert.doleta.gov/..._FY17.xlsx. Thanks in advance for your help in getting this issue resolved!

import pandas as pd
df_17 = pd.read_csv("C:\\Users\\bryanlm\\Python Projects\\Immigration\\LCA Dataset\\Aggregation\\17_H-1B_Disclosure_Data_FY17.csv")

Output:
df_17 = pd.read_csv("C:\\Users\\bryanlm\\Python Projects\\Immigration\\LCA Dataset\\Aggregation\\17_H-1B_Disclosure_Data_FY17.csv", nrows = 5900)
Traceback (most recent call last):

  File "<ipython-input-4-c62aa366fb87>", line 1, in <module>
    df_17 = pd.read_csv("C:\\Users\\bryanlm\\Python Projects\\Immigration\\LCA Dataset\\Aggregation\\17_H-1B_Disclosure_Data_FY17.csv", nrows = 5900)

  File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 446, in _read
    data = parser.read(nrows)

  File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)

  File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)

  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read

  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory

  File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows

  File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data

  File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens

  File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype

  File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert

  File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 4: invalid continuation byte
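
One way to narrow this down is to scan the file as raw bytes and list the lines that are not valid UTF-8. This is only a sketch: the helper name `find_undecodable_lines` is made up for illustration, and it assumes the CSV uses ordinary newline-terminated rows. An "invalid continuation byte" error on 0xd6 is consistent with a file saved in a single-byte Windows encoding such as cp1252 (where 0xd6 is "Ö") rather than UTF-8, but the post does not confirm which encoding the Excel-to-CSV conversion actually used.

```python
def find_undecodable_lines(path, encoding="utf-8", limit=5):
    """Return up to `limit` (line_number, raw_bytes) pairs for lines
    that cannot be decoded with `encoding`.

    Hypothetical diagnostic helper, not part of pandas.
    """
    bad = []
    # Read in binary mode so no decoding happens until we ask for it.
    with open(path, "rb") as fh:
        for line_number, raw in enumerate(fh, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append((line_number, raw))
                if len(bad) >= limit:
                    break
    return bad
```

Calling `find_undecodable_lines(path_to_csv)` and printing each `raw.decode("cp1252")` would show whether the offending rows look sensible under that encoding. If they do, passing `encoding="cp1252"` (or `"latin-1"`) to `pd.read_csv` may be worth trying. That is a guess about this particular file, not a confirmed fix.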
