 read_csv error and rows/columns missing
#1
I thought I had everything right until I ended up with a different file to read. I've been struggling with it for two days. Help, please.
link to the data: file.zip

Thanks for your time

from os import listdir
from os.path import isfile, join

import pandas as pd

# Location of all files
file_folder = 'path_on_your_computer'

# Save the files into a list (when more than 2)
list_raw_files = [f for f in listdir(file_folder) if isfile(join(file_folder, f))]

# Load the right/given file
for raw_file in list_raw_files:

    # Check the file
    if raw_file.startswith('130'):

        temp_list = []

        for chunk in pd.read_csv(join(file_folder, raw_file), sep=';', header=None,
                                 chunksize=20000, error_bad_lines=False,
                                 low_memory=False):
            temp_list.append(chunk)

        data = pd.concat(temp_list, axis=0)

        del temp_list

data.head(30)
I have this Error/Warning:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

and the data is not complete!
link to my result: output
#2
  • the zip file is corrupt
    to get a list of files in a directory, use:
    from pathlib import Path
    from tkinter import filedialog as fd
    
    def get_dir_file_list():
        fdir = Path(fd.askdirectory())
        fl = [fn for fn in fdir.iterdir() if fn.is_file()]
        return fl
    
    if __name__ == '__main__':
        filelist = get_dir_file_list()
        for filename in filelist:
            print(f"filename: {filename.name}")
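Since karlito only wants files whose names start with '130', `Path.glob` can do that filtering directly. A minimal sketch, using a throwaway temp directory with dummy files so it runs anywhere (swap in your real folder):

```python
from pathlib import Path
import tempfile

# Demo folder with a couple of dummy files (replace with your real path)
file_folder = Path(tempfile.mkdtemp())
(file_folder / '130000054.csv').touch()
(file_folder / 'other.csv').touch()

# glob keeps only files whose names start with '130'
list_raw_files = [p for p in file_folder.glob('130*') if p.is_file()]
print([p.name for p in list_raw_files])  # ['130000054.csv']
```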
    
#3
(Nov-05-2019, 03:28 PM)Larz60+ Wrote:
  • zip file is corrupt

Thanks for your response, and sorry for the late answer,
but getting a list of files in a directory is not my issue here; loading the file is. I already have the file, but reading it is the problem (if you see the pic)... the zip file is corrupted? How? I can unzip it and see the csv file.
#4
Quote: the zip file is corrupted? How? I can unzip it and see the csv file
I cannot
#5
(Nov-05-2019, 10:04 AM)karlito Wrote: I thought I had everything right until I ended up with a different file to read. I've been struggling with it for two days. Help, please. [...]

Try starting the server with jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10, or try adding time.sleep(1) inside the for loop.
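For reference, the flag goes on the server launch, and the same setting can be made persistent through the notebook config file. A sketch using the standard Jupyter config location (the 1.0e10 value is just an example limit):

```shell
# One-off: raise the output rate limit for this session only
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

# Persistent: generate a config file (if you don't have one yet),
# then set the limit in ~/.jupyter/jupyter_notebook_config.py
jupyter notebook --generate-config
echo "c.NotebookApp.iopub_data_rate_limit = 1.0e10" >> ~/.jupyter/jupyter_notebook_config.py
```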
#6
Got a good file.zip this time.

I was able to get this to run (with errors) by reducing the chunk size and making a few other adjustments.
I have 32 GB of memory, and a chunk size of 20,000 blew up.
If you can't get the code below to run (only one file), reduce the chunk size until it does.
And don't forget: you're appending the entire file into a temporary list, so if you don't have enough memory, this will either blow up or be stuck paging memory for a long time.
import os
import pandas as pd

# make sure we're in the proper directory
os.chdir(os.path.abspath(os.path.dirname(__file__)))

# for my test
list_raw_files = ['130000054.csv']

# Location of all files
# file_folder = 'path_on_your_computer'

# # Save the files into a list (when more than 2)
# list_raw_files = [f for f in listdir(file_folder) if isfile(join(file_folder, f))]

# Load the right/given file
for raw_file in list_raw_files:

    # Check the file
    if raw_file.startswith('130'):

        temp_list = []
        file_folder = './'
        for chunk in pd.read_csv(file_folder + raw_file, sep=';', header=None,
                                 chunksize=10000, error_bad_lines=False,
                                 low_memory=False):
            temp_list.append(chunk)

        data = pd.concat(temp_list, axis=0)

        del temp_list

data.head(30)
partial error list:
Output:
b'Skipping line 10: expected 5 fields, saw 6\nSkipping line 11: expected 5 fields, saw 6\nSkipping line 13: expected 5 fields, saw 6\nSkipping line 15: expected 5 fields, saw 6\nSkipping line 17: expected 5 fields, saw 6\nSkipping line 23: expected 5 fields, saw 6\nSkipping line 24: expected 5 fields, saw 6\nSkipping line 25: expected 5 fields, saw 6\nSkipping line 27: expected 5 fields, saw 96\nSkipping line 28: expected 5 fields, saw 96\nSkipping line 29: expected 5 fields, saw 96\nSkipping line 30: expected 5 fields, saw 96\nSkipping line 31: expected 5 fields, saw 96\nSkipping line 32: expected 5 fields, saw 96\nSkipping line 33: expected 5 fields, saw 96\nSkipping line 34: expected 5 fields, saw 6\nSkipping line 35: expected 5 fields, saw 96\nSkipping line 36: expected 5 fields, saw 96\nSkipping line 37: expected 5 fields, saw 6\nSkipping line 38: expected 5 fields, saw 96\nSkipping line 39: expected 5 fields, saw 96\nSkipping line 40: expected 5 fields, saw 96\nSkipping line 41: expected 5 fields, saw 6\nSkipping line 42: expected 5 fields, saw 96\nSkipping line 43: expected 5
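The "expected 5 fields, saw 6" skips happen because pandas infers the width from the first rows. Instead of dropping the wider rows, you can name six columns up front so the 6-field rows are kept and shorter rows are padded with NaN. A sketch on a hypothetical two-line sample shaped like the real data (the rows that spill to 96 fields would still need separate handling, e.g. the csv-module approach below):

```python
import io
import pandas as pd

# Hypothetical sample: most rows have 5 fields, some carry a 6th diagnostic field
sample = (
    "0631960537; 10.01.1990 09:35:37;Intern;0x2207;Batt-SN ungueltig\n"
    "0631963909; 10.01.1990 10:31:49;CHRG;0x0400;Netzfrequenz zu klein;631963907: ENS...\n"
)

# Declaring 6 column names keeps the 6-field rows instead of skipping them;
# the 5-field rows get NaN in the last column
df = pd.read_csv(io.StringIO(sample), sep=';', header=None, names=list(range(6)))
print(df.shape)  # (2, 6)
```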
#7
Here's a sample of what the actual data looks like, and some code that will read it:
raw data sample:
Output:
0631960537; 10.01.1990 09:35:37;Intern;0x2207;Batt-SN ungueltig 0631960539; 10.01.1990 09:35:39;CHRG;0x0400;Netzfrequenz zu klein 0631960539; 10.01.1990 09:35:39;CHRG;0x0402;Netzspg. zu klein 0631960539; 10.01.1990 09:35:39;CHRG;0x0607;UBatt Min 0631960539; 10.01.1990 09:35:39;CHRG;0x0616;Uzwk Min 0631963784; 10.01.1990 10:29:44;Intern;0x2207;Batt-SN ungueltig 0631963787; 10.01.1990 10:29:47;CHRG;0x0400;Netzfrequenz zu klein 0631963787; 10.01.1990 10:29:47;CHRG;0x0402;Netzspg. zu klein 0631963905; 10.01.1990 10:31:45;Intern;0x2207;Batt-SN ungueltig 0631963909; 10.01.1990 10:31:49;CHRG;0x0400;Netzfrequenz zu klein;631963907: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 18 BF1: 00041213 BF2: 00100140 Ph: 0 LN: 0 Fan: 342 0631963909; 10.01.1990 10:31:49;CHRG;0x0402;Netzspg. zu klein;631963909: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 911 BF1: 00041813 BF2: 00100140 Ph: 0 LN: 0 Fan: 342 0631964140; 10.01.1990 10:35:40;Intern;0x2207;Batt-SN ungueltig 0631964182; 10.01.1990 10:36:22;WR;0x0515;Spg. N->PE Fehler;631964180: ENS::St: 30 SF: 0000 BF: 0006 BAS: 0000 BAI: 0000 WR::St: 900 SF: 0020 BF: 0006 BAS: 0000 BAI: 0000 EMS::St: 31 BF1: 00045213 BF2: 00000140 Ph: 0 LN: 0 Fan: 374 0631966026; 10.01.1990 11:07:06;Intern;0x2207;Batt-SN ungueltig 0631966029; 10.01.1990 11:07:09;CHRG;0x0402;Netzspg. zu klein;631966027: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 18 BF1: 00041213 BF2: 00100140 Ph: 0 LN: 0 Fan: 372 0631966029; 10.01.1990 11:07:09;CHRG;0x0403;Netzspg. zu gross 0631966029; 10.01.1990 11:07:09;CHRG;0x061D;Dry-Kontakt;631966029: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 911 BF1: 00041813 BF2: 00100140 Ph: 0 LN: 0 Fan: 372
simple code to read it
Note: All this does is read one record at a time into a list named row, and wait for you to press Enter before reading the next record,
but it reads reliably, so you can modify it for your own purposes. Since each record is in a list, you can access individual parts with row[index], index being an integer starting with 0 for the first field; or, to read all fields one by one, just use for field in row:
import csv
import os


class ReadFiles:
    def __init__(self):
        # assure in proper directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        # you can add your file list code here
        list_raw_files = ['130000054.csv']
        for filename in list_raw_files:
            self.read_file_data(filename)

    def read_file_data(self, filename):
        with open(filename) as fp:
            crdr = csv.reader(fp, delimiter=';')
            for row in crdr:
                print(row)
                input()

if __name__ == '__main__':
    ReadFiles()
results of first few records:
Output:
['0631960537', ' 10.01.1990 09:35:37', 'Intern', '0x2207', 'Batt-SN ungueltig'] ['0631960539', ' 10.01.1990 09:35:39', 'CHRG', '0x0400', 'Netzfrequenz zu klein'] ['0631960539', ' 10.01.1990 09:35:39', 'CHRG', '0x0402', 'Netzspg. zu klein'] ['0631960539', ' 10.01.1990 09:35:39', 'CHRG', '0x0607', 'UBatt Min'] ['0631960539', ' 10.01.1990 09:35:39', 'CHRG', '0x0616', 'Uzwk Min'] ['0631963784', ' 10.01.1990 10:29:44', 'Intern', '0x2207', 'Batt-SN ungueltig'] ['0631963787', ' 10.01.1990 10:29:47', 'CHRG', '0x0400', 'Netzfrequenz zu klein'] ['0631963787', ' 10.01.1990 10:29:47', 'CHRG', '0x0402', 'Netzspg. zu klein'] ['0631963905', ' 10.01.1990 10:31:45', 'Intern', '0x2207', 'Batt-SN ungueltig'] ['0631963909', ' 10.01.1990 10:31:49', 'CHRG', '0x0400', 'Netzfrequenz zu klein', '631963907: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 18 BF1: 00041213 BF2: 00100140 Ph: 0 LN: 0 Fan: 342'] ['0631963909', ' 10.01.1990 10:31:49', 'CHRG', '0x0402', 'Netzspg. zu klein', '631963909: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 911 BF1: 00041813 BF2: 00100140 Ph: 0 LN: 0 Fan: 342'] ['0631964140', ' 10.01.1990 10:35:40', 'Intern', '0x2207', 'Batt-SN ungueltig']
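If you then want these ragged rows back in pandas, one option is to pad every record to the widest row before building the DataFrame. A sketch on the same hypothetical two-line sample (not the author's code, just one way to bridge csv.reader and pandas):

```python
import csv
import io
import pandas as pd

# Hypothetical sample shaped like the real data
sample = (
    "0631960537; 10.01.1990 09:35:37;Intern;0x2207;Batt-SN ungueltig\n"
    "0631963909; 10.01.1990 10:31:49;CHRG;0x0400;Netzfrequenz zu klein;631963907: ENS...\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter=';'))
width = max(len(r) for r in rows)

# Pad short rows with None so every record has the same number of fields
padded = [r + [None] * (width - len(r)) for r in rows]
df = pd.DataFrame(padded)
print(df.shape)  # (2, 6)
```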
#8
(Nov-07-2019, 11:59 PM)Larz60+ Wrote: Here's a sample of what the actual data looks like, and some code that will read it:

Hi Larz60+,

Thanks for taking the time to help me. I'll try this option and give you feedback.
#9
It looks like each field of the row should be stripped of whitespace, and it also looks like further parsing is required on certain records, for example the 6th (index 5) element of row 10.
The element that requires further parsing seems to appear when the number of elements in a row exceeds 5 (index > 4).
Output:
['0631963909', ' 10.01.1990 10:31:49', 'CHRG', '0x0400', 'Netzfrequenz zu klein', '631963907: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 18 BF1: 00041213 BF2: 00100140 Ph: 0 LN: 0 Fan: 342']
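The stripping and the follow-up parse could be sketched like this; splitting the diagnostic field at its first ':' is an assumption based on the sample above, not a documented format:

```python
# Row shaped like the sample output above
row = ['0631963909', ' 10.01.1990 10:31:49', 'CHRG', '0x0400',
       'Netzfrequenz zu klein', '631963907: ENS::St: 30 SF: 0000']

# Strip stray whitespace from every field
row = [field.strip() for field in row]

# Fields beyond index 4 carry extra diagnostics; split off the leading
# counter before the first ':' (assumed format)
if len(row) > 5:
    counter, _, detail = row[5].partition(':')
    print(counter, detail.strip())  # 631963907 ENS::St: 30 SF: 0000
```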
#10
(Nov-08-2019, 06:39 PM)Larz60+ Wrote: It looks like each field of the row should be stripped of white space, and it also looks like further parsing is required on certain records, for example 6th (index 5) element of row 10

Exactly! I noticed that and tried to drop those rows by checking whether the content of the first column is not in datetime format!
