 read_csv error and rows/columns missing
#1
I thought I had everything right until I ended up with a different file to read. I've been struggling with it for two days. Help, please.
link to the data: file.zip

Thanks for your time

from os import listdir
from os.path import isfile, join

import pandas as pd

# Location of all files
file_folder = 'path_on_your_computer'

# Save the files into a list (when more than 2)
list_raw_files = [f for f in listdir(file_folder) if isfile(join(file_folder, f))]

# Load the right/given file
for raw_file in list_raw_files:

    # Check the file
    if raw_file.startswith('130'):

        temp_list = []

        for chunk in pd.read_csv(join(file_folder, raw_file), sep=';', header=None,
                                 chunksize=20000, error_bad_lines=False,
                                 low_memory=False):
            temp_list.append(chunk)

        data = pd.concat(temp_list, axis=0)

        del temp_list

data.head(30)
I have this Error/Warning:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

and the data is not complete!
link to my result: output
#2
  • the zip file is corrupt
    to get a list of files in a directory, use:
    from pathlib import Path
    from tkinter import filedialog as fd
    
    def get_dir_file_list():
        fdir = Path(fd.askdirectory())
        fl = [fn for fn in fdir.iterdir() if fn.is_file()]
        return fl
    
    if __name__ == '__main__':
        filelist = get_dir_file_list()
        for filename in filelist:
            print(f"filename: {filename.name}")
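Since karlito only wants files whose names start with '130', `Path.glob` can do that filtering directly. A minimal sketch, using a throwaway temp directory with dummy files so it runs anywhere (swap in your real folder):

```python
from pathlib import Path
import tempfile

# Demo folder with a couple of dummy files (replace with your real path)
file_folder = Path(tempfile.mkdtemp())
(file_folder / '130000054.csv').touch()
(file_folder / 'other.csv').touch()

# glob keeps only files whose names start with '130'
list_raw_files = [p for p in file_folder.glob('130*') if p.is_file()]
print([p.name for p in list_raw_files])  # ['130000054.csv']
```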
    
#3
(Nov-05-2019, 03:28 PM)Larz60+ Wrote:
  • zip file is corrupt

Thanks for your response, and sorry for the late answer,
but getting a list of files in a directory is not my issue here; loading the file is. I already have the file, but reading it is the problem (if you see the pic)... the zip file is corrupted? How? I can unzip it and see the csv file.
#4
Quote: the zip file is corrupted? How? I can unzip it and see the csv file
I cannot
#5
(Nov-05-2019, 10:04 AM)karlito Wrote: I thought I had everything right until I ended up with a different file to read. I've been struggling with it for two days. Help, please. [...]

Try starting the server with jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10, or try adding time.sleep(1) inside the for loop.
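For reference, the flag goes on the server launch, and the same setting can be made persistent through the notebook config file. A sketch using the standard Jupyter config location (the 1.0e10 value is just an example limit):

```shell
# One-off: raise the output rate limit for this session only
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

# Persistent: generate a config file (if you don't have one yet),
# then set the limit in ~/.jupyter/jupyter_notebook_config.py
jupyter notebook --generate-config
echo "c.NotebookApp.iopub_data_rate_limit = 1.0e10" >> ~/.jupyter/jupyter_notebook_config.py
```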
#6
Got a good file.zip this time.

I was able to get this to run (with errors) by reducing the chunk size and making a few other adjustments.
I have 32 GB of memory, and a chunk size of 20,000 blew up.
If you can't get the code below to run (only one file), reduce the chunk size until it does.
And don't forget: you're appending the entire file into a temporary list, so if you don't have enough memory, this will either blow up or be stuck paging memory for a long time.
import os
import pandas as pd

# make sure we're in the proper directory
os.chdir(os.path.abspath(os.path.dirname(__file__)))

# for my test
list_raw_files = ['130000054.csv']

# Location of all files
# file_folder = 'path_on_your_computer'

# # Save the files into a list (when more than 2)
# list_raw_files = [f for f in listdir(file_folder) if isfile(join(file_folder, f))]

# Load the right/given file
for raw_file in list_raw_files:

    # Check the file
    if raw_file.startswith('130'):

        temp_list = []
        file_folder = './'
        for chunk in pd.read_csv(file_folder + raw_file, sep=';', header=None,
                                 chunksize=10000, error_bad_lines=False,
                                 low_memory=False):
            temp_list.append(chunk)

        data = pd.concat(temp_list, axis=0)

        del temp_list

data.head(30)
partial error list:
Output:
b'Skipping line 10: expected 5 fields, saw 6\nSkipping line 11: expected 5 fields, saw 6\nSkipping line 13: expected 5 fields, saw 6\nSkipping line 15: expected 5 fields, saw 6\nSkipping line 17: expected 5 fields, saw 6\nSkipping line 23: expected 5 fields, saw 6\nSkipping line 24: expected 5 fields, saw 6\nSkipping line 25: expected 5 fields, saw 6\nSkipping line 27: expected 5 fields, saw 96\nSkipping line 28: expected 5 fields, saw 96\nSkipping line 29: expected 5 fields, saw 96\nSkipping line 30: expected 5 fields, saw 96\nSkipping line 31: expected 5 fields, saw 96\nSkipping line 32: expected 5 fields, saw 96\nSkipping line 33: expected 5 fields, saw 96\nSkipping line 34: expected 5 fields, saw 6\nSkipping line 35: expected 5 fields, saw 96\nSkipping line 36: expected 5 fields, saw 96\nSkipping line 37: expected 5 fields, saw 6\nSkipping line 38: expected 5 fields, saw 96\nSkipping line 39: expected 5 fields, saw 96\nSkipping line 40: expected 5 fields, saw 96\nSkipping line 41: expected 5 fields, saw 6\nSkipping line 42: expected 5 fields, saw 96\nSkipping line 43: expected 5
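The "expected 5 fields, saw 6" skips happen because pandas infers the width from the first rows. Instead of dropping the wider rows, you can name six columns up front so the 6-field rows are kept and shorter rows are padded with NaN. A sketch on a hypothetical two-line sample shaped like the real data (the rows that spill to 96 fields would still need separate handling, e.g. the csv-module approach below):

```python
import io
import pandas as pd

# Hypothetical sample: most rows have 5 fields, some carry a 6th diagnostic field
sample = (
    "0631960537; 10.01.1990 09:35:37;Intern;0x2207;Batt-SN ungueltig\n"
    "0631963909; 10.01.1990 10:31:49;CHRG;0x0400;Netzfrequenz zu klein;631963907: ENS...\n"
)

# Declaring 6 column names keeps the 6-field rows instead of skipping them;
# the 5-field rows get NaN in the last column
df = pd.read_csv(io.StringIO(sample), sep=';', header=None, names=list(range(6)))
print(df.shape)  # (2, 6)
```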
#7
Here's a sample of what the actual data looks like, and some code that will read it:
raw data sample:
Output:
0631960537; 10.01.1990 09:35:37;Intern;0x2207;Batt-SN ungueltig 0631960539; 10.01.1990 09:35:39;CHRG;0x0400;Netzfrequenz zu klein 0631960539; 10.01.1990 09:35:39;CHRG;0x0402;Netzspg. zu klein 0631960539; 10.01.1990 09:35:39;CHRG;0x0607;UBatt Min 0631960539; 10.01.1990 09:35:39;CHRG;0x0616;Uzwk Min 0631963784; 10.01.1990 10:29:44;Intern;0x2207;Batt-SN ungueltig 0631963787; 10.01.1990 10:29:47;CHRG;0x0400;Netzfrequenz zu klein 0631963787; 10.01.1990 10:29:47;CHRG;0x0402;Netzspg. zu klein 0631963905; 10.01.1990 10:31:45;Intern;0x2207;Batt-SN ungueltig 0631963909; 10.01.1990 10:31:49;CHRG;0x0400;Netzfrequenz zu klein;631963907: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 18 BF1: 00041213 BF2: 00100140 Ph: 0 LN: 0 Fan: 342 0631963909; 10.01.1990 10:31:49;CHRG;0x0402;Netzspg. zu klein;631963909: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 911 BF1: 00041813 BF2: 00100140 Ph: 0 LN: 0 Fan: 342 0631964140; 10.01.1990 10:35:40;Intern;0x2207;Batt-SN ungueltig 0631964182; 10.01.1990 10:36:22;WR;0x0515;Spg. N->PE Fehler;631964180: ENS::St: 30 SF: 0000 BF: 0006 BAS: 0000 BAI: 0000 WR::St: 900 SF: 0020 BF: 0006 BAS: 0000 BAI: 0000 EMS::St: 31 BF1: 00045213 BF2: 00000140 Ph: 0 LN: 0 Fan: 374 0631966026; 10.01.1990 11:07:06;Intern;0x2207;Batt-SN ungueltig 0631966029; 10.01.1990 11:07:09;CHRG;0x0402;Netzspg. zu klein;631966027: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 18 BF1: 00041213 BF2: 00100140 Ph: 0 LN: 0 Fan: 372 0631966029; 10.01.1990 11:07:09;CHRG;0x0403;Netzspg. zu gross 0631966029; 10.01.1990 11:07:09;CHRG;0x061D;Dry-Kontakt;631966029: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 911 BF1: 00041813 BF2: 00100140 Ph: 0 LN: 0 Fan: 372
simple code to read it
Note: All this does is read one record at a time into a list named row, and wait for you to press Enter before reading the next record,
but it reads reliably, so you can modify it for your own purposes. Since each record is in a list, you can access individual parts with row[index], index being an integer starting with 0 for the first field; or, to read all fields one by one, just use for field in row:
import csv
import os


class ReadFiles:
    def __init__(self):
        # assure in proper directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        # you can add your file list code here
        list_raw_files = ['130000054.csv']
        for filename in list_raw_files:
            self.read_file_data(filename)

    def read_file_data(self, filename):
        with open(filename) as fp:
            crdr = csv.reader(fp, delimiter=';')
            for row in crdr:
                print(row)
                input()

if __name__ == '__main__':
    ReadFiles()
results of first few records:
Output:
['0631960537', ' 10.01.1990 09:35:37', 'Intern', '0x2207', 'Batt-SN ungueltig'] ['0631960539', ' 10.01.1990 09:35:39', 'CHRG', '0x0400', 'Netzfrequenz zu klein'] ['0631960539', ' 10.01.1990 09:35:39', 'CHRG', '0x0402', 'Netzspg. zu klein'] ['0631960539', ' 10.01.1990 09:35:39', 'CHRG', '0x0607', 'UBatt Min'] ['0631960539', ' 10.01.1990 09:35:39', 'CHRG', '0x0616', 'Uzwk Min'] ['0631963784', ' 10.01.1990 10:29:44', 'Intern', '0x2207', 'Batt-SN ungueltig'] ['0631963787', ' 10.01.1990 10:29:47', 'CHRG', '0x0400', 'Netzfrequenz zu klein'] ['0631963787', ' 10.01.1990 10:29:47', 'CHRG', '0x0402', 'Netzspg. zu klein'] ['0631963905', ' 10.01.1990 10:31:45', 'Intern', '0x2207', 'Batt-SN ungueltig'] ['0631963909', ' 10.01.1990 10:31:49', 'CHRG', '0x0400', 'Netzfrequenz zu klein', '631963907: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 18 BF1: 00041213 BF2: 00100140 Ph: 0 LN: 0 Fan: 342'] ['0631963909', ' 10.01.1990 10:31:49', 'CHRG', '0x0402', 'Netzspg. zu klein', '631963909: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 911 BF1: 00041813 BF2: 00100140 Ph: 0 LN: 0 Fan: 342'] ['0631964140', ' 10.01.1990 10:35:40', 'Intern', '0x2207', 'Batt-SN ungueltig']
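If you then want these ragged rows back in pandas, one option is to pad every record to the widest row before building the DataFrame. A sketch on the same hypothetical two-line sample (not the author's code, just one way to bridge csv.reader and pandas):

```python
import csv
import io
import pandas as pd

# Hypothetical sample shaped like the real data
sample = (
    "0631960537; 10.01.1990 09:35:37;Intern;0x2207;Batt-SN ungueltig\n"
    "0631963909; 10.01.1990 10:31:49;CHRG;0x0400;Netzfrequenz zu klein;631963907: ENS...\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter=';'))
width = max(len(r) for r in rows)

# Pad short rows with None so every record has the same number of fields
padded = [r + [None] * (width - len(r)) for r in rows]
df = pd.DataFrame(padded)
print(df.shape)  # (2, 6)
```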
#8
(Nov-07-2019, 11:59 PM)Larz60+ Wrote: Here's a sample of what the actual data looks like, and some code that will read it:

Hi Larz60+,

Thanks for taking the time to help me. I'll try this option and give you feedback.
#9
It looks like each field of the row should be stripped of whitespace, and it also looks like further parsing is required on certain records, for example the 6th (index 5) element of row 10.
The element that requires further parsing seems to appear when the number of elements in a row exceeds 5 (index > 4).
Output:
['0631963909', ' 10.01.1990 10:31:49', 'CHRG', '0x0400', 'Netzfrequenz zu klein', '631963907: ENS::St: 30 SF: 0000 BF: 000a BAS: 0000 BAI: 0000 WR::St: 1 SF: 0000 BF: 0005 BAS: 0000 BAI: 0000 EMS::St: 18 BF1: 00041213 BF2: 00100140 Ph: 0 LN: 0 Fan: 342']
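The stripping and the follow-up parse could be sketched like this; splitting the diagnostic field at its first ':' is an assumption based on the sample above, not a documented format:

```python
# Row shaped like the sample output above
row = ['0631963909', ' 10.01.1990 10:31:49', 'CHRG', '0x0400',
       'Netzfrequenz zu klein', '631963907: ENS::St: 30 SF: 0000']

# Strip stray whitespace from every field
row = [field.strip() for field in row]

# Fields beyond index 4 carry extra diagnostics; split off the leading
# counter before the first ':' (assumed format)
if len(row) > 5:
    counter, _, detail = row[5].partition(':')
    print(counter, detail.strip())  # 631963907 ENS::St: 30 SF: 0000
```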
#10
(Nov-08-2019, 06:39 PM)Larz60+ Wrote: It looks like each field of the row should be stripped of white space, and it also looks like further parsing is required on certain records, for example 6th (index 5) element of row 10

Exactly! I noticed that and tried to drop those rows by checking whether the content of the first column is not in datetime format!
