Python Forum
Read CSV Files with multiple headers into Python DataFrame - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Read CSV Files with multiple headers into Python DataFrame (/thread-1734.html)

Pages: 1 2


Read CSV Files with multiple headers into Python DataFrame - UGuntupalli - Jan-23-2017

I am trying to learn Python and started with this task of trying to import specific csv files in a given folder into a Python Data Type and then further processing the data. I am struggling with the part where the data needs to be imported into Python . I need this to be efficient. I tried a couple of things and attempted a couple of approaches based on suggestions provided in the forums and other web pages - all of them resulting in one problem or other . If any one can help solve this , the help would be greatly appreciated.
Method 1: Read data line by line:
with open(FilePath,"r") as f: 
        for line in f: 
            Data = f.readline()
FileData = [x.strip() for x in Data] 
This yields an empty array

Method 2: Read data using csv reader
f = open(FilePath,'rt')
try: 
    reader = csv.reader(f)
    for row in reader:
        print(row)
finally: 
    f.close()
This yields an error - "line contains NULL byte"

Method 3: To fix the above error , I followed a suggestion provided in Stackoverflow leading me to the following output :
f = open(FilePath,'rb')
data = csv.reader((line.replace('\0','') for line in f), delimiter=",")
print(data)
Method 4: Reading data into data frame 
DF = pd.read_csv(FilePath, skiprows=3) 

This yields the following error - Error tokenizing data. C error: Expected 1 fields in line 13, saw 2 

Upon searching further to fix this error - I ended up here . data = pd.read_csv('file1.csv', error_bad_lines=False) This reads the data correctly however it reads one character per row thus making it really hard to further use the data . 



RE: Read CSV Files with multiple headers into Python DataFrame - buran - Jan-23-2017

csv file
Output:
name,quatity,price item1,100,10.35 item2,5,110.00 item3,10,22.50
print 'plain vanila reading file\n'
with open('myfile.csv', 'r') as f: 
       file_data = [row.strip().split(',') for row in f][1:] #[1:] will skip the header line
print file_data

print '\n\nusing csv.reader\n'
import csv
with open('myfile.csv', 'r') as f:
   reader = csv.reader(f, delimiter=',')
   file_data = [row for row in reader][1:] # [1:] will skip the header line
print file_data

print '\n\nusing csv.DictReader\n'
with open('myfile.csv', 'r') as f:
   reader = csv.DictReader(f, delimiter=',')
   file_data = [row for row in reader]
print file_data


print '\n\nusing pandas\n'
import pandas as pd
DF = pd.read_csv('myfile.csv') 
print DF
Output:
plain vanila reading file [['item1', '100', '10.35'], ['item2', '5', '110.00'], ['item3', '10', '22.50']] using csv.reader [['item1', '100', '10.35'], ['item2', '5', '110.00'], ['item3', '10', '22.50']] using csv.DictReader [{'price': '10.35', 'quatity': '100', 'name': 'item1'}, {'price': '110.00', 'quatity': '5', 'name': 'item2'}, {'price': '22.50', 'quatity': '10', 'name': 'item3'}] using pandas    name  quatity   price 0  item1      100   10.35 1  item2        5  110.00 2  item3       10   22.50
Note the differences. Also note that depending on what you are going to do and the size of the file it may be better not to read the entire file in the memory.


RE: Read CSV Files with multiple headers into Python DataFrame - UGuntupalli - Jan-23-2017

@ Buran , 
              I have already tried this method , both using the skip rows and without the skip rows flag without any useful result. 

Without skip rows flag : 
DF = pd.read_csv(FilePath) 

Using Skip Rows :
DF = pd.read_csv(FilePath,, skiprows = 3) 

               In both cases - I receive the following error : 
"Error Tokenizing data . C Error : Expected 1 fields in line 13 , saw 2" 
               Upon further reading on this error , I have seen suggestions to use the following approach : 
DF = pd.read_csv(FilePath, error_bad_lines = False) 
This yields a row for each character - which is not what is desired 

Best 
U


RE: Read CSV Files with multiple headers into Python DataFrame - buran - Jan-23-2017

unless you upload sample data file, I don't see how we can help more. I provided 4 working examples for reading csv files. obviously there is something special with your input file.


RE: Read CSV Files with multiple headers into Python DataFrame - UGuntupalli - Jan-23-2017

I dont see a way to attach a file . Is there a way ? Apologies I am new to the forum


RE: Read CSV Files with multiple headers into Python DataFrame - buran - Jan-23-2017

upload file in the cloud service and provide link


RE: Read CSV Files with multiple headers into Python DataFrame - buran - Jan-23-2017

Also if you click New Reply (i.e. full edit mode) at the bottom of the screen there is interface to add attachments


RE: Read CSV Files with multiple headers into Python DataFrame - UGuntupalli - Jan-24-2017

@ Buran :

https://dl.dropboxusercontent.com/u/32778128/Test.csv

Please let me know if you can access this .


RE: Read CSV Files with multiple headers into Python DataFrame - buran - Jan-24-2017

my code from previous post works just fine with this sample csv file:

print '\n\nusing pandas with row 3 as column names\n'
import pandas as pd
DF = pd.read_csv('myfile.csv', skiprows=2) 
print DF.head(n=10) # print first 10 rows
Output:
using pandas with row 3 as column names  Timestamp (yyyy-mm-dd HH:MM:SS)    A  Bong    C    D    E      F    G    H  \ 0             2015-01-01 00:00:SS  734   734  734  734  734  734.0  734  734   1             2015-01-01 01:00:SS  816   816  816  816  816  816.0  816  816   2             2015-01-01 02:00:SS  114   114  114  114  114  114.0  114  114   3             2015-01-01 03:00:SS  822   822  822  822  822  822.0  822  822   4             2015-01-01 04:00:SS  569   569  569  569  569  569.0  569  569   5             2015-01-01 05:00:SS   87    87   87   87   87   87.0   87   87   6             2015-01-01 06:00:SS  250   250  250  250  250  250.0  250  250   7             2015-01-01 07:00:SS  492   492  492  492  492  492.0  492  492   8             2015-01-01 08:00:SS  862   862  862  862  862  862.0  862  862   9             2015-01-01 09:00:SS  869   869  869  869  869  869.0  869  869       I    J    K   0  734  734  734   1  816  816  816   2  114  114  114   3  822  822  822   4  569  569  569   5   87   87   87   6  250  250  250   7  492  492  492   8  862  862  862   9  869  869  869
also without column names:
print '\n\nusing pandas without column names\n'
import pandas as pd
DF = pd.read_csv('myfile.csv', skiprows=3, header=None) 
print DF.head(n=10) # print first 10 rows
Output:
using pandas without column names                     0    1    2    3    4    5      6    7    8    9    10  \ 0  2015-01-01 00:00:SS  734  734  734  734  734  734.0  734  734  734  734    1  2015-01-01 01:00:SS  816  816  816  816  816  816.0  816  816  816  816    2  2015-01-01 02:00:SS  114  114  114  114  114  114.0  114  114  114  114    3  2015-01-01 03:00:SS  822  822  822  822  822  822.0  822  822  822  822    4  2015-01-01 04:00:SS  569  569  569  569  569  569.0  569  569  569  569    5  2015-01-01 05:00:SS   87   87   87   87   87   87.0   87   87   87   87    6  2015-01-01 06:00:SS  250  250  250  250  250  250.0  250  250  250  250    7  2015-01-01 07:00:SS  492  492  492  492  492  492.0  492  492  492  492    8  2015-01-01 08:00:SS  862  862  862  862  862  862.0  862  862  862  862    9  2015-01-01 09:00:SS  869  869  869  869  869  869.0  869  869  869  869        11   0  734   1  816   2  114   3  822   4  569   5   87   6  250   7  492   8  862   9  869



RE: Read CSV Files with multiple headers into Python DataFrame - UGuntupalli - Jan-24-2017

@ Buran,
             I got one question - that is the only thing I am doing differently . How is it that all sample snippets show the file input as : 

 DF = pd.read_csv('myfile.csv' , skiprows = 2) 

             I pass the path string of the file like so : 
  
       DF = pd.read_csv(FilePath,skiprows = 2) 
  

              If you were looping through a folder , how would you know the name of each file to pass as an input ? Is this a preferred way to do it over passing the path ? Also , I am using Python 3.5 

Best 
U