Python Forum

Full Version: Issue with reading CSV file
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2
Hello,

I am creating CSV file in unix with help of sed command.
File is getting created correctly as below:
Output:
Column_1 Nunber(10,1) Column_1 Decimal(8,2) Column_1 Varchar
I am trying to read this file row by row in python as below
with open (datafile) as csvfile:
        reader = csv.reader(csvfile,dialect='excel')
        row = next(reader)
            for row in reader:
            sql = sql + (" ".join(row)) +  (",") + "\n"
        
sql= sql[:-2]
when i am trying to put this row in sql file the comma in datatype is missing like below
Output:
Column_1 Nunber(10 1) Column_1 Decimal(8 2)
Is there any reason for this ?
Line 4 should be indented at the same level as line 3 ('for' below 'row'). Configure your editor to insert 4 spaces when you hit the tab key. Python code is best indented with 4 spaces.

Can you print the rows read by for row in reader so that we can see what python reads, or post a csv file?
Below is the code.
I tried with print row and observations are as below
1. When file is created on windows and save as csv output is as expected i.e
[Column_1 Varchar(20)]
[Column_2 Number(10,2)]
[Column_3 Decimal(4,1)]
and it reads the row correctly.

2. When same file is taken to Unix it is read as below
['Column_1', 'Varchar(20)']
['Column_2', 'Number(10', '2)']
['Column_3', 'Decimal(4', '1)']

There seems the problem for , (Comma )it is treating it as separate string.

import csv,os,sys

if len(sys.argv)<2:
         print ("\nUsage: csv2tbl.py path/datafile.csv (0,1,2,3 = column name format):")
         print ("\nFormat: 0 = TitleCasedWords")
         print ("        1 = Titlecased_Words_Underscored")
         print ("        2 = lowercase_words_underscored")
         print ("        3 = Words_underscored_only (leave case as in source)")
         sys.exit()
else:
         if len(sys.argv)==3:
                  dummy,schemaname, datafile, = sys.argv
                  namefmt = '0'
         else: dummy, datafile, namefmt = sys.argv


#outfile = os.path.basename(datafile)
filename = os.path.basename(datafile).split('.')[0]
outfile = os.path.dirname(datafile)  + filename + '.sql'

tblname = schemaname + '.' + filename


partition_param_1 = 'ingestion_year  int '
partition_param_2 = 'ingestion_month  int'
partition_param_3 = 'ingestion_day int'
partition_string = partition_param_1 + ',' + partition_param_2 + ',' + partition_param_3

row_format='org.apache.hadoop.hive.serde2.avro.AveroSerDe'
stored_as=''
output_format=''
location='/HADOOP/RAW/' + schemaname + '/' + tblname + '/GOOD'
table_properties= '/HADOOP/RAW/' + schemaname + '/' + tblname + '/GOOD'

    


sql = 'CREATE EXTERNAL TABLE %s\n(' % (tblname)
# Create list of column [names],[widths]
with open (datafile) as csvfile:
        reader = csv.reader(csvfile,dialect='excel')
        row = next(reader)
        for row in reader:
            print(row)
            sql = sql + (" ".join(row)) +  (",") + "\n"
        
sql= sql[:-2]

sql = sql + ') \n Partition By (' + partition_string +')'
sql = sql + ' \n ROW FORMAT SERDE (' + row_format +')'
sql = sql + ' \n STORED AS (' + stored_as +')'
sql = sql + ' \n OUTPUT FORMAT (' + stored_as +')'
sql = sql + ' \n LOCATION (' + location +')'
sql = sql + ' \n TABLE PROPERTIES (' + table_properties +')'


with  open(outfile,'w') as sqlfile:
    sqlfile.write(sql)

sqlfile.close

print ('%s created.' % (outfile))
You may need to adapt the csv reader's dialect. The question is what does the csv file contain exactly? Can you post a few lines of this file ?
And print the sql variable. It looks like you are missing one closing paren. Note also that sqlfile is not an SQL file. If you can post a sample csv
(not necessary to use actual data), and say what you want to do, i.e. do you want to create and SQL table, and if so, what SQL engine are you using.
The file has below contents,

Column_1 Varchar(20)
Column_2 Number(10,2)
Column_3 Decimal(4,1)
If the file is tab-separated, you can try dialect=csv.excel_tab in the call to csv.reader(). If the values are merely separated by space characters, you can define your own dialect
class MyDialect(csv.Dialect):
    delimiter = ' '
    doublequote = True
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = csv.MINIMAL
    skipinitialspace = False
Then use dialect=MyDialect in csv.reader() until it reads ['Column_2', 'Number(10, 2)']
Thank you For your reply.

I am creating the CSV file in unix with help of describer command (Hive Describe).
Unfortunately I can't paste exact file.
But when i am opening same file in Excel it is comma separated with complete row in one line of excel.
nnsatpute Wrote:I can't paste exact file.
You can do this
with open(datafile) as csvfile:
    s = csvfile.read(4096)
    print(repr(s))
Then paste the output in the forum.

nnsatpute Wrote:But when i am opening same file in Excel it is comma separated
If the file is comma separated, there is a problem with Number(10,2) because the comma must not be interpreted as a separating comma. Normally csv files avoid this by quoting "Number(10,2)". You need to know which quoting strategy is used and declare this in the dialect.
with open(datafile) as csvfile:
    s = csvfile.read(4096)
    print(repr(s))
The output of above code looks like
column1,varchar\n column2,decimal(10,2)\n column3,number(4,3)
Pages: 1 2