Dec-10-2018, 01:08 PM
Below is the code.
I tried with print row and observations are as below
1. When file is created on windows and save as csv output is as expected i.e
[Column_1 Varchar(20)]
[Column_2 Number(10,2)]
[Column_3 Decimal(4,1)]
and it reads the row correctly.
2. When same file is taken to Unix it is read as below
['Column_1', 'Varchar(20)']
['Column_2', 'Number(10', '2)']
['Column_3', 'Decimal(4', '1)']
There seems the problem for , (Comma )it is treating it as separate string.
I tried with print row and observations are as below
1. When file is created on windows and save as csv output is as expected i.e
[Column_1 Varchar(20)]
[Column_2 Number(10,2)]
[Column_3 Decimal(4,1)]
and it reads the row correctly.
2. When same file is taken to Unix it is read as below
['Column_1', 'Varchar(20)']
['Column_2', 'Number(10', '2)']
['Column_3', 'Decimal(4', '1)']
There seems the problem for , (Comma )it is treating it as separate string.
import csv,os,sys if len(sys.argv)<2: print ("\nUsage: csv2tbl.py path/datafile.csv (0,1,2,3 = column name format):") print ("\nFormat: 0 = TitleCasedWords") print (" 1 = Titlecased_Words_Underscored") print (" 2 = lowercase_words_underscored") print (" 3 = Words_underscored_only (leave case as in source)") sys.exit() else: if len(sys.argv)==3: dummy,schemaname, datafile, = sys.argv namefmt = '0' else: dummy, datafile, namefmt = sys.argv #outfile = os.path.basename(datafile) filename = os.path.basename(datafile).split('.')[0] outfile = os.path.dirname(datafile) + filename + '.sql' tblname = schemaname + '.' + filename partition_param_1 = 'ingestion_year int ' partition_param_2 = 'ingestion_month int' partition_param_3 = 'ingestion_day int' partition_string = partition_param_1 + ',' + partition_param_2 + ',' + partition_param_3 row_format='org.apache.hadoop.hive.serde2.avro.AveroSerDe' stored_as='' output_format='' location='/HADOOP/RAW/' + schemaname + '/' + tblname + '/GOOD' table_properties= '/HADOOP/RAW/' + schemaname + '/' + tblname + '/GOOD' sql = 'CREATE EXTERNAL TABLE %s\n(' % (tblname) # Create list of column [names],[widths] with open (datafile) as csvfile: reader = csv.reader(csvfile,dialect='excel') row = next(reader) for row in reader: print(row) sql = sql + (" ".join(row)) + (",") + "\n" sql= sql[:-2] sql = sql + ') \n Partition By (' + partition_string +')' sql = sql + ' \n ROW FORMAT SERDE (' + row_format +')' sql = sql + ' \n STORED AS (' + stored_as +')' sql = sql + ' \n OUTPUT FORMAT (' + stored_as +')' sql = sql + ' \n LOCATION (' + location +')' sql = sql + ' \n TABLE PROPERTIES (' + table_properties +')' with open(outfile,'w') as sqlfile: sqlfile.write(sql) sqlfile.close print ('%s created.' % (outfile))