Posts: 9
Threads: 3
Joined: Dec 2018
Dec-10-2018, 11:38 AM
(This post was last modified: Dec-10-2018, 12:38 PM by Gribouillis.)
Hello,
I am creating a CSV file on Unix with the help of the sed command.
The file is created correctly, as below:
Output: Column_1 Number(10,1)
Column_1 Decimal(8,2)
Column_1 Varchar
I am trying to read this file row by row in Python, as below:

with open(datafile) as csvfile:
    reader = csv.reader(csvfile, dialect='excel')
    row = next(reader)
    for row in reader:
        sql = sql + (" ".join(row)) + (",") + "\n"
sql = sql[:-2]

When I try to put this row into the sql file, the comma in the datatype is missing, like below:
Output: Column_1 Number(10 1)
Column_1 Decimal(8 2)
Is there any reason for this?
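For reference, the behaviour can be reproduced with a made-up one-line sample: the excel dialect splits on every unquoted comma, so the comma inside the type is already gone before the fields are joined back with spaces. A minimal sketch with hypothetical data:

import csv
from io import StringIO

# Hypothetical one-line sample similar to the rows above
sample = StringIO("Column_2,Number(10,2)\n")
for row in csv.reader(sample, dialect='excel'):
    print(row)             # ['Column_2', 'Number(10', '2)']
    print(" ".join(row))   # Column_2 Number(10 2)  <- the comma is lost here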
Posts: 4,787
Threads: 76
Joined: Jan 2018
Dec-10-2018, 12:46 PM
(This post was last modified: Dec-10-2018, 12:46 PM by Gribouillis.)
Line 4 should be indented at the same level as line 3 ('for' below 'row'). Configure your editor to insert 4 spaces when you hit the tab key. Python code is best indented with 4 spaces.
Can you print the rows read by "for row in reader" so that we can see what Python reads, or post a csv file?
Posts: 9
Threads: 3
Joined: Dec 2018
Below is the code.
I tried with print(row) and my observations are as below:
1. When the file is created on Windows and saved as csv, the output is as expected, i.e.
[Column_1 Varchar(20)]
[Column_2 Number(10,2)]
[Column_3 Decimal(4,1)]
and it reads the rows correctly.
2. When the same file is taken to Unix, it is read as below:
['Column_1', 'Varchar(20)']
['Column_2', 'Number(10', '2)']
['Column_3', 'Decimal(4', '1)']
The problem seems to be with the comma: it is treated as a field separator, so the datatype is split into separate strings.
import csv, os, sys

if len(sys.argv) < 2:
    print("\nUsage: csv2tbl.py path/datafile.csv (0,1,2,3 = column name format):")
    print("\nFormat: 0 = TitleCasedWords")
    print("        1 = Titlecased_Words_Underscored")
    print("        2 = lowercase_words_underscored")
    print("        3 = Words_underscored_only (leave case as in source)")
    sys.exit()
else:
    if len(sys.argv) == 3:
        dummy, schemaname, datafile = sys.argv
        namefmt = '0'
    else:
        dummy, datafile, namefmt = sys.argv

#outfile = os.path.basename(datafile)
filename = os.path.basename(datafile).split('.')[0]
outfile = os.path.dirname(datafile) + filename + '.sql'
tblname = schemaname + '.' + filename

# Partition columns appended to the CREATE TABLE statement
partition_param_1 = 'ingestion_year int '
partition_param_2 = 'ingestion_month int'
partition_param_3 = 'ingestion_day int'
partition_string = partition_param_1 + ',' + partition_param_2 + ',' + partition_param_3

row_format = 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
stored_as = ''
output_format = ''
location = '/HADOOP/RAW/' + schemaname + '/' + tblname + '/GOOD'
table_properties = '/HADOOP/RAW/' + schemaname + '/' + tblname + '/GOOD'

sql = 'CREATE EXTERNAL TABLE %s\n(' % (tblname)

# Create list of column [names],[widths]
with open(datafile) as csvfile:
    reader = csv.reader(csvfile, dialect='excel')
    row = next(reader)
    for row in reader:
        print(row)
        sql = sql + (" ".join(row)) + (",") + "\n"
sql = sql[:-2]

sql = sql + ') \n Partition By (' + partition_string + ')'
sql = sql + ' \n ROW FORMAT SERDE (' + row_format + ')'
sql = sql + ' \n STORED AS (' + stored_as + ')'
sql = sql + ' \n OUTPUT FORMAT (' + output_format + ')'
sql = sql + ' \n LOCATION (' + location + ')'
sql = sql + ' \n TABLE PROPERTIES (' + table_properties + ')'

with open(outfile, 'w') as sqlfile:
    sqlfile.write(sql)

print('%s created.' % (outfile))
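For what it's worth, a minimal workaround sketch (made-up sample data, assuming each row is just a column name followed by its type): rejoin everything after the first field with commas before building the SQL line, so the over-split type gets its comma back.

import csv
from io import StringIO

# Hypothetical sample matching the rows shown above
sample = StringIO("Column_1,Varchar(20)\nColumn_2,Number(10,2)\nColumn_3,Decimal(4,1)\n")

lines = []
for row in csv.reader(sample, dialect='excel'):
    # ['Column_2', 'Number(10', '2)'] -> 'Column_2 Number(10,2)'
    lines.append(row[0] + " " + ",".join(row[1:]))

print(",\n".join(lines))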
Posts: 4,787
Threads: 76
Joined: Jan 2018
You may need to adapt the csv reader's dialect. The question is what the csv file contains exactly. Can you post a few lines of this file?
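If it helps, one way to see what the reader is dealing with is to let csv.Sniffer guess the dialect from a sample of the file. This is only a sketch (it assumes the file is at least roughly CSV-shaped and may raise csv.Error if it cannot decide):

import csv

# datafile as in the script above
with open(datafile) as csvfile:
    sample = csvfile.read(4096)
    dialect = csv.Sniffer().sniff(sample)
    print('delimiter:', repr(dialect.delimiter), 'quotechar:', repr(dialect.quotechar))
    csvfile.seek(0)
    for row in csv.reader(csvfile, dialect):
        print(row)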
Posts: 536
Threads: 0
Joined: Feb 2018
Dec-10-2018, 04:34 PM
(This post was last modified: Dec-10-2018, 04:36 PM by woooee.)
Also print the sql variable; it looks like you are missing one closing paren. Note also that sqlfile is not an SQL file. Can you post a sample csv
(it is not necessary to use actual data) and say what you want to do, i.e. do you want to create an SQL table, and if so, which SQL engine are you using?
Posts: 9
Threads: 3
Joined: Dec 2018
The file has the contents below:
Column_1 Varchar(20)
Column_2 Number(10,2)
Column_3 Decimal(4,1)
Posts: 4,787
Threads: 76
Joined: Jan 2018
If the file is tab-separated, you can try dialect=csv.excel_tab in the call to csv.reader(). If the values are merely separated by space characters, you can define your own dialect:

class MyDialect(csv.Dialect):
    delimiter = ' '
    doublequote = True
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
    skipinitialspace = False

Then use dialect=MyDialect in csv.reader() and adjust the parameters until it reads ['Column_2', 'Number(10,2)'].
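For example, on a hypothetical space-separated line such a dialect keeps the comma inside the type, because only the space is treated as a delimiter:

import csv
from io import StringIO

class MyDialect(csv.Dialect):
    delimiter = ' '
    doublequote = True
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
    skipinitialspace = False

# Made-up space-separated sample line
sample = StringIO('Column_2 Number(10,2)\r\n')
print(next(csv.reader(sample, dialect=MyDialect)))   # ['Column_2', 'Number(10,2)']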
Posts: 9
Threads: 3
Joined: Dec 2018
Thank you for your reply.
I am creating the CSV file on Unix with the help of a describe command (Hive DESCRIBE).
Unfortunately I can't paste the exact file.
But when I open the same file in Excel, it is comma-separated, with the complete row in one line of Excel.
Posts: 4,787
Threads: 76
Joined: Jan 2018
Dec-11-2018, 06:35 AM
(This post was last modified: Dec-11-2018, 06:35 AM by Gribouillis.)
nnsatpute Wrote: I can't paste exact file.
You can do this:

with open(datafile) as csvfile:
    s = csvfile.read(4096)
    print(repr(s))

Then paste the output in the forum.

nnsatpute Wrote: But when i am opening same file in Excel it is comma separated
If the file is comma-separated, there is a problem with Number(10,2), because the inner comma must not be interpreted as a separating comma. Normally csv files avoid this by quoting "Number(10,2)". You need to know which quoting strategy is used and declare it in the dialect.
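To illustrate the quoting point with made-up data: when the type field is quoted, csv.writer and csv.reader round-trip the inner comma untouched.

import csv
from io import StringIO

buf = StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(['Column_2', 'Number(10,2)'])
print(repr(buf.getvalue()))       # '"Column_2","Number(10,2)"\r\n'

buf.seek(0)
print(next(csv.reader(buf)))      # ['Column_2', 'Number(10,2)']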
Posts: 9
Threads: 3
Joined: Dec 2018
with open(datafile) as csvfile:
    s = csvfile.read(4096)
    print(repr(s))

The output of the above code looks like:

column1,varchar\n column2,decimal(10,2)\n column3,number(4,3)
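So the file is comma-separated, but the comma inside decimal(10,2) is not quoted, which means the csv module cannot tell it apart from the field separator. One possible workaround (just a sketch, bypassing the csv module) is to split each line on the first comma only:

# Sketch: split on the first comma only, so 'column2,decimal(10,2)'
# becomes ['column2', 'decimal(10,2)'] and the inner comma survives
with open(datafile) as csvfile:
    for line in csvfile:
        if not line.strip():
            continue
        name, coltype = line.strip().split(',', 1)
        print(name, coltype)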