Nov-14-2019, 09:47 PM
Hey,
So i'm using this dataset: https://www.kaggle.com/gregorut/videogamesales which contains 16 000 rows of data.
I'm doing some data cleaning: removing all the rows that contain null values.
Thanks to df.isna().sum() I was able to know that there are 271 rows that contain null values for the "Year" column and 58 rows that contain null values for the "Publisher" column.
I don't understand why I'm not getting 16 000 something rows for these 2 columns after data cleaning... Instead I get 116 non null rows for the "Year" column and 23 non null rows for the "NA_Sales" column....
Apparently, we can't do data cleaning in Hive...
So i'm using this dataset: https://www.kaggle.com/gregorut/videogamesales which contains 16 000 rows of data.
I'm doing some data cleaning: removing all the rows that contain null values.
Thanks to df.isna().sum() I was able to know that there are 271 rows that contain null values for the "Year" column and 58 rows that contain null values for the "Publisher" column.
I don't understand why I'm not getting 16 000 something rows for these 2 columns after data cleaning... Instead I get 116 non null rows for the "Year" column and 23 non null rows for the "NA_Sales" column....
Apparently, we can't do data cleaning in Hive...
import pandas as pd import numpy as np df = pd.read_csv('vgsales.csv') df.head(100) df.isna().sum() #######Before Data Cleaning########### df = df.dropna() df.replace(',','_') df.Year = df.Year.astype('int64') df.tail(100) df.to_csv(r'cleaned_vgsales.csv', sep = ',') import matplotlib.pyplot as plt import seaborn as sns from pyhive import hive conn = hive.connect("quickstart.cloudera", username="cloudera") cursor = conn.cursor() cursor.execute('DROP DATABASE IF EXISTS VG CASCADE') # if one was created before cursor.execute('CREATE DATABASE VG') cursor.execute("CREATE TABLE IF NOT EXISTS VG.sale (\ RANK INT,\ NAME STRING,\ PLATFORM STRING,\ YEAR INT,\ GENRE STRING,\ PUBLISHER STRING,\ NA_SALES FLOAT,\ EU_SALES FLOAT,\ JP_SALES FLOAT,\ OTHER_SALES FLOAT,\ GLOBAL_SALES FLOAT)\ ROW FORMAT DELIMITED \ FIELDS TERMINATED BY ','") cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/cleaned_vgsales.csv' OVERWRITE INTO TABLE VG.sale") cursor.execute('ALTER TABLE VG.sale SET TBLPROPERTIES ("skip.header.line.count"="1")') #remove first null line cursor.execute('SET hive.cli.print.header=true') #To print the name of the variables cursor.execute('SELECT * FROM VG.sale') ######AFTER DATA CLEANING######## query = cursor.fetchall() rows = pd.DataFrame(query, columns=(['RANK','NAME','PLATFORM','YEAR','GENRE','PUBLISHER','NA_SALES','EU_SALES','JP_SALES','OTHER_SALES','GLOBAL_SALES'])) print(rows.count())