Data Cleaning with PyHive

PythonSpeaker · Nov-14-2019, 09:47 PM

Hey,

So i'm using this dataset: https://www.kaggle.com/gregorut/videogamesales which contains 16 000 rows of data.

I'm doing some data cleaning: removing all the rows that contain null values.

Thanks to df.isna().sum() I was able to know that there are 271 rows that contain null values for the "Year" column and 58 rows that contain null values for the "Publisher" column.

I don't understand why I'm not getting 16 000 something rows for these 2 columns after data cleaning... Instead I get 116 non null rows for the "Year" column and 23 non null rows for the "NA_Sales" column....

Apparently, we can't do data cleaning in Hive...

import pandas as pd

import numpy as np

df = pd.read_csv('vgsales.csv')

df.head(100)

df.isna().sum() #######Before Data Cleaning###########

df = df.dropna()

df.replace(',','_')

df.Year = df.Year.astype('int64')

df.tail(100)

df.to_csv(r'cleaned_vgsales.csv', sep = ',')

import matplotlib.pyplot as plt

import seaborn as sns

from pyhive import hive

conn = hive.connect("quickstart.cloudera", username="cloudera")

cursor = conn.cursor()

cursor.execute('DROP DATABASE IF EXISTS VG CASCADE') # if one was created before

cursor.execute('CREATE DATABASE VG')

cursor.execute("CREATE TABLE IF NOT EXISTS
VG.sale (\

RANK INT,\

NAME STRING,\

PLATFORM STRING,\

YEAR INT,\

GENRE STRING,\

PUBLISHER STRING,\

NA_SALES FLOAT,\

EU_SALES FLOAT,\

JP_SALES FLOAT,\

OTHER_SALES FLOAT,\

GLOBAL_SALES FLOAT)\

ROW FORMAT DELIMITED \

FIELDS TERMINATED BY ','")

cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/cleaned_vgsales.csv' OVERWRITE INTO TABLE
VG.sale")

cursor.execute('ALTER TABLE
VG.sale SET TBLPROPERTIES ("skip.header.line.count"="1")') #remove first null line

cursor.execute('SET hive.cli.print.header=true') #To print the name of the variables

cursor.execute('SELECT * FROM
VG.sale') ######AFTER DATA CLEANING########

query = cursor.fetchall()

rows = pd.DataFrame(query, columns=(['RANK','NAME','PLATFORM','YEAR','GENRE','PUBLISHER','NA_SALES','EU_SALES','JP_SALES','OTHER_SALES','GLOBAL_SALES']))

print(rows.count())

***micseydel*** · Nov-15-2019, 09:11 PM

Can you provide your dataset? Or one which has had any proprietary information removed, but which would allow us to reproduce this problem locally?

PythonSpeaker · Nov-16-2019, 10:30 AM

This is the original dataset: 'vgsales.csv'
The cleaned dataset: 'cleaned_vgsales.csv'

https://drive.google.com/drive/folders/1...sp=sharing

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	NPL function: data cleaning a variables' categories	Netherlands_Hi	1	2,117	Mar-11-2020, 08:20 PM Last Post: Larz60+

Data Cleaning with PyHive

User Panel Messages

Announcements