Python Forum
Data Cleaning with PyHive
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Data Cleaning with PyHive
#1
Hey,

So i'm using this dataset: https://www.kaggle.com/gregorut/videogamesales which contains 16 000 rows of data.

I'm doing some data cleaning: removing all the rows that contain null values.

Thanks to df.isna().sum() I was able to know that there are 271 rows that contain null values for the "Year" column and 58 rows that contain null values for the "Publisher" column.

I don't understand why I'm not getting 16 000 something rows for these 2 columns after data cleaning... Instead I get 116 non null rows for the "Year" column and 23 non null rows for the "NA_Sales" column....

Apparently, we can't do data cleaning in Hive...

import pandas as pd

import numpy as np

df = pd.read_csv('vgsales.csv')

df.head(100)

df.isna().sum() #######Before Data Cleaning###########

df = df.dropna()

df.replace(',','_')

df.Year = df.Year.astype('int64')

df.tail(100)

df.to_csv(r'cleaned_vgsales.csv', sep = ',')

import matplotlib.pyplot as plt

import seaborn as sns

from pyhive import hive

conn = hive.connect("quickstart.cloudera", username="cloudera")

cursor = conn.cursor()

cursor.execute('DROP DATABASE IF EXISTS VG CASCADE') # if one was created before

cursor.execute('CREATE DATABASE VG')

cursor.execute("CREATE TABLE IF NOT EXISTS
VG.sale (\

RANK INT,\

NAME STRING,\

PLATFORM STRING,\

YEAR INT,\

GENRE STRING,\

PUBLISHER STRING,\

NA_SALES FLOAT,\

EU_SALES FLOAT,\

JP_SALES FLOAT,\

OTHER_SALES FLOAT,\

GLOBAL_SALES FLOAT)\

ROW FORMAT DELIMITED \

FIELDS TERMINATED BY ','")

cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/cleaned_vgsales.csv' OVERWRITE INTO TABLE
VG.sale")

cursor.execute('ALTER TABLE
VG.sale SET TBLPROPERTIES ("skip.header.line.count"="1")') #remove first null line

cursor.execute('SET hive.cli.print.header=true') #To print the name of the variables

cursor.execute('SELECT * FROM
VG.sale') ######AFTER DATA CLEANING########

query = cursor.fetchall()

rows = pd.DataFrame(query, columns=(['RANK','NAME','PLATFORM','YEAR','GENRE','PUBLISHER','NA_SALES','EU_SALES','JP_SALES','OTHER_SALES','GLOBAL_SALES']))

print(rows.count())
Reply
#2
Can you provide your dataset? Or one which has had any proprietary information removed, but which would allow us to reproduce this problem locally?
Reply
#3
This is the original dataset: 'vgsales.csv'
The cleaned dataset: 'cleaned_vgsales.csv'

https://drive.google.com/drive/folders/1...sp=sharing
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  NPL function: data cleaning a variables' categories Netherlands_Hi 1 1,504 Mar-11-2020, 08:20 PM
Last Post: Larz60+

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020