Dec-08-2021, 04:42 AM
The Python code I have written removes the intended number of rows from a data frame, but they are not the rows I wish to remove. I am using Python 3.9 on a Windows 10 64-bit OS. I have examined my code intensively and conducted extensive searches of Google and Stack Overflow with no success.
I have attached a copy of my Jupyter Notebook script in the form of screenshots and will reference the lines of code and their corresponding screenshot files throughout this post.
Here is the School Quality Reports dataset from which I am trying to remove the rows, containing 1,238 rows in total (In [4], screenshot_2):
https://data.cityofnewyork.us/Education/.../9cz6-8qpz
I used the following code (In [5], screenshot_3) to generate a subset containing all rows which have null values for the 'Quality Review Rating' column. This code outputs 292 rows with unique values for 'DBN', the column of interest at index position 0:
school_quality_2013_2014_nulls = school_quality_2013_2014[school_quality_2013_2014['Quality Review Rating'].isnull()].copy()
school_quality_2013_2014_nulls
My next step was to determine which of these 292 'DBN' values were located in the two primary datasets I am using - one describing Mathematics examination scores and the other describing English Language Arts (ELA) examination scores:
https://data.cityofnewyork.us/Education/.../gcvr-n8qw (In [2], screenshot_1)
https://data.cityofnewyork.us/Education/.../jk35-yh5p (In [3], screenshot_1)
This determination was made by 1) merging unique 'DBN' values from the two primary datasets into a single data frame and 2) merging this new data frame with the 292-row subset. I used the following code (In [6], screenshot_3):
math_unique_DBN = pd.DataFrame({'DBN': math_exam_2013_2015['DBN'].unique()})
ela_unique_DBN = pd.DataFrame({'DBN': ela_exam_2013_2015['DBN'].unique()})
merged_exam_DBN = pd.merge(math_unique_DBN, ela_unique_DBN, on=['DBN'], how='inner')
merged_exam_nulls = pd.merge(merged_exam_DBN, school_quality_2013_2014_nulls, on=['DBN'], how='inner')
merged_exam_nulls
The code above outputs 152 rows, indicating that 152 of the 292 'DBN' values in the null-rating subset are also contained in both primary datasets. This leaves 140 rows to be removed from the School Quality Reports data frame. To retrieve these 140 rows, I used the following code (In [7], screenshot_3):
school_quality_2013_2014_nulls = school_quality_2013_2014_nulls.reset_index(drop=True)
school_quality_2013_2014_nulls.drop(merged_exam_nulls.index, inplace=True)
school_quality_2013_2014_nulls
The last two rows of output are shown in screenshot_4 (In [8]).
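For reference, here is a minimal sketch of the In [7] pattern on made-up toy data. The DBN values, the short frames, and the names quality_nulls, exam_dbn, reset, remaining and intended are all hypothetical and are not taken from the real datasets. It compares what a drop by the merge's index labels keeps with what a match on 'DBN' values would keep, which may be where the mismatch arises:

```python
import pandas as pd

# Hypothetical toy data; these DBN values are made up for illustration.
quality_nulls = pd.DataFrame(
    {"DBN": ["01M019", "84K355", "02M001", "84X704"]},
    index=[1, 2, 3, 4],  # row labels as a boolean filter would leave them
)
exam_dbn = pd.DataFrame({"DBN": ["01M019", "02M001"]})

# The inner merge returns a fresh 0..n-1 index that is unrelated to the
# row labels of quality_nulls.
merged = pd.merge(exam_dbn, quality_nulls, on="DBN", how="inner")

# Same pattern as In [7]: reset the index, then drop by the merge's labels.
reset = quality_nulls.reset_index(drop=True)
remaining = reset.drop(merged.index)

# For comparison: the rows whose DBN is absent from the exam data.
intended = reset[~reset["DBN"].isin(merged["DBN"])]

print(remaining["DBN"].tolist())
print(intended["DBN"].tolist())
```

On this toy data the two lists differ, because drop removes the rows labelled 0 and 1 regardless of their 'DBN' values.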
I used the following code (In [9], screenshot_5) to remove the 140 rows:
school_quality_2013_2014.drop(school_quality_2013_2014_nulls.index, inplace=True)
school_quality_2013_2014
The above code reduced the School Quality Reports data frame to 1,098 rows, indicating that 140 rows were removed, but it did not remove the intended rows. For instance, the charter schools in the last rows of the data frame (DBN beginning with '84') should have been removed, but they were not (In [9, 10], screenshot_5).
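One possible alternative, sketched on made-up toy data, is to select rows by 'DBN' membership with a boolean mask instead of dropping by index labels. This assumes the goal is to remove rows that have no 'Quality Review Rating' and whose 'DBN' does not appear in both exam datasets. The frame contents and the names quality, math_dbn, ela_dbn, to_remove and cleaned are hypothetical, and this is a sketch rather than a confirmed fix for the real data:

```python
import pandas as pd

# Hypothetical toy stand-in for the School Quality Reports frame.
quality = pd.DataFrame({
    "DBN": ["01M015", "01M019", "84K355", "02M001", "84X704"],
    "Quality Review Rating": ["Proficient", None, None, None, None],
})

# Hypothetical DBN values from the Mathematics and ELA exam datasets.
math_dbn = pd.Series(["01M015", "01M019", "02M001", "03M009"])
ela_dbn = pd.Series(["01M015", "01M019", "02M001"])

# DBNs present in both exam datasets (the role of the inner merge).
exam_dbn = set(math_dbn) & set(ela_dbn)

# Rows to remove: null rating AND DBN not found in the exam data.
no_rating = quality["Quality Review Rating"].isnull()
in_exams = quality["DBN"].isin(exam_dbn)
to_remove = no_rating & ~in_exams

# Keep everything else; the original row labels are never consulted.
cleaned = quality[~to_remove]
print(cleaned["DBN"].tolist())
```

Because the mask is computed on the same frame it filters, the selection does not depend on how any intermediate frame was indexed.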
Please advise and thank you in advance.