Python Forum

Validating Dataframe Using Second Dataframe
Hello all,

I am new to Python but not new to programming. I have a data file of 28k rows x 18 columns that I am loading into a dataframe (skudata) in order to run various data quality checks. One of the checks compares the Category and Subcategory columns of the skudata dataframe against the cat_master dataframe, to ensure that each combination of Category and Subcategory is a valid entry stored there. The cat_master dataframe has only these two columns (also named Category and Subcategory).

The result I want is the rows in skudata whose Category and Subcategory do NOT match the master list in the cat_master dataframe. Keep in mind that it is the combination of Category and Subcategory in skudata that needs to match a combination of Category and Subcategory in cat_master for a row to be considered valid.

Here's what I have in terms of setup, but I need help with the actual "selection" of invalid rows in skudata.
import pandas as pd
skudata = pd.read_csv("S&OP SKU Data.csv")
cat_master = pd.read_csv("Valid Categories & Subcategories")
What do I need to do now in order to select and display only the rows in skudata where the category & subcategory combo does not exist in cat_master?

Thank you!
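
One common way to approach this kind of check is a left merge on both columns with indicator=True, then keeping the rows that pandas marks as "left_only". The sketch below is only an illustration under assumptions: the small dataframes are made-up stand-ins for the two CSV files, and the SKU column and all values are hypothetical. Only the Category and Subcategory column names come from the post.

```python
import pandas as pd

# Hypothetical stand-in data; in the real script these frames come from pd.read_csv.
skudata = pd.DataFrame({
    "SKU": ["A1", "A2", "A3", "A4"],  # assumed column, for illustration only
    "Category": ["Tools", "Tools", "Garden", "Garden"],
    "Subcategory": ["Hammers", "Hoses", "Hoses", "Saws"],
})
cat_master = pd.DataFrame({
    "Category": ["Tools", "Tools", "Garden"],
    "Subcategory": ["Hammers", "Saws", "Hoses"],
})

keys = ["Category", "Subcategory"]

# Left merge on the column pair. indicator=True adds a "_merge" column that is
# "both" when the pair exists in cat_master and "left_only" when it does not.
# drop_duplicates() keeps repeated master pairs from duplicating skudata rows.
merged = skudata.merge(
    cat_master[keys].drop_duplicates(),
    on=keys,
    how="left",
    indicator=True,
)

# Keep only the rows whose (Category, Subcategory) pair is missing from the master list.
invalid_rows = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(invalid_rows)
```

With this sample data, the rows for A2 (Tools / Hoses) and A4 (Garden / Saws) are selected: each value exists somewhere in cat_master, but not as that combination. Note that the comparison is exact, so differences in case or stray whitespace between the two files would also show up as invalid rows.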