Dataframe Comparison

tdbozarth · (This post was last modified: Dec-03-2022, 09:02 PM by tdbozarth.)

Hi all. I am pretty new to this language. I have a dataframe of 18col x 36k rows containing SKU data. Within this data there are Category and Subcategory columns. Then I have a second dataframe with the valid combinations of Category and Subcategory. What is the best approach for auditing the SKU dataframe to ensure that all Category and Subcategory pairs used are valid according to second dataframe master list? The second dataframe only has about 15 rows of data as it only contains the valid combinations.

Thank you!
Ted

**perfringo** · Dec-04-2022, 10:03 AM

One possible way it to evaluate specific columns to boolean values and then use that for indices (needed to be inverted in order to get indices of False):

import pandas as pd

products = (('fruit', 'apple', 'Granny Smith'),
            ('fruit', 'pear', 'Red Anjour'),
            ('car', 'sedan', 'Model S'),
            ('fruit', 'tomato', 'Cherry'))


audit = (('fruit', 'apple'),
         ('fruit', 'pear'))

product_df = pd.DataFrame(products)
audit_df = pd.DataFrame(audit)

indices = pd.eval("(product_df[0] in audit_df[0]) & (product_df[1] in audit_df[1])")

print(product_df[~indices])

       0       1        2
2    car   sedan  Model S
3  fruit  tomato   Cherry

tdbozarth · Dec-05-2022, 12:44 PM

This looks very promising - I very much appreciate your help. Will try to implement this today!

tdbozarth

Having issues with this.

here's what I have. I am getting a whole mess of errors I cannot even interpret. But it seems to start breaking-down with the reference to pd.eval. This is not defined anywhere prior and not sure if it needs to be. In addition once this is working it will return the rows that match the master valid data. What I need is the rows that do NOT match that are invalid - so I will also need to essentially know how to say "not in". thank you!

import pandas as pd
skudata = pd.read_csv("SKU Data.csv")
cat_master = pd.read_csv("Valid Categories & Sub-Categories.csv")

indices = pd.eval("(product_df[0] in audit_df[0]) & (product_df[1] in audit_df[1])")
print(skudata[~indices])

Yoriz write Dec-06-2022, 06:31 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.

**perfringo** · Dec-06-2022, 08:32 AM

(Dec-05-2022, 01:01 PM)tdbozarth Wrote: Having issues with this.

Provide 5-10 rows of sample data, audit data and your code with which you have problem. Then we can have meaningful discussion about errors.

tdbozarth · (This post was last modified: Dec-06-2022, 03:10 PM by tdbozarth.)

Valid Entries example data:

[U_SOP_CATEGORY], [U_SOP_SUBCATEGORY]
Sparkling - All Other Spec Can,
Sparkling - Std Glass,
Inactive,
Non-SOP, Cups and Lids
Non-SOP, Raw Materials

Main SKU file example data:

[ITEM], [DESCR], [LOC], [XFER], [U_SOP_CATEGORY], [U_SUB_CATEGORY] (12 more irrelevant cols follow)
100278, 12Z CN 6P_COKE, C001, TRUE, Sparkling - 12oz Can w/o 35pk,

**perfringo** · Dec-06-2022, 03:45 PM

Questions regarding sample data:

In valid entries first three records have only U_SOP_CATEGORY and no U_SOP_SUBCATEGORY. Is this correct?

In example data there are 6 column labels but only 5 column values. Is this correct?

tdbozarth · Dec-06-2022, 03:55 PM

Yes, I put commas at end in an attempt to show blank entries in last cols. In other words a valid value for Category is a blank and same for subcategory.

Dataframe Comparison

User Panel Messages

Announcements