Posts: 6
Threads: 2
Joined: Dec 2022
Dec-03-2022, 09:02 PM
(This post was last modified: Dec-03-2022, 09:02 PM by tdbozarth.)
Hi all. I am pretty new to this language. I have a dataframe of 18col x 36k rows containing SKU data. Within this data there are Category and Subcategory columns. Then I have a second dataframe with the valid combinations of Category and Subcategory. What is the best approach for auditing the SKU dataframe to ensure that all Category and Subcategory pairs used are valid according to second dataframe master list? The second dataframe only has about 15 rows of data as it only contains the valid combinations.
Thank you!
Ted
Posts: 1,950
Threads: 8
Joined: Jun 2018
One possible way it to evaluate specific columns to boolean values and then use that for indices (needed to be inverted in order to get indices of False):
import pandas as pd
products = (('fruit', 'apple', 'Granny Smith'),
('fruit', 'pear', 'Red Anjour'),
('car', 'sedan', 'Model S'),
('fruit', 'tomato', 'Cherry'))
audit = (('fruit', 'apple'),
('fruit', 'pear'))
product_df = pd.DataFrame(products)
audit_df = pd.DataFrame(audit)
indices = pd.eval("(product_df[0] in audit_df[0]) & (product_df[1] in audit_df[1])")
print(product_df[~indices])
0 1 2
2 car sedan Model S
3 fruit tomato Cherry
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 6
Threads: 2
Joined: Dec 2022
This looks very promising - I very much appreciate your help. Will try to implement this today!
Posts: 6
Threads: 2
Joined: Dec 2022
Dec-05-2022, 01:01 PM
(This post was last modified: Dec-06-2022, 06:31 PM by Yoriz.
Edit Reason: Added code tags
)
Having issues with this.
here's what I have. I am getting a whole mess of errors I cannot even interpret. But it seems to start breaking-down with the reference to pd.eval. This is not defined anywhere prior and not sure if it needs to be. In addition once this is working it will return the rows that match the master valid data. What I need is the rows that do NOT match that are invalid - so I will also need to essentially know how to say "not in". thank you!
import pandas as pd
skudata = pd.read_csv("SKU Data.csv")
cat_master = pd.read_csv("Valid Categories & Sub-Categories.csv")
indices = pd.eval("(product_df[0] in audit_df[0]) & (product_df[1] in audit_df[1])")
print(skudata[~indices])
Posts: 1,950
Threads: 8
Joined: Jun 2018
(Dec-05-2022, 01:01 PM)tdbozarth Wrote: Having issues with this.
Provide 5-10 rows of sample data, audit data and your code with which you have problem. Then we can have meaningful discussion about errors.
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 6
Threads: 2
Joined: Dec 2022
Dec-06-2022, 03:10 PM
(This post was last modified: Dec-06-2022, 03:10 PM by tdbozarth.)
Valid Entries example data:
[U_SOP_CATEGORY], [U_SOP_SUBCATEGORY]
Sparkling - All Other Spec Can,
Sparkling - Std Glass,
Inactive,
Non-SOP, Cups and Lids
Non-SOP, Raw Materials
Main SKU file example data:
[ITEM], [DESCR], [LOC], [XFER], [U_SOP_CATEGORY], [U_SUB_CATEGORY] (12 more irrelevant cols follow)
100278, 12Z CN 6P_COKE, C001, TRUE, Sparkling - 12oz Can w/o 35pk,
Posts: 1,950
Threads: 8
Joined: Jun 2018
Questions regarding sample data:
In valid entries first three records have only U_SOP_CATEGORY and no U_SOP_SUBCATEGORY. Is this correct?
In example data there are 6 column labels but only 5 column values. Is this correct?
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 6
Threads: 2
Joined: Dec 2022
Yes, I put commas at end in an attempt to show blank entries in last cols. In other words a valid value for Category is a blank and same for subcategory.
|