Python Forum
Create dataframe from the unique data of two dataframes
#1
This is a bit complicated, so excuse me if I'm not clear enough...

I have two dataframes with the following columns:

"Date","Acct","Type","Amount","Category","Note","Extra"

In the first dataframe the "Category" and "Note" columns will always be empty.
In the second dataframe, these columns may or may not be empty.

I want to keep a row from the first dataframe only if there is no row in the second dataframe that matches it on the "Date", "Acct", "Type", "Amount" and "Extra" columns. There can be duplicated rows in dataframe 1, and I want to keep every copy that does not match dataframe 2.

Basically, remove any row in the first dataframe that matches any row of the second dataframe, ignoring the "Category" and "Note" columns.

e.g.:
Columns:
"Date",     "Acct",  "Type",   "Amt",  "Category",  "Note",  "Extra"

Dataframe 1:
2025-02-27, 12345,   Transfer, 123.45, ,            ,        Description of transaction   <- This would be removed because the columns I want match the second row of the other dataframe
2025-02-28, 98765,   Purchase, 2.33,   ,            ,        Coffee on debit  <- This is kept because the columns I want don't match any rows in the other dataframe


Dataframe 2:
2025-02-28, 90210,   Deposit,  100.00, ,            ,        ATM Deposit
2025-02-27, 12345,   Transfer, 123.45, Utilities,   Gas,     Description of transaction 
2025-02-28, 98765,   Purchase, 5.99,   Dining,      ,        Coffee on debit

Resulting dataframe:
2025-02-28, 98765,   Purchase, 2.33,   ,            ,        Coffee on debit  
I don't know the markup for a monospace block of text, so I used the code tag.

[edit] correction: Changed "or" to "and". Added duplicate requirement.[/edit]
#2
I have been playing around with grok on 'X' lately.
I submitted your problem and got the following answer which may or may not work, (haven't tried) :
=======================================
To solve your query, you want to select rows from the first dataframe (df1) that do not have a matching row in the second dataframe (df2) based on the columns "Date", "Acct", "Type", "Amount", and "Extra". Both dataframes have the columns "Date", "Acct", "Type", "Amount", "Category", "Note", and "Extra", but in df1, the "Category" and "Note" columns are always empty, while in df2, these columns may contain values or be empty. Since "Category" and "Note" are not part of the matching criteria, we will focus only on the specified columns for determining matches.
In Python, using the pandas library, we can achieve this by performing an anti-join operation. An anti-join keeps rows from df1 that do not have corresponding matches in df2 based on the specified columns. Here’s how we can do it step-by-step:

Solution Approach
1. Perform a Left Merge:
Merge df1 with df2 using a left join on the columns "Date", "Acct", "Type", "Amount", and "Extra". A left join includes all rows from df1, and adds matching rows from df2 where they exist. If no match is found in df2, the columns from df2 will be filled with NaN for that row in the merged dataframe.

Use the indicator=True parameter in the merge function. This adds a special column called _merge that indicates whether each row is present only in the left dataframe (left_only), only in the right dataframe (right_only), or in both (both). Since we’re doing a left join, only left_only and both will appear.

2. Filter Non-Matching Rows:
After the merge, the _merge column will show left_only for rows in df1 that did not find a match in df2, and both for rows that did find a match. We want only the rows where _merge is left_only, as these are the rows in df1 with no match in df2.

Use this boolean condition to filter df1 directly. A left join preserves the order of df1's rows, and as long as each key combination appears at most once in df2, the merged dataframe also has the same number of rows, so the mask lines up with df1. (If df2 might repeat a key combination, de-duplicate its key columns before merging.)

3. Result:
The filtered result will be a dataframe containing only the rows from df1 that do not have matching entries in df2 based on the specified columns.

Python Code
Here is the complete code to accomplish this:

import pandas as pd

# Assuming df1 and df2 are your dataframes
keys = ["Date", "Acct", "Type", "Amount", "Extra"]
# drop_duplicates on df2's key columns keeps merged the same length as df1,
# so the indicator mask lines up row-for-row even if df2 repeats a key
merged = df1.merge(df2[keys].drop_duplicates(), on=keys, how='left', indicator=True)
result = df1[merged['_merge'] == 'left_only']
Explanation of the Code
o df1.merge(df2[keys].drop_duplicates(), on=keys, how='left', indicator=True):
- on=keys: Matches on the columns "Date", "Acct", "Type", "Amount", and "Extra".

- how='left': Performs a left join, keeping all rows from df1.

- indicator=True: Adds the _merge column to track the source of each row.

o merged['_merge'] == 'left_only':
- Creates a boolean mask where True indicates rows from df1 that have no match in df2.

o df1[merged['_merge'] == 'left_only']:
- Filters df1 using the boolean mask, returning only the non-matching rows with their original columns.

Why This Works
o The left join ensures that every row in df1 is represented in the merged dataframe.

o The _merge column distinguishes between rows that found a match (both) and those that did not (left_only).

o Since "Category" and "Note" are not used in the on parameter, they do not affect the matching process. In the result, these columns will remain empty as they are in df1.

o The final result dataframe retains the original structure of df1, including only the rows that meet your condition.

Edge Cases
o If df2 is empty: All rows in df1 will have _merge as left_only, so result will include all rows from df1, which is correct.

o NaN in key columns: Note that pandas' merge treats NaN as equal to NaN in join keys (unlike SQL), so rows of df1 with NaN in "Date", "Acct", "Type", "Amount", or "Extra" will match df2 rows with NaN in the same keys and be excluded from result. If you want such rows kept unconditionally, filter them out before the merge.

This solution efficiently gives you the rows from the first dataframe that lack matches in the second dataframe based on the specified columns.
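For what it's worth, here is the AI's approach run end-to-end on the sample data from post #1 (a sketch, not verified against your real data; df2's key columns are de-duplicated before the merge so the indicator mask stays row-aligned with df1 even if df2 repeats a key):

```python
import pandas as pd

cols = ["Date", "Acct", "Type", "Amount", "Category", "Note", "Extra"]
keys = ["Date", "Acct", "Type", "Amount", "Extra"]

df1 = pd.DataFrame([
    ["2025-02-27", 12345, "Transfer", 123.45, "", "", "Description of transaction"],
    ["2025-02-28", 98765, "Purchase", 2.33, "", "", "Coffee on debit"],
], columns=cols)

df2 = pd.DataFrame([
    ["2025-02-28", 90210, "Deposit", 100.00, "", "", "ATM Deposit"],
    ["2025-02-27", 12345, "Transfer", 123.45, "Utilities", "Gas", "Description of transaction"],
    ["2025-02-28", 98765, "Purchase", 5.99, "Dining", "", "Coffee on debit"],
], columns=cols)

# Left-merge on the key columns only; drop_duplicates keeps the merged
# frame the same length as df1, so the mask applies row-for-row.
merged = df1.merge(df2[keys].drop_duplicates(), on=keys, how="left", indicator=True)
result = df1[(merged["_merge"] == "left_only").to_numpy()]
print(result)  # only the 2.33 "Coffee on debit" row survives
```

The Transfer row matches df2 on all five key columns and is dropped; the 2.33 Purchase differs from df2's 5.99 Purchase on "Amount" and is kept.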
=======================================
This is a very interesting AI. It has created very well written code for several problems I have presented it with.

Fun, if nothing else.
#3
Try this, seems to work on the scant data you provided:

import pandas as pd

columns = ["Date", "Acct", "Type", "Amt", "Category", "Note", "Extra"]
 
data1 = """2025-02-27,12345,Transfer,123.45,,,Description of transaction
2025-02-28,98765,Purchase,2.33,,,Coffee on debit"""

list1 = [s.split(',') for s in data1.split('\n')] 
# check the lengths; each should be 7 for 7 columns (easy to get that wrong)
for l in list1:
    print(len(l))

df1 = pd.DataFrame(list1, columns=columns)

data2 = """2025-02-28,90210,Deposit,100.00,,,ATM Deposit
2025-02-27,12345,Transfer,123.45,Utilities,Gas,Description of transaction 
2025-02-28,98765,Purchase,5.99,Dining,,Coffee on debit"""

list2 = [s.split(',') for s in data2.split('\n')] 
# check the lengths; each should be 7 for 7 columns (easy to get that wrong)
for l in list2:
    print(len(l))

df2 = pd.DataFrame(list2, columns=columns)

res = df1[~df1.isin(df2)].dropna()
Gives this:

Output:
res
         Date   Acct      Type   Amt Category Note            Extra
1  2025-02-28  98765  Purchase  2.33                   Coffee on debit
#4
(Mar-01-2025, 07:14 AM)Pedroski55 Wrote: Try this, seems to work on the scant data you provided:

...

Output:
res
         Date   Acct      Type   Amt Category Note            Extra
1  2025-02-28  98765  Purchase  2.33                   Coffee on debit
Sorry for the limited test data. I didn't want to overfill my post.

That appears to work on the test data, but it seems to drop any row where even a single column is duplicated. What I was looking for is to drop a row only if all the columns are duplicated, except for the "Category" and "Note" columns.

I did not explain it very well in my post. I probably confused myself while writing the post.

It looks like I made a small mistake: "...there is not a match on the "Date", "Acct", "Type", "Amount" or "Extra" columns ...". That "or" should have been an "and".
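A side note on why the one-liner behaves that way: when isin is handed a DataFrame, it compares cell by cell at the same index and column labels rather than looking for whole rows anywhere in the other frame. A minimal sketch with toy columns (not your real data):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
df2 = pd.DataFrame({"a": [1, 99], "b": [99, 99]})

# isin with a DataFrame argument aligns on index AND column labels,
# so it asks "is this exact cell equal in the other frame?"
mask = df1.isin(df2)
# row 0: "a" matches (1 == 1 at index 0), so ~mask leaves a NaN there
res = df1[~mask].dropna()
print(res)  # only row 1 remains
```

dropna() discards row 0 because of that single matching cell, which is why any one duplicated column at the same position is enough to lose a row.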
#5
Just to clarify...

Drop any row in dataframe 1 that is the same as any row in dataframe 2, ignoring the "Category" and "Note" columns.

Something I realized during testing today. There CAN be duplicated rows in dataframe 1 and I want to keep all the duplicated rows if they don't match dataframe 2.

I did edit my original post to make some corrections in my requirements.
#6
I decided to break up my problem and iterate through my dataframes manually. Probably not the best way to do this, but it seems to work.

# Keep only the df1 rows that have no key-column match in df2
new_data = []

for new_idx, new_row in df1.iterrows():
    for old_idx, old_row in df2.iterrows():
        if (new_row['Date'] == old_row['Date']
            and new_row['Acct'] == old_row['Acct']
            and new_row['Type'] == old_row['Type']
            and new_row['Amt'] == old_row['Amt']
            and new_row['Extra'] == old_row['Extra']
        ):
            break
    else:
        new_data.append({'Date': new_row['Date'],
                         'Acct': new_row['Acct'],
                         'Type': new_row['Type'],
                         'Amt': new_row['Amt'],
                         'Category': new_row['Category'],
                         'Note': new_row['Note'],
                         'Extra': new_row['Extra'],
                         })

df_new_data = pd.DataFrame(new_data)
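If the loop ever gets slow on bigger frames, the same logic can be done in one vectorized step with merge and indicator=True. This is a sketch using the column names from your code (made-up sample rows), and it keeps every duplicated df1 row that has no match:

```python
import pandas as pd

cols = ["Date", "Acct", "Type", "Amt", "Category", "Note", "Extra"]
keys = ["Date", "Acct", "Type", "Amt", "Extra"]

df1 = pd.DataFrame([
    ["2025-02-28", 98765, "Purchase", 2.33, "", "", "Coffee on debit"],
    ["2025-02-28", 98765, "Purchase", 2.33, "", "", "Coffee on debit"],  # duplicate to keep
    ["2025-02-27", 12345, "Transfer", 123.45, "", "", "Description of transaction"],
], columns=cols)

df2 = pd.DataFrame([
    ["2025-02-27", 12345, "Transfer", 123.45, "Utilities", "Gas", "Description of transaction"],
], columns=cols)

# A left merge on the key columns; de-duplicating df2's keys keeps the
# merged frame the same length as df1, so the mask applies row-for-row.
mask = df1.merge(df2[keys].drop_duplicates(), on=keys,
                 how="left", indicator=True)["_merge"] == "left_only"
df_new_data = df1[mask.to_numpy()]
print(df_new_data)  # both duplicate Purchase rows kept; Transfer row dropped
```

Unlike isin, this matches whole key-column combinations wherever they appear in df2, and duplicated df1 rows survive as long as their keys appear nowhere in df2.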
[edit] Corrected Amt column name [/edit]
#7
Yeah, that's all pandas is doing behind the scenes: checking things in lists. Each row or column is a list; you just can't see that happening, so it seems like arcane magic.

I would point out that you do not have a column "Amount", you posted "Amt", so you will get an error there, like I did first time around.

Quote:Columns:
"Date", "Acct", "Type", "Amt", "Category", "Note", "Extra"

Still like this one-liner:

res = df1[~df1.isin(df2)].dropna()

