Find duplicates in a pandas dataframe list column on other rows

Calab · Sep-18-2024, 04:14 PM

I have a pandas dataframe where one of the columns is a list of values. I need to identify if any item in the list occurs in that list in other rows.

in other words, on the first row, take the first value in the list in the "items" column and check the "items" column in all the other rows to see if that value occurs. If it does, I want to know the "name" from the first row, the "name" from the row where the match occurs, and the value we were looking for. Repeat for the each value in the "items" list in the first row. Repeat for all the rows.

df = pd.DataFrame([{"name": "abc", "items": ["1234", "5678", "9012"]},
                   {"name": "def", "items": ["3456", "7890"]},
                   {"name": "ghi", "items": ["9876", "1234"]},
                   {"name": "jkl", "items": ["5678", "7890", "2468"]}
                  ])

I want:
    [["abc", "ghi", "1234"],
     ["abc", "jkl", "5678"],
     ["def", "jkl", "7890"]
    ]

I could iterate through the lists, iterating over all the rows, but I feel that pandas should be able to handle at least some of the work here.

What should I be doing to achieve the result I'm looking for?

***snippsat*** · (This post was last modified: Sep-18-2024, 06:48 PM by snippsat.)

(Sep-18-2024, 04:14 PM)Calab Wrote: I could iterate through the lists, iterating over all the rows, but I feel that pandas should be able to handle at least some of the work here.

Yes there is usually better way(think in a vectorized way),as loop/iterating can solve it but performance will usually not good at all in Pandas.
Something like this.

import pandas as pd

df = pd.DataFrame([
    {"name": "abc", "items": ["1234", "5678", "9012"]},
    {"name": "def", "items": ["3456", "7890"]},
    {"name": "ghi", "items": ["9876", "1234"]},
    {"name": "jkl", "items": ["5678", "7890", "2468"]}
])

# Explode the 'items' column
df_exploded = df.explode('items')
df_merged = pd.merge(df_exploded, df_exploded, on='items')
# Exclude rows where the names are the same
df_merged = df_merged[df_merged['name_x'] != df_merged['name_y']]
# Remove duplicate pairs by sorting the names and dropping duplicates
df_merged['sorted_names'] = df_merged.apply(
    lambda row: tuple(sorted([row['name_x'], row['name_y']])),
    axis=1
)
df_unique = df_merged.drop_duplicates(subset=['items', 'sorted_names'])
result = df_unique[['name_x', 'name_y', 'items']].values.tolist()
print(result)

explode function in pandas is not strictly a vectorized method,but is a efficient and practical approach.
Can also mention Polars worth looking into if in future need performance.

Calab · Sep-18-2024, 07:38 PM

(Sep-18-2024, 06:45 PM)snippsat Wrote: Yes there is usually better way(think in a vectorized way),as loop/iterating can solve it but performance will usually not good at all in Pandas.
Something like this.
import pandas as pd

df = pd.DataFrame([
    {"name": "abc", "items": ["1234", "5678", "9012"]},
    {"name": "def", "items": ["3456", "7890"]},
    {"name": "ghi", "items": ["9876", "1234"]},
    {"name": "jkl", "items": ["5678", "7890", "2468"]}
])

# Explode the 'items' column
df_exploded = df.explode('items')
df_merged = pd.merge(df_exploded, df_exploded, on='items')
# Exclude rows where the names are the same
df_merged = df_merged[df_merged['name_x'] != df_merged['name_y']]
# Remove duplicate pairs by sorting the names and dropping duplicates
df_merged['sorted_names'] = df_merged.apply(
    lambda row: tuple(sorted([row['name_x'], row['name_y']])),
    axis=1
)
df_unique = df_merged.drop_duplicates(subset=['items', 'sorted_names'])
result = df_unique[['name_x', 'name_y', 'items']].values.tolist()
print(result)
explode function in pandas is not strictly a vectorized method,but is a efficient and practical approach.
Can also mention Polars worth looking into if in future need performance.

Thanks... looks good, but I just found out that we only have Pandas v 0.24.2 as this machine is running Python2. explode() isn't an option, but you given me some ideas.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	[Solved] Formatting cells of a pandas dataframe into an OpenDocument ods spreadsheet	Calab	1	734	Mar-01-2025, 04:51 AM Last Post: Calab
	Find strings by index from a list of indexes in a different Pandas dataframe column	Calab	3	1,662	Aug-26-2024, 04:52 PM Last Post: Calab
	Loop over dataframe to fill in missing rows	Scott	9	4,662	Jul-12-2024, 05:54 AM Last Post: Scott
	Create new column in dataframe	Scott	10	3,641	Jun-30-2024, 10:18 PM Last Post: Scott
	attempt to split values from within a dataframe column	mbrown009	9	6,023	Jun-20-2024, 07:59 PM Last Post: AdamHensley
	Putting column name to dataframe, can't work.	jonah88888	2	3,296	Jun-18-2024, 09:19 PM Last Post: AdamHensley
	Add NER output to pandas dataframe	dg3000	0	1,174	Apr-22-2024, 08:14 PM Last Post: dg3000
	How to most effectively unpack list of name-value pair dictionaries in a dataframe?	zlim	1	3,004	Nov-07-2023, 10:56 PM Last Post: zlim
	concat 3 columns of dataframe to one column	flash77	2	2,155	Oct-03-2023, 09:29 PM Last Post: flash77
	HTML Decoder pandas dataframe column	mbrown009	3	2,729	Sep-29-2023, 05:56 PM Last Post: deanhystad

Find duplicates in a pandas dataframe list column on other rows

User Panel Messages

Announcements