Python Forum
Find duplicates in a pandas dataframe list column on other rows
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Find duplicates in a pandas dataframe list column on other rows
#1
I have a pandas dataframe where one of the columns is a list of values. I need to identify if any item in the list occurs in that list in other rows.

in other words, on the first row, take the first value in the list in the "items" column and check the "items" column in all the other rows to see if that value occurs. If it does, I want to know the "name" from the first row, the "name" from the row where the match occurs, and the value we were looking for. Repeat for the each value in the "items" list in the first row. Repeat for all the rows.

df = pd.DataFrame([{"name": "abc", "items": ["1234", "5678", "9012"]},
                   {"name": "def", "items": ["3456", "7890"]},
                   {"name": "ghi", "items": ["9876", "1234"]},
                   {"name": "jkl", "items": ["5678", "7890", "2468"]}
                  ])

I want:
    [["abc", "ghi", "1234"],
     ["abc", "jkl", "5678"],
     ["def", "jkl", "7890"]
    ]
I could iterate through the lists, iterating over all the rows, but I feel that pandas should be able to handle at least some of the work here.

What should I be doing to achieve the result I'm looking for?
Reply
#2
(Sep-18-2024, 04:14 PM)Calab Wrote: I could iterate through the lists, iterating over all the rows, but I feel that pandas should be able to handle at least some of the work here.
Yes there is usually better way(think in a vectorized way),as loop/iterating can solve it but performance will usually not good at all in Pandas.
Something like this.
import pandas as pd

df = pd.DataFrame([
    {"name": "abc", "items": ["1234", "5678", "9012"]},
    {"name": "def", "items": ["3456", "7890"]},
    {"name": "ghi", "items": ["9876", "1234"]},
    {"name": "jkl", "items": ["5678", "7890", "2468"]}
])

# Explode the 'items' column
df_exploded = df.explode('items')
df_merged = pd.merge(df_exploded, df_exploded, on='items')
# Exclude rows where the names are the same
df_merged = df_merged[df_merged['name_x'] != df_merged['name_y']]
# Remove duplicate pairs by sorting the names and dropping duplicates
df_merged['sorted_names'] = df_merged.apply(
    lambda row: tuple(sorted([row['name_x'], row['name_y']])),
    axis=1
)
df_unique = df_merged.drop_duplicates(subset=['items', 'sorted_names'])
result = df_unique[['name_x', 'name_y', 'items']].values.tolist()
print(result)
explode function in pandas is not strictly a vectorized method,but is a efficient and practical approach.
Can also mention Polars worth looking into if in future need performance.
Reply
#3
(Sep-18-2024, 06:45 PM)snippsat Wrote: Yes there is usually better way(think in a vectorized way),as loop/iterating can solve it but performance will usually not good at all in Pandas.
Something like this.
import pandas as pd

df = pd.DataFrame([
    {"name": "abc", "items": ["1234", "5678", "9012"]},
    {"name": "def", "items": ["3456", "7890"]},
    {"name": "ghi", "items": ["9876", "1234"]},
    {"name": "jkl", "items": ["5678", "7890", "2468"]}
])

# Explode the 'items' column
df_exploded = df.explode('items')
df_merged = pd.merge(df_exploded, df_exploded, on='items')
# Exclude rows where the names are the same
df_merged = df_merged[df_merged['name_x'] != df_merged['name_y']]
# Remove duplicate pairs by sorting the names and dropping duplicates
df_merged['sorted_names'] = df_merged.apply(
    lambda row: tuple(sorted([row['name_x'], row['name_y']])),
    axis=1
)
df_unique = df_merged.drop_duplicates(subset=['items', 'sorted_names'])
result = df_unique[['name_x', 'name_y', 'items']].values.tolist()
print(result)
explode function in pandas is not strictly a vectorized method,but is a efficient and practical approach.
Can also mention Polars worth looking into if in future need performance.

Thanks... looks good, but I just found out that we only have Pandas v 0.24.2 as this machine is running Python2. explode() isn't an option, but you given me some ideas.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
Question [Solved] Formatting cells of a pandas dataframe into an OpenDocument ods spreadsheet Calab 1 734 Mar-01-2025, 04:51 AM
Last Post: Calab
  Find strings by index from a list of indexes in a different Pandas dataframe column Calab 3 1,662 Aug-26-2024, 04:52 PM
Last Post: Calab
  Loop over dataframe to fill in missing rows Scott 9 4,662 Jul-12-2024, 05:54 AM
Last Post: Scott
  Create new column in dataframe Scott 10 3,641 Jun-30-2024, 10:18 PM
Last Post: Scott
  attempt to split values from within a dataframe column mbrown009 9 6,023 Jun-20-2024, 07:59 PM
Last Post: AdamHensley
  Putting column name to dataframe, can't work. jonah88888 2 3,296 Jun-18-2024, 09:19 PM
Last Post: AdamHensley
  Add NER output to pandas dataframe dg3000 0 1,174 Apr-22-2024, 08:14 PM
Last Post: dg3000
  How to most effectively unpack list of name-value pair dictionaries in a dataframe? zlim 1 3,004 Nov-07-2023, 10:56 PM
Last Post: zlim
  concat 3 columns of dataframe to one column flash77 2 2,155 Oct-03-2023, 09:29 PM
Last Post: flash77
  HTML Decoder pandas dataframe column mbrown009 3 2,729 Sep-29-2023, 05:56 PM
Last Post: deanhystad

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020