Extracting unique pairs from a data set based on another value

rybina · Feb-11-2021, 09:19 AM

Hi,

I'm unable to find a way to create something that solves a problem that I have.
This is a real life problem in work which I have simplified in an attempt to keep the problem nice and clear.

The problem:

Suppose I have a group of 3 idents (numbers 1, 2 and 3), and wish to compare each of them against 5 other idents (numbers 5, 6, 7, 8 and 9).

Each ident number has lots of other data that I have used to calculate scores that compare each of the 3 idents with the other 5 'comparison' idents.

The dataframe would look like this:

import pandas as pd

data = {'id1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], \
'id2': [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9], \
'score': [30, 19, 29, 25, 14, 27, 26, 24, 23, 12, 20, 21, 16, 15, 17]}

df = pd.DataFrame(data)

print(df)

id1 id2 score
0 1 5 30
1 1 6 19
2 1 7 29
3 1 8 25
4 1 9 14
5 2 5 27
6 2 6 26
7 2 7 24
8 2 8 23
9 2 9 12
10 3 5 20
11 3 6 21
12 3 7 16
13 3 8 15
14 3 9 17

The result required:

The goal is to find each id1 a unique matched id2 based on the highest score.

So the first thing I have done is sort the dataframe into descending scores:

df.sort_values(by='score', ascending=False)

id1 id2 score
0 1 5 30
2 1 7 29
5 2 5 27
6 2 6 26
3 1 8 25
7 2 7 24
8 2 8 23
11 3 6 21
10 3 5 20
1 1 6 19
14 3 9 17
12 3 7 16
13 3 8 15
4 1 9 14
9 2 9 12

Continuing with the task:

The highest score is 30, so id1=1 and id2=5 is the first match.
The second highest score is 29, where id1=1 and id2=7, but id1=1 is already matched, so this row can be discarded.
The third highest score is 27, where id1=2 and id2=5, but id2=5 is already matched, so this row can be discarded.
The fourth highest score is 26, where id1=2 and id2=6, neither of these already have a match, so this row joins the other matched row (where id1=1 and id2=5) and we have our second match.

The process continues to reject matches until the 11th row, where id1=3 and id2=9, where neither of these already have a match, and this row joins the other two rows,(where id1=1 and id2=5 are matched, and id1=2 and id2=6 are matched).

The process will then continue down the rows but will fail to find any more matches.

I also want to add a match_id to the results; so the first match (with the highest score would be match_id=1, the second match would be match_id=2 etc).

Hence, the result would look like:

result = {'id1': [1, 2, 3], \
'id2': [5, 6, 9], \
'score': [30, 26, 17], \
'match_id': [1, 2, 3]}

result = pd.DataFrame(result)

print(result)

id1 id2 score match_id
0 1 5 30 1
1 2 6 26 2
2 3 9 17 3

I would think another possible method would be to create something that deletes any row that has the same id number above it in the dataframe once sorted? Any method that gives the correct result would be fine.

I'm fairly new to python and unfortunately don't know how to solve this problem myself yet, but I am reading and learning more everyday, so if this seems simple please do forgive me.

Any help would be greatly appreciated.

**nilamo** · Feb-11-2021, 10:21 PM

I'm not sure how it could be done in pandas. I'm also not sure what the match_id does, except act as an auto-increment, but I included it anyway for the fun of it.

Here's how it could be done with a simple loop:

>>> import pandas as pd
>>> data = {'id1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
... 'id2': [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9],
... 'score': [30, 19, 29, 25, 14, 27, 26, 24, 23, 12, 20, 21, 16, 15, 17]}
>>> df = pd.DataFrame(data)
>>> df = df.sort_values(by="score", ascending=False)
>>> df
    id1  id2  score
0     1    5     30
2     1    7     29
5     2    5     27
6     2    6     26
3     1    8     25
7     2    7     24
8     2    8     23
11    3    6     21
10    3    5     20
1     1    6     19
14    3    9     17
12    3    7     16
13    3    8     15
4     1    9     14
9     2    9     12
>>> matches = {"id1": [], "id2": [], "score": [], "match_id": []}
>>> for row in df.values:
...   id1, id2, score = row
...   if id1 not in matches["id1"] and id2 not in matches["id2"]:
...     matches["id1"].append(id1)
...     matches["id2"].append(id2)
...     matches["score"].append(score)
...     matches["match_id"].append(len(matches["match_id"])+1)
...
>>> matches
{'id1': [1, 2, 3], 'id2': [5, 6, 9], 'score': [30, 26, 17], 'match_id': [1, 2, 3]}
>>> new_df = pd.DataFrame(matches)
>>> new_df
   id1  id2  score  match_id
0    1    5     30         1
1    2    6     26         2
2    3    9     17         3

rybina · Feb-12-2021, 08:36 AM

Hi nilamo,

this has worked perfectly and really helped me out.

Thank you so much.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Improve Prime Pairs Code	jzakiya	7	1,588	Mar-09-2025, 08:19 PM Last Post: jzakiya
	Extracting data from bank statement PDFs (Accountant)	a4avinash	4	5,756	Feb-27-2025, 01:53 PM Last Post: griffinhenry
	Changing client.get() method type based on size of data...	dl0dth	1	730	Jan-02-2025, 08:30 PM Last Post: dl0dth
	Pairs Trading Simulation	Kiitoos	2	1,936	Aug-21-2024, 09:43 AM Last Post: Larz60+
	Confused by the different ways of extracting data in DataFrame	leea2024	1	698	Aug-17-2024, 01:34 PM Last Post: deanhystad
	Extracting the correct data from a CSV file	S2G	6	1,858	Jun-03-2024, 04:50 PM Last Post: snippsat
	Sample random, unique string pairs from a list without repetitions	walterwhite	1	1,920	Nov-19-2023, 10:07 PM Last Post: deanhystad
	Grouping Data based on 30% bracket	purnima1	4	2,177	Mar-10-2023, 07:38 PM Last Post: deanhystad
	Extracting Data into Columns using pdfplumber	arvin	17	17,341	Dec-17-2022, 11:59 AM Last Post: arvin
	conditionals based on data frame	mbrown009	1	1,581	Aug-12-2022, 08:18 AM Last Post: Larz60+

Extracting unique pairs from a data set based on another value

User Panel Messages

Announcements