Extracting unique pairs from a data set based on another value

rybina · Feb-11-2021, 09:19 AM

Hi,

I'm unable to find a way to create something that solves a problem that I have.
This is a real life problem in work which I have simplified in an attempt to keep the problem nice and clear.

The problem:

Suppose I have a group of 3 idents (numbers 1, 2 and 3), and wish to compare each of them against 5 other idents (numbers 5, 6, 7, 8 and 9).

Each ident number has lots of other data that I have used to calculate scores that compare each of the 3 idents with the other 5 'comparison' idents.

The dataframe would look like this:

import pandas as pd

data = {'id1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], \
'id2': [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9], \
'score': [30, 19, 29, 25, 14, 27, 26, 24, 23, 12, 20, 21, 16, 15, 17]}

df = pd.DataFrame(data)

print(df)

id1 id2 score
0 1 5 30
1 1 6 19
2 1 7 29
3 1 8 25
4 1 9 14
5 2 5 27
6 2 6 26
7 2 7 24
8 2 8 23
9 2 9 12
10 3 5 20
11 3 6 21
12 3 7 16
13 3 8 15
14 3 9 17

The result required:

The goal is to find each id1 a unique matched id2 based on the highest score.

So the first thing I have done is sort the dataframe into descending scores:

df.sort_values(by='score', ascending=False)

id1 id2 score
0 1 5 30
2 1 7 29
5 2 5 27
6 2 6 26
3 1 8 25
7 2 7 24
8 2 8 23
11 3 6 21
10 3 5 20
1 1 6 19
14 3 9 17
12 3 7 16
13 3 8 15
4 1 9 14
9 2 9 12

Continuing with the task:

The highest score is 30, so id1=1 and id2=5 is the first match.
The second highest score is 29, where id1=1 and id2=7, but id1=1 is already matched, so this row can be discarded.
The third highest score is 27, where id1=2 and id2=5, but id2=5 is already matched, so this row can be discarded.
The fourth highest score is 26, where id1=2 and id2=6, neither of these already have a match, so this row joins the other matched row (where id1=1 and id2=5) and we have our second match.

The process continues to reject matches until the 11th row, where id1=3 and id2=9, where neither of these already have a match, and this row joins the other two rows,(where id1=1 and id2=5 are matched, and id1=2 and id2=6 are matched).

The process will then continue down the rows but will fail to find any more matches.

I also want to add a match_id to the results; so the first match (with the highest score would be match_id=1, the second match would be match_id=2 etc).

Hence, the result would look like:

result = {'id1': [1, 2, 3], \
'id2': [5, 6, 9], \
'score': [30, 26, 17], \
'match_id': [1, 2, 3]}

result = pd.DataFrame(result)

print(result)

id1 id2 score match_id
0 1 5 30 1
1 2 6 26 2
2 3 9 17 3

I would think another possible method would be to create something that deletes any row that has the same id number above it in the dataframe once sorted? Any method that gives the correct result would be fine.

I'm fairly new to python and unfortunately don't know how to solve this problem myself yet, but I am reading and learning more everyday, so if this seems simple please do forgive me.

Any help would be greatly appreciated.

**nilamo** · Feb-11-2021, 10:21 PM

I'm not sure how it could be done in pandas. I'm also not sure what the match_id does, except act as an auto-increment, but I included it anyway for the fun of it.

Here's how it could be done with a simple loop:

>>> import pandas as pd
>>> data = {'id1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
... 'id2': [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9],
... 'score': [30, 19, 29, 25, 14, 27, 26, 24, 23, 12, 20, 21, 16, 15, 17]}
>>> df = pd.DataFrame(data)
>>> df = df.sort_values(by="score", ascending=False)
>>> df
    id1  id2  score
0     1    5     30
2     1    7     29
5     2    5     27
6     2    6     26
3     1    8     25
7     2    7     24
8     2    8     23
11    3    6     21
10    3    5     20
1     1    6     19
14    3    9     17
12    3    7     16
13    3    8     15
4     1    9     14
9     2    9     12
>>> matches = {"id1": [], "id2": [], "score": [], "match_id": []}
>>> for row in df.values:
...   id1, id2, score = row
...   if id1 not in matches["id1"] and id2 not in matches["id2"]:
...     matches["id1"].append(id1)
...     matches["id2"].append(id2)
...     matches["score"].append(score)
...     matches["match_id"].append(len(matches["match_id"])+1)
...
>>> matches
{'id1': [1, 2, 3], 'id2': [5, 6, 9], 'score': [30, 26, 17], 'match_id': [1, 2, 3]}
>>> new_df = pd.DataFrame(matches)
>>> new_df
   id1  id2  score  match_id
0    1    5     30         1
1    2    6     26         2
2    3    9     17         3

rybina · Feb-12-2021, 08:36 AM

Hi nilamo,

this has worked perfectly and really helped me out.

Thank you so much.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Pairs Trading Simulation	Kiitoos	0	234	Feb-19-2024, 08:27 PM Last Post: Kiitoos
	Sample random, unique string pairs from a list without repetitions	walterwhite	1	449	Nov-19-2023, 10:07 PM Last Post: deanhystad
	Grouping Data based on 30% bracket	purnima1	4	1,192	Mar-10-2023, 07:38 PM Last Post: deanhystad
	Extracting Data into Columns using pdfplumber	arvin	17	5,521	Dec-17-2022, 11:59 AM Last Post: arvin
	conditionals based on data frame	mbrown009	1	893	Aug-12-2022, 08:18 AM Last Post: Larz60+
	I have written a program that outputs data based on GPS signal	kalle	1	1,166	Jul-22-2022, 12:10 AM Last Post: mcmxl22
	Change elements of array based on position of input data	Cola_Reb	6	2,113	May-13-2022, 12:57 PM Last Post: Cola_Reb
	Extracting Specific Lines from text file based on content.	jokerfmj	8	2,955	Mar-28-2022, 03:38 PM Last Post: snippsat
	Pairs of multiplied prime number--->N	Frankduc	13	3,514	Jan-16-2022, 01:52 PM Last Post: Frankduc
	Extracting Data from tables	DataExtrator	0	1,141	Nov-02-2021, 12:24 PM Last Post: DataExtrator

Extracting unique pairs from a data set based on another value

User Panel Messages

Announcements