Python Forum
Extracting unique pairs from a data set based on another value
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Extracting unique pairs from a data set based on another value
#1
Hi,

I'm unable to find a way to create something that solves a problem that I have.
This is a real life problem in work which I have simplified in an attempt to keep the problem nice and clear.

The problem:

Suppose I have a group of 3 idents (numbers 1, 2 and 3), and wish to compare each of them against 5 other idents (numbers 5, 6, 7, 8 and 9).

Each ident number has lots of other data that I have used to calculate scores that compare each of the 3 idents with the other 5 'comparison' idents.

The dataframe would look like this:

import pandas as pd

data = {'id1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], \
'id2': [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9], \
'score': [30, 19, 29, 25, 14, 27, 26, 24, 23, 12, 20, 21, 16, 15, 17]}

df = pd.DataFrame(data)

print(df)

id1 id2 score
0 1 5 30
1 1 6 19
2 1 7 29
3 1 8 25
4 1 9 14
5 2 5 27
6 2 6 26
7 2 7 24
8 2 8 23
9 2 9 12
10 3 5 20
11 3 6 21
12 3 7 16
13 3 8 15
14 3 9 17

The result required:

The goal is to find each id1 a unique matched id2 based on the highest score.

So the first thing I have done is sort the dataframe into descending scores:

df.sort_values(by='score', ascending=False)

id1 id2 score
0 1 5 30
2 1 7 29
5 2 5 27
6 2 6 26
3 1 8 25
7 2 7 24
8 2 8 23
11 3 6 21
10 3 5 20
1 1 6 19
14 3 9 17
12 3 7 16
13 3 8 15
4 1 9 14
9 2 9 12

Continuing with the task:

The highest score is 30, so id1=1 and id2=5 is the first match.
The second highest score is 29, where id1=1 and id2=7, but id1=1 is already matched, so this row can be discarded.
The third highest score is 27, where id1=2 and id2=5, but id2=5 is already matched, so this row can be discarded.
The fourth highest score is 26, where id1=2 and id2=6, neither of these already have a match, so this row joins the other matched row (where id1=1 and id2=5) and we have our second match.

The process continues to reject matches until the 11th row, where id1=3 and id2=9, where neither of these already have a match, and this row joins the other two rows,(where id1=1 and id2=5 are matched, and id1=2 and id2=6 are matched).

The process will then continue down the rows but will fail to find any more matches.

I also want to add a match_id to the results; so the first match (with the highest score would be match_id=1, the second match would be match_id=2 etc).

Hence, the result would look like:

result = {'id1': [1, 2, 3], \
'id2': [5, 6, 9], \
'score': [30, 26, 17], \
'match_id': [1, 2, 3]}

result = pd.DataFrame(result)

print(result)

id1 id2 score match_id
0 1 5 30 1
1 2 6 26 2
2 3 9 17 3

I would think another possible method would be to create something that deletes any row that has the same id number above it in the dataframe once sorted? Any method that gives the correct result would be fine.

I'm fairly new to python and unfortunately don't know how to solve this problem myself yet, but I am reading and learning more everyday, so if this seems simple please do forgive me.

Any help would be greatly appreciated.
Reply
#2
I'm not sure how it could be done in pandas. I'm also not sure what the match_id does, except act as an auto-increment, but I included it anyway for the fun of it.

Here's how it could be done with a simple loop:
>>> import pandas as pd
>>> data = {'id1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
... 'id2': [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9],
... 'score': [30, 19, 29, 25, 14, 27, 26, 24, 23, 12, 20, 21, 16, 15, 17]}
>>> df = pd.DataFrame(data)
>>> df = df.sort_values(by="score", ascending=False)
>>> df
    id1  id2  score
0     1    5     30
2     1    7     29
5     2    5     27
6     2    6     26
3     1    8     25
7     2    7     24
8     2    8     23
11    3    6     21
10    3    5     20
1     1    6     19
14    3    9     17
12    3    7     16
13    3    8     15
4     1    9     14
9     2    9     12
>>> matches = {"id1": [], "id2": [], "score": [], "match_id": []}
>>> for row in df.values:
...   id1, id2, score = row
...   if id1 not in matches["id1"] and id2 not in matches["id2"]:
...     matches["id1"].append(id1)
...     matches["id2"].append(id2)
...     matches["score"].append(score)
...     matches["match_id"].append(len(matches["match_id"])+1)
...
>>> matches
{'id1': [1, 2, 3], 'id2': [5, 6, 9], 'score': [30, 26, 17], 'match_id': [1, 2, 3]}
>>> new_df = pd.DataFrame(matches)
>>> new_df
   id1  id2  score  match_id
0    1    5     30         1
1    2    6     26         2
2    3    9     17         3
rybina likes this post
Reply
#3
Hi nilamo,

this has worked perfectly and really helped me out.

Thank you so much.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
Star Pairs Trading Simulation Kiitoos 0 234 Feb-19-2024, 08:27 PM
Last Post: Kiitoos
  Sample random, unique string pairs from a list without repetitions walterwhite 1 449 Nov-19-2023, 10:07 PM
Last Post: deanhystad
  Grouping Data based on 30% bracket purnima1 4 1,192 Mar-10-2023, 07:38 PM
Last Post: deanhystad
  Extracting Data into Columns using pdfplumber arvin 17 5,521 Dec-17-2022, 11:59 AM
Last Post: arvin
  conditionals based on data frame mbrown009 1 893 Aug-12-2022, 08:18 AM
Last Post: Larz60+
  I have written a program that outputs data based on GPS signal kalle 1 1,166 Jul-22-2022, 12:10 AM
Last Post: mcmxl22
Question Change elements of array based on position of input data Cola_Reb 6 2,113 May-13-2022, 12:57 PM
Last Post: Cola_Reb
  Extracting Specific Lines from text file based on content. jokerfmj 8 2,955 Mar-28-2022, 03:38 PM
Last Post: snippsat
  Pairs of multiplied prime number--->N Frankduc 13 3,514 Jan-16-2022, 01:52 PM
Last Post: Frankduc
  Extracting Data from tables DataExtrator 0 1,141 Nov-02-2021, 12:24 PM
Last Post: DataExtrator

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020