Python Forum
Extracting unique pairs from a data set based on another value
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Extracting unique pairs from a data set based on another value
#1
Hi,

I'm unable to find a way to create something that solves a problem that I have.
This is a real life problem in work which I have simplified in an attempt to keep the problem nice and clear.

The problem:

Suppose I have a group of 3 idents (numbers 1, 2 and 3), and wish to compare each of them against 5 other idents (numbers 5, 6, 7, 8 and 9).

Each ident number has lots of other data that I have used to calculate scores that compare each of the 3 idents with the other 5 'comparison' idents.

The dataframe would look like this:

import pandas as pd

data = {'id1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], \
'id2': [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9], \
'score': [30, 19, 29, 25, 14, 27, 26, 24, 23, 12, 20, 21, 16, 15, 17]}

df = pd.DataFrame(data)

print(df)

id1 id2 score
0 1 5 30
1 1 6 19
2 1 7 29
3 1 8 25
4 1 9 14
5 2 5 27
6 2 6 26
7 2 7 24
8 2 8 23
9 2 9 12
10 3 5 20
11 3 6 21
12 3 7 16
13 3 8 15
14 3 9 17

The result required:

The goal is to find each id1 a unique matched id2 based on the highest score.

So the first thing I have done is sort the dataframe into descending scores:

df.sort_values(by='score', ascending=False)

id1 id2 score
0 1 5 30
2 1 7 29
5 2 5 27
6 2 6 26
3 1 8 25
7 2 7 24
8 2 8 23
11 3 6 21
10 3 5 20
1 1 6 19
14 3 9 17
12 3 7 16
13 3 8 15
4 1 9 14
9 2 9 12

Continuing with the task:

The highest score is 30, so id1=1 and id2=5 is the first match.
The second highest score is 29, where id1=1 and id2=7, but id1=1 is already matched, so this row can be discarded.
The third highest score is 27, where id1=2 and id2=5, but id2=5 is already matched, so this row can be discarded.
The fourth highest score is 26, where id1=2 and id2=6, neither of these already have a match, so this row joins the other matched row (where id1=1 and id2=5) and we have our second match.

The process continues to reject matches until the 11th row, where id1=3 and id2=9, where neither of these already have a match, and this row joins the other two rows,(where id1=1 and id2=5 are matched, and id1=2 and id2=6 are matched).

The process will then continue down the rows but will fail to find any more matches.

I also want to add a match_id to the results; so the first match (with the highest score would be match_id=1, the second match would be match_id=2 etc).

Hence, the result would look like:

result = {'id1': [1, 2, 3], \
'id2': [5, 6, 9], \
'score': [30, 26, 17], \
'match_id': [1, 2, 3]}

result = pd.DataFrame(result)

print(result)

id1 id2 score match_id
0 1 5 30 1
1 2 6 26 2
2 3 9 17 3

I would think another possible method would be to create something that deletes any row that has the same id number above it in the dataframe once sorted? Any method that gives the correct result would be fine.

I'm fairly new to python and unfortunately don't know how to solve this problem myself yet, but I am reading and learning more everyday, so if this seems simple please do forgive me.

Any help would be greatly appreciated.
Reply
#2
I'm not sure how it could be done in pandas. I'm also not sure what the match_id does, except act as an auto-increment, but I included it anyway for the fun of it.

Here's how it could be done with a simple loop:
>>> import pandas as pd
>>> data = {'id1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
... 'id2': [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9],
... 'score': [30, 19, 29, 25, 14, 27, 26, 24, 23, 12, 20, 21, 16, 15, 17]}
>>> df = pd.DataFrame(data)
>>> df = df.sort_values(by="score", ascending=False)
>>> df
    id1  id2  score
0     1    5     30
2     1    7     29
5     2    5     27
6     2    6     26
3     1    8     25
7     2    7     24
8     2    8     23
11    3    6     21
10    3    5     20
1     1    6     19
14    3    9     17
12    3    7     16
13    3    8     15
4     1    9     14
9     2    9     12
>>> matches = {"id1": [], "id2": [], "score": [], "match_id": []}
>>> for row in df.values:
...   id1, id2, score = row
...   if id1 not in matches["id1"] and id2 not in matches["id2"]:
...     matches["id1"].append(id1)
...     matches["id2"].append(id2)
...     matches["score"].append(score)
...     matches["match_id"].append(len(matches["match_id"])+1)
...
>>> matches
{'id1': [1, 2, 3], 'id2': [5, 6, 9], 'score': [30, 26, 17], 'match_id': [1, 2, 3]}
>>> new_df = pd.DataFrame(matches)
>>> new_df
   id1  id2  score  match_id
0    1    5     30         1
1    2    6     26         2
2    3    9     17         3
rybina likes this post
Reply
#3
Hi nilamo,

this has worked perfectly and really helped me out.

Thank you so much.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Improve Prime Pairs Code jzakiya 7 1,588 Mar-09-2025, 08:19 PM
Last Post: jzakiya
  Extracting data from bank statement PDFs (Accountant) a4avinash 4 5,756 Feb-27-2025, 01:53 PM
Last Post: griffinhenry
  Changing client.get() method type based on size of data... dl0dth 1 730 Jan-02-2025, 08:30 PM
Last Post: dl0dth
Star Pairs Trading Simulation Kiitoos 2 1,936 Aug-21-2024, 09:43 AM
Last Post: Larz60+
  Confused by the different ways of extracting data in DataFrame leea2024 1 698 Aug-17-2024, 01:34 PM
Last Post: deanhystad
  Extracting the correct data from a CSV file S2G 6 1,858 Jun-03-2024, 04:50 PM
Last Post: snippsat
  Sample random, unique string pairs from a list without repetitions walterwhite 1 1,920 Nov-19-2023, 10:07 PM
Last Post: deanhystad
  Grouping Data based on 30% bracket purnima1 4 2,177 Mar-10-2023, 07:38 PM
Last Post: deanhystad
  Extracting Data into Columns using pdfplumber arvin 17 17,341 Dec-17-2022, 11:59 AM
Last Post: arvin
  conditionals based on data frame mbrown009 1 1,581 Aug-12-2022, 08:18 AM
Last Post: Larz60+

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020