Python Forum
Distances between all combinations in dataset
#1
I have a data set with a set of unique ids. Each of these ids has eastings and northings and a dummy variable which can take on the values of 1 or 0. I want to find whether, for each id, there is another id for which the dummy is 1 within 1000 meters. What I am trying to do is generate a list containing the distances for all possible combinations and then discard those which are further than 1000 m or for which the dummy is 0. However, my dataset is quite large (100k observations), so I realise I am creating an incredibly large number of combinations. Is there a better way of doing this? Essentially, I am looking for a way to discard observations that do not meet my criteria immediately rather than storing them in memory.

This is the code I am trying to run:

list_x = df["easting"].tolist()
list_y = df["northing"].tolist()

list_xy = (list(zip(list_x, list_y)))

import math

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

list=[]
from itertools import combinations

for combo in combinations(list_xy,2):
    dist = ([distance(*combo)])
    list.append(dist)
#2
If you don't want them stored in memory, don't store them in memory. Use a conditional to check before appending them to the list:

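for combo in combinations(list_xy,2):
    # Keep the distance as a plain number so it can be compared
    dist = distance(*combo)
    # Store only the pairs that are within 1000 m of each other
    if dist <= 1000:
        list.append(dist)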
Also, wouldn't you want to append the combo with the distance, instead of just the distance? To further save memory, put the dummy variable you are talking about in list_xy, so that you can subset by that before even calculating a distance.
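The subsetting idea above could look something like the sketch below. It is only an illustration: the `rows` sample data, the tuple layout (id, easting, northing, dummy) and the `nearest` dictionary are all assumptions, not part of the original code. Each row is compared only against the rows whose dummy is 1, so pairs that can never meet the criterion are never measured.

```python
import math

# Hypothetical sample rows: (id, easting, northing, dummy)
rows = [
    ("a", 0.0, 0.0, 0),
    ("b", 300.0, 400.0, 1),
    ("c", 5000.0, 5000.0, 0),
    ("d", 5600.0, 5800.0, 1),
]

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

# Subset once: only rows with dummy == 1 can satisfy the criterion
flagged = [row for row in rows if row[3] == 1]

nearest = {}
for row in rows:
    # Distances from this row to every *other* flagged row
    close = [
        distance(row[1:3], other[1:3])
        for other in flagged
        if other[0] != row[0]
    ]
    # Keep only the distances within 1000 m
    close = [d for d in close if d <= 1000]
    # Smallest qualifying distance, or None if there is no such neighbour
    nearest[row[0]] = min(close) if close else None
```

This still does one comparison per (row, flagged row) pair, so it helps most when only a small share of the ids have dummy = 1.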
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
#3
(Dec-10-2018, 02:38 PM)ichabod801 Wrote: Also, wouldn't you want to append the combo with the distance, instead of just the distance?

Thanks for your reply! Yes, you are right. What I am trying to achieve is essentially creating a new column in the data which answers the question "does this ID have another ID within 1000 m that has dummy = 1?" and another column that shows what that distance actually is. So I think what I need in the list is both the distance and an identifier showing which id, or which combination, it refers to. I do not know if what I am doing is taking me in the right direction, so let me know if you have any advice!
#4
(Dec-11-2018, 12:21 PM)amyd Wrote: what I am trying to achieve is essentially creating a new column in the data which answers the question "does this ID have another ID within 1000 m that has dummy = 1?" and another column that shows what that distance actually is.

Okay, I didn't understand that. In that case you want two lists. But what distances do you want to save? Do you want to save the minimum distance between the ID and any other ID? Or do you want to save all the distances under 1000 m?
#5
(Dec-11-2018, 02:16 PM)ichabod801 Wrote: Do you want to save the minimum distance between the ID and any other ID? Or do you want to save all the distances under 1000 m?

I apologize, I should have made it clearer. I just want to save all the distances under 1000m.
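For what it's worth, one way to collect only the pairs under 1000 m without enumerating every combination is to bucket the points into a square grid first. This is a swapped-in technique, not something proposed elsewhere in this thread, and the function name `pairs_within`, the sample `points` and the assumption that ids are unique and sortable are all hypothetical.

```python
import math
from collections import defaultdict

def pairs_within(points, radius):
    """Return (id_a, id_b, distance) for every pair at most `radius` apart.

    `points` is a list of (id, easting, northing) tuples with unique,
    mutually comparable ids (an assumption of this sketch).
    """
    # Bucket each point into a square cell whose side equals the radius
    grid = defaultdict(list)
    for pid, x, y in points:
        cell = (math.floor(x / radius), math.floor(y / radius))
        grid[cell].append((pid, x, y))

    found = []
    for (cx, cy), members in grid.items():
        # A point within `radius` must sit in this cell or one of its 8 neighbours
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighbours = grid.get((cx + dx, cy + dy), ())
                for pid_a, xa, ya in members:
                    for pid_b, xb, yb in neighbours:
                        # pid_a < pid_b reports each pair once and skips self-pairs
                        if pid_a < pid_b:
                            dist = math.hypot(xa - xb, ya - yb)
                            if dist <= radius:
                                found.append((pid_a, pid_b, dist))
    return found

# Hypothetical sample data: (id, easting, northing)
points = [(1, 0.0, 0.0), (2, 300.0, 400.0), (3, 1900.0, 0.0), (4, 2100.0, 0.0)]
close_pairs = pairs_within(points, 1000)
```

Points in cells that are not adjacent are never compared, so far-apart pairs are discarded without computing or storing a distance. How much this saves depends on how spread out the points are.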
#6
That's going to be two separate things. First I would make another data set with the two id's and the distance, using something like this:

import math
from itertools import combinations

list_x = df["easting"].tolist()
list_y = df["northing"].tolist()
list_id = df["id"].tolist()

# One (id, easting, northing) tuple per row
list_xy = list(zip(list_id, list_x, list_y))

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

distances = []
valid = set()

for first, second in combinations(list_xy, 2):
    # Skip the id in position 0 and measure between the coordinates
    dist = distance(first[1:], second[1:])
    # Keep only the pairs within 1000 m of each other
    if dist <= 1000:
        distances.append((first[0], second[0], dist))
        valid.add(first[0])
        valid.add(second[0])
That will give you a list of tuples, each holding two ids and the distance between them. You can zip that to get three columns to make a dataframe out of. You will also have valid, a set of all the ids with another id within 1000 m. You can use that to make a new column in your original dataframe showing which ones meet that criterion.
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
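As a rough sketch of that last step, the unzipping and the new column might look like this. The sample values for `list_id`, `distances` and `valid` are made up for illustration, and the names `id_a`, `id_b`, `dist` and `has_neighbour` are assumptions.

```python
# Hypothetical stand-ins for the values built by the loop above
list_id = [101, 102, 103]
distances = [(101, 102, 250.0)]
valid = {101, 102}

# Unzip the list of tuples into three parallel columns
if distances:
    id_a, id_b, dist = (list(col) for col in zip(*distances))
else:
    # zip(*[]) yields nothing, so handle the empty case explicitly
    id_a, id_b, dist = [], [], []

# One True/False flag per original row, in the same order as list_id
has_neighbour = [i in valid for i in list_id]
```

With pandas, the three lists could then be passed to `pd.DataFrame` as a dict of columns, and the flag could also be computed directly with `df["id"].isin(valid)`.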
#7
(Dec-11-2018, 09:39 PM)ichabod801 Wrote: That's going to be two separate things. First I would make another data set with the two id's and the distance.

Thanks so much for the advice!