Python Forum - Distances between all combinations in dataset

I have a data set with a set of unique ids. Each id has an easting, a northing, and a dummy variable that takes the value 1 or 0. For each id, I want to find whether there is another id within 1000 meters for which the dummy is 1. What I am trying to do is generate a list containing the distances for all possible combinations and then discard those that are further than 1000 m or for which the dummy is 0. However, my dataset is quite large (100k observations), so I realise I am creating an incredibly large number of combinations. Is there a better way of doing this? Essentially, I am looking for a way to discard observations that do not meet my criteria immediately rather than storing them in memory.

This is the code I am trying to run:

list_x = df["easting"].tolist()
list_y = df["northing"].tolist()

list_xy = (list(zip(list_x, list_y)))

import math

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

list=[]
from itertools import combinations

for combo in combinations(list_xy,2):
    dist = ([distance(*combo)])
    list.append(dist)
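
To put a number on "incredibly large": the count of unordered pairs among n observations is n(n-1)/2. A quick check, assuming exactly 100,000 observations as stated in the post:

```python
import math

n = 100_000  # number of observations mentioned in the post

# Number of unordered pairs that combinations(list_xy, 2) would produce.
pair_count = math.comb(n, 2)
print(pair_count)  # 4999950000
```

That is roughly five billion distances, which is why storing every one of them in a list is not practical.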

If you don't want them stored in memory, don't store them in memory. Use a conditional to check before appending them to the list:

for combo in combinations(list_xy,2):
    dist = distance(*combo)
    if dist <= 1000:
        list.append(dist)

Also, wouldn't you want to append the combo with the distance, instead of just the distance? To further save memory, put the dummy variable you are talking about in list_xy, so that you can subset by that before even calculating a distance.
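
A minimal sketch of that suggestion, assuming each record is a tuple of (id, easting, northing, dummy). The function name `near_flagged`, the `limit` parameter, and the sample rows are made up for illustration. Each record is compared only against the records whose dummy is 1, and each match keeps both ids along with the distance:

```python
import math

def distance(p1, p2):
    """Euclidean distance between two (easting, northing) points."""
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def near_flagged(records, limit=1000.0):
    """Yield (id, flagged_id, distance) for every record that lies within
    `limit` of a different record whose dummy is 1."""
    # Subset by the dummy first, so no distance is computed against a 0 record.
    flagged = [r for r in records if r[3] == 1]
    for rec in records:
        for other in flagged:
            if rec[0] == other[0]:
                continue  # a record does not count as its own neighbour
            d = distance(rec[1:3], other[1:3])
            if d <= limit:
                yield rec[0], other[0], d

# Hypothetical sample rows: (id, easting, northing, dummy)
sample = [
    ("a", 0.0, 0.0, 0),
    ("b", 300.0, 400.0, 1),
    ("c", 5000.0, 5000.0, 1),
    ("d", 600.0, 800.0, 0),
]
matches = list(near_flagged(sample))
```

Because `near_flagged` is a generator, pairs that fail the test are never stored. Only the matches collected into `matches` take up memory.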

(Dec-10-2018, 02:38 PM)ichabod801 Wrote: If you don't want them stored in memory, don't store them in memory. Use a conditional to check before appending them to the list.

Thanks for your reply! Yes, you are right. What I am trying to achieve is essentially creating a new column in the data which answers the question "does this ID have another ID within 1000 m that has dummy = 1?" and another column that shows what that distance actually is. So I think what I need in the list is both the distance and an identifier showing which id, or which combination, it refers to. I do not know if what I am doing is taking me in the right direction, so let me know if you have any advice!

(Dec-11-2018, 12:21 PM)amyd Wrote: what I am trying to achieve is essentially creating a new column in the data which answers the question "does this ID have another ID within 1000 m that has dummy = 1?" and another column that shows what that distance actually is.

Okay, I didn't understand that. In that case you want two lists. But what distances do you want to save? Do you want to save the minimum distance between the ID and any other ID? Or do you want to save all the distances under 1000 m?

(Dec-11-2018, 02:16 PM)ichabod801 Wrote: But what distances do you want to save? Do you want to save the minimum distance between the ID and any other ID? Or do you want to save all the distances under 1000 m?

I apologize, I should have made it clearer. I just want to save all the distances under 1000m.

That's going to be two separate things. First I would make another data set with the two id's and the distance, using something like this:

import math
from itertools import combinations

list_x = df["easting"].tolist()
list_y = df["northing"].tolist()
list_id = df["id"].tolist()

# Each record is (id, easting, northing).
list_xy = list(zip(list_id, list_x, list_y))

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

distances = []
valid = set()

for first, second in combinations(list_xy, 2):
    dist = distance(first[1:], second[1:])
    # Keep only pairs within 1000 m of each other.
    if dist <= 1000:
        distances.append((first[0], second[0], dist))
        valid.add(first[0])
        valid.add(second[0])

That will give you a list of tuples, each holding two id's and the distance between them. You can unzip that with zip(*distances) to get three columns to make a dataframe out of. You will also have valid, a set of all the id's with another id within 1000 m. You can use that to make a new column in your original dataframe showing which ones meet that criterion. To respect the dummy as well, include it in each tuple and only add an id to valid when the other id's dummy is 1.
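
With 100k observations, looping over every pair is still slow even when only the close ones are stored. One possible alternative, which is not what the code above does, is to bucket the dummy = 1 points into square grid cells with sides of 1000 m and compare each id only against the points in its own cell and the eight surrounding cells. The sketch below assumes records are (id, easting, northing, dummy) tuples. The function name `nearest_flagged` and the sample rows are hypothetical, and keeping only the nearest distance per id is an assumption about what the distance column should hold:

```python
import math
from collections import defaultdict

def nearest_flagged(records, limit=1000.0):
    """Map each id to the distance of its nearest other record with
    dummy == 1 that lies within `limit`, or None if there is none."""
    # Index only the flagged points, keyed by the grid cell they fall in.
    grid = defaultdict(list)
    for rid, x, y, dummy in records:
        if dummy == 1:
            cell = (math.floor(x / limit), math.floor(y / limit))
            grid[cell].append((rid, x, y))

    result = {}
    for rid, x, y, _ in records:
        cx, cy = math.floor(x / limit), math.floor(y / limit)
        best = None
        # Cells are `limit` wide, so any point within `limit` must sit in
        # the same cell or in one of the 8 neighbouring cells.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for oid, ox, oy in grid.get((cx + dx, cy + dy), ()):
                    if oid == rid:
                        continue  # skip the record itself
                    d = math.hypot(x - ox, y - oy)
                    if d <= limit and (best is None or d < best):
                        best = d
        result[rid] = best
    return result

# Hypothetical sample rows: (id, easting, northing, dummy)
sample = [
    ("a", 0.0, 0.0, 0),
    ("b", 300.0, 400.0, 1),
    ("c", 5000.0, 5000.0, 1),
    ("d", 600.0, 800.0, 0),
    ("e", 1100.0, 0.0, 0),
]
nearest = nearest_flagged(sample)
```

The resulting dictionary could then feed the two new columns, for example with something like df["id"].map(nearest) for the distance and a not-null check on that column for the yes/no flag. If SciPy is available, scipy.spatial.cKDTree is another option for this kind of radius search.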

(Dec-11-2018, 09:39 PM)ichabod801 Wrote: That's going to be two separate things. First I would make another data set with the two id's and the distance.

Thanks so much for the advice!