Distances between all combinations in dataset

amyd · (This post was last modified: Dec-10-2018, 12:38 PM by buran.)

I have a data set with a set of unique ids. Each id has an easting, a northing, and a dummy variable that takes the value 1 or 0. I want to find out whether, for each id, there is another id with dummy equal to 1 within 1000 meters. What I am trying to do is generate a list containing the distances for all possible combinations and then discard those which are further than 1000 m and those for which the dummy is 0. However, my dataset is quite large (100k observations), so I realise I am creating an incredibly large number of combinations. Is there a better way of doing this? Essentially, I am trying to find a way to discard observations that do not meet my criteria immediately rather than storing them in memory.

This is the code I am trying to run:

list_x = df["easting"].tolist()
list_y = df["northing"].tolist()

list_xy = (list(zip(list_x, list_y)))

import math

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

list=[]
from itertools import combinations

for combo in combinations(list_xy,2):
    dist = ([distance(*combo)])
    list.append(dist)
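
To put a number on "incredibly large": n observations give n(n-1)/2 unordered pairs. The check below is only a rough sketch. It assumes Python 3.8+ (for `math.comb`) and a 64-bit CPython build, where each list slot holds an 8-byte reference; the names `n_obs`, `n_pairs` and `min_bytes` are made up for illustration.

```python
import math

n_obs = 100_000                 # dataset size quoted in the post
n_pairs = math.comb(n_obs, 2)   # n * (n - 1) / 2 unordered pairs
print(n_pairs)                  # 4999950000

# Lower bound on memory for a list with one entry per pair:
# 8 bytes per reference (assumed 64-bit CPython), not counting the
# float objects or the one-element lists that wrap them.
min_bytes = n_pairs * 8
print(round(min_bytes / 1e9, 1), "GB")
```

So the list alone would need tens of gigabytes before any of the distances themselves are counted, which is why filtering during the loop matters.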

***ichabod801*** · Dec-10-2018, 02:38 PM

If you don't want them stored in memory, don't store them in memory. Use a conditional to check before appending them to the list:

for combo in combinations(list_xy,2):
    dist = distance(*combo)  # plain float, not a one-element list
    if dist <= 1000:         # keep only pairs within 1000 m
        list.append(dist)

Also, wouldn't you want to append the combo with the distance, instead of just the distance? To further save memory, put the dummy variable you are talking about in list_xy, so that you can subset by that before even calculating a distance.
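
A minimal sketch of that suggestion, assuming the id and the dummy travel with the coordinates as `(id, easting, northing, dummy)` tuples. The `points` list is a hypothetical stand-in for the dataframe columns, and `near` is an illustrative name, not something from the original code.

```python
import math
from itertools import combinations

# Hypothetical stand-in for df[["id", "easting", "northing", "dummy"]]
points = [
    ("a", 0.0, 0.0, 0),
    ("b", 300.0, 400.0, 1),    # 500 m from "a"
    ("c", 5000.0, 5000.0, 1),  # far from everything
    ("d", 100.0, 0.0, 0),      # 100 m from "a", but its dummy is 0
]

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

# (id, other_id, dist): other_id has dummy == 1 and lies within 1000 m of id
near = []
for first, second in combinations(points, 2):
    if not (first[3] or second[3]):
        continue  # neither point has dummy == 1: skip before computing a distance
    dist = distance(first[1:3], second[1:3])
    if dist <= 1000:
        if second[3]:
            near.append((first[0], second[0], dist))
        if first[3]:
            near.append((second[0], first[0], dist))

print(near)
```

Only qualifying pairs are ever appended, so the list grows with the number of matches rather than with the number of combinations. The loop itself still visits every pair.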

amyd · (This post was last modified: Dec-11-2018, 12:22 PM by amyd.)

Thanks for your reply! Yes, you are right. What I am trying to achieve is essentially a new column in the data which answers the question "does this ID have another ID within 1000 m that has dummy = 1?" and another column that shows what that distance actually is. So I think what I need in the list is both the distance and an identifier showing which id, or which combination, it refers to. I do not know if what I am doing is taking me in the right direction, so let me know if you have any advice!

***ichabod801*** · Dec-11-2018, 02:16 PM

(Dec-11-2018, 12:21 PM)amyd Wrote: what I am trying to achieve is essentially a new column in the data which answers the question "does this ID have another ID within 1000 m that has dummy = 1?" and another column that shows what that distance actually is.

Okay, I didn't understand that. In that case you want two lists. But what distances do you want to save? Do you want to save the minimum distance between the ID and any other ID? Or do you want to save all the distances under 1000 m?

amyd · Dec-11-2018, 03:22 PM

I apologize, I should have made it clearer. I just want to save all the distances under 1000m.

***ichabod801*** · Dec-11-2018, 09:39 PM

That's going to be two separate things. First I would make another data set with the two id's and the distance, using something like this:

import math
from itertools import combinations

list_x = df["easting"].tolist()
list_y = df["northing"].tolist()
list_id = df["id"].tolist()

# One (id, easting, northing) tuple per observation
list_xy = list(zip(list_id, list_x, list_y))

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

distances = []
valid = set()

for first, second in combinations(list_xy, 2):
    dist = distance(first[1:], second[1:])
    if dist <= 1000:  # keep only pairs within 1000 m
        distances.append((first[0], second[0], dist))
        valid.add(first[0])
        valid.add(second[0])

That will give you a list of tuples, each holding two ids and the distance between them. You can zip that to get three columns to make a dataframe out of. You will also have valid, a set of all the ids that have another id within 1000 m. You can use that to make a new column in your original dataframe showing which ones meet that criterion.
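
A small sketch of that last step, using plain lists so it runs on its own. The values in `distances`, `valid` and `list_id` are made up for illustration, and `id_a`, `id_b`, `dist` and `has_neighbour` are hypothetical names.

```python
# Hypothetical output of the loop above, for illustration only
distances = [(1, 2, 350.0), (1, 4, 920.5), (3, 4, 10.0)]
valid = {1, 2, 3, 4}
list_id = [1, 2, 3, 4, 5]

# Transpose the list of tuples into three parallel columns.
# Note: this unpacking raises ValueError if distances is empty.
id_a, id_b, dist = (list(col) for col in zip(*distances))

# One True/False flag per original id, in the original row order
has_neighbour = [i in valid for i in list_id]

print(id_a, id_b, dist)
print(has_neighbour)
```

With pandas, the same result should be reachable more directly with `pd.DataFrame(distances, columns=["id_a", "id_b", "dist"])` for the pair table and `df["id"].isin(valid)` for the flag column; the column names there are again only placeholders.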

amyd · Dec-13-2018, 01:23 PM

Thanks so much for the advice!
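
For the 100k-observation case raised at the start of the thread, the loops above still visit every pair even though they store few of them. One possible way around that, not proposed by anyone in the thread, is to bucket the dummy = 1 points into a grid of 1000 m cells and compare each id only against the points in its own cell and the eight surrounding ones. The sketch below assumes `(id, easting, northing, dummy)` tuples; `neighbours_within`, `RADIUS` and the sample `points` are hypothetical names and data.

```python
import math
from collections import defaultdict

RADIUS = 1000.0  # metres, the threshold used in the thread

def neighbours_within(points, radius=RADIUS):
    """points: list of (id, easting, northing, dummy) tuples.

    Returns {id: [(other_id, distance), ...]} covering every other id
    that has dummy == 1 and lies within `radius`. Ids with no such
    neighbour are left out of the result.
    """
    # Bucket only the dummy == 1 points by grid cell of side `radius`
    grid = defaultdict(list)
    for pid, x, y, dummy in points:
        if dummy:
            cell = (math.floor(x / radius), math.floor(y / radius))
            grid[cell].append((pid, x, y))

    result = {}
    for pid, x, y, _ in points:
        cx, cy = math.floor(x / radius), math.floor(y / radius)
        found = []
        # A point within `radius` must sit in this cell or one of its 8 neighbours
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for qid, qx, qy in grid.get((cx + dx, cy + dy), ()):
                    if qid == pid:
                        continue  # an id is not its own neighbour
                    d = math.hypot(x - qx, y - qy)
                    if d <= radius:
                        found.append((qid, d))
        if found:
            result[pid] = found
    return result

# Hypothetical sample data
points = [
    ("a", 0.0, 0.0, 0),
    ("b", 300.0, 400.0, 1),
    ("c", 5000.0, 5000.0, 1),
    ("d", 100.0, 0.0, 0),
]
res = neighbours_within(points)
print(res)
```

The keys of the returned dictionary play the role of the yes/no column, and the lists hold the distances under 1000 m. How much work this saves depends on how clustered the points are: if most of them fall into a few cells, the inner loop approaches the all-pairs cost again.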
