Clustering based on a variable and on a distance matrix

flucoe · Dec-13-2018, 05:41 PM

I have a dataset with locations (coordinates) and a scalar attribute of each location (for example, temperature). I need to cluster the locations based on the scalar attribute, but taking into consideration the distance between locations.

The problem is that, using temperature as an example, it is possible for locations that are far from each other to have the same temperature. If I cluster on temperature, these locations will be in the same cluster when they shouldn't. The opposite is true if two locations that are near each other have different temperatures. In this case, clustering on temperature may result in these observations being in different clusters, while clustering based on a distance matrix would put them in the same one.

So, is there a way in which I could cluster observations giving more importance to one attribute (temperature) and then "refining" based on the distance matrix?

Here is a simple example showing how clustering differs depending on whether an attribute is used as the basis or the distance matrix. My goal is to be able to use both, the attribute and the distance matrix, giving more importance to the attribute.

        
          
          
              
              import numpy as np
import matplotlib.pyplot as plt
import haversine
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd
 
# Create location data
x = np.random.rand(100, 1)
y = np.random.rand(100, 1)
 
t = np.random.randint(0, 20, size=(100,1))
 
# Compute distance matrix
D = np.zeros((len(x),len(y)))
for k in range(len(x)):
    for j in range(len(y)):
        distance_pair= haversine.distance((x[k], y[k]), (x[j], y[j]))
        D[k,j] = distance_pair
 
# Compare clustering alternatives
Zt = linkage(t, 'complete')
Zd = linkage(ssd.squareform(D), method="complete")
 
# Cluster based on t
clt = fcluster(Zt, 5, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clt)  
plt.show()
 
# Cluster based on distance matrix
cld = fcluster(Zd, 10, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)  
plt.show()

            

        
      

haversine.py is available here: https://gist.github.com/rochacbruno/2883505

For full disclosure, I posted this question in stackoverflow a couple of days ago but so far I haven't received any feedback.

Thanks

**scidam** · Dec-16-2018, 01:30 AM

Hi!

If you want to take into account coordinates along with temperatures, you probably need to use
custom distance, e.g. weighted distance:

        
              import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd
from scipy.spatial.distance import pdist, cdist
 
# copied from haversine.py 
import math
def haversine_distance(origin, destination):
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371 # km
 
    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c
 
    return d 
  
def compute_distance_matrix(X, temp_weight=None, coord_weight=None):
    D = pdist(X[:, :2], lambda x, y: haversine_distance(x, y))  
    T = pdist(X, lambda x, y: abs(x[-1] - y[-1])) # pairwise differences by temperature
    if temp_weight is None or coord_weight is None: # if at least one of the weights isn't defined, use coordinates only
        return D
    return temp_weight * T + coord_weight * D
        
np.random.seed(30) 
  
# Create location data
x = np.random.rand(100, 1)
y = np.random.rand(100, 1)
  
t = np.random.randint(0, 20, size=(100,1))
 
X = np.hstack([x, y, t])
 
# Compare clustering alternatives
distance_matrix = compute_distance_matrix(X, temp_weight=100, coord_weight=5)
Zd = linkage(distance_matrix, method="complete")
  
# Cluster based on distance matrix
cld = fcluster(Zd, 10, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)  
plt.show()

You need to make some experiments with temp_weight and coord_weight values
to get optimal result. The greater the temp_weight value, the bigger impact of temperature on
the cluster structure.

flucoe · Dec-16-2018, 09:57 PM

This is great, thank you!

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Stock clustering and portfolio diversification. Suitable features.	timurkanaz	1	1,171	Mar-27-2024, 09:54 AM Last Post: Larz60+
	K Means Clustering	antouanet	0	1,245	Jan-30-2023, 01:18 PM Last Post: antouanet
	updating cluster of elements based on the max value of distance	alex80	0	2,157	Oct-02-2020, 11:11 AM Last Post: alex80
	Dropping Rows From A Data Frame Based On A Variable	JoeDainton123	1	2,912	Aug-03-2020, 02:05 AM Last Post: scidam
	ValueError: The condensed distance matrix must contain only finite values.	kisumsam	1	5,924	Dec-29-2019, 10:14 AM Last Post: Larz60+
	matrix name including a variable	paul18fr	3	3,500	Nov-16-2019, 03:55 PM Last Post: paul18fr
	Clustering for imbalanced data sets	dervast	0	2,162	Sep-25-2019, 06:34 AM Last Post: dervast
	Could anyone help me get the jaccard distance between my dataframes please? :)	a_real_phoenix	0	2,403	Jun-27-2019, 06:01 PM Last Post: a_real_phoenix
	text clustering evaluation ??	khalidreemy	1	2,843	May-29-2019, 03:10 AM Last Post: heiner55
	Sklearn Agglomerative Hierarchical Clustering - help with array set up	pstarrett	4	6,425	Feb-21-2017, 05:05 AM Last Post: pstarrett

Clustering based on a variable and on a distance matrix

User Panel Messages

Announcements