Python Forum
Clustering based on a variable and on a distance matrix
I have a dataset with locations (coordinates) and a scalar attribute of each location (for example, temperature). I need to cluster the locations based on the scalar attribute, but taking into consideration the distance between locations.

The problem is that, using temperature as an example, locations that are far from each other can have the same temperature. If I cluster on temperature alone, these locations end up in the same cluster when they shouldn't. The opposite happens when two locations that are near each other have different temperatures: clustering on temperature may put them in different clusters, while clustering on the distance matrix would put them in the same one.
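To make the two failure modes concrete, here is a minimal hand-built sketch. The four points, their temperatures, and the use of planar Euclidean distance (instead of haversine) are all assumptions chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical toy data: points 0 and 1 are neighbours, points 2 and 3 are
# neighbours, and the two pairs are far apart from each other.
xy = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
# Points 0 and 2 share one temperature, points 1 and 3 share another.
t = np.array([[5.0], [15.0], [5.0], [15.0]])

# Complete linkage on temperature alone, cut into two clusters.
by_temp = fcluster(linkage(t, method="complete"), 2, criterion="maxclust")
# Complete linkage on planar distance alone, cut into two clusters.
by_dist = fcluster(linkage(pdist(xy), method="complete"), 2, criterion="maxclust")

print("by temperature:", by_temp)
print("by distance:   ", by_dist)
```

Clustering on temperature groups the distant points 0 and 2 together, while clustering on distance groups the neighbours 0 and 1 together even though their temperatures differ.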

So, is there a way in which I could cluster observations giving more importance to one attribute (temperature) and then "refining" based on the distance matrix?

Here is a simple example showing how the clustering differs depending on whether it is based on the attribute or on the distance matrix. My goal is to use both the attribute and the distance matrix, giving more importance to the attribute.

import numpy as np
import matplotlib.pyplot as plt
import haversine
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd

# Create location data (1-D arrays, so x[k] and y[k] are plain scalars)
x = np.random.rand(100)
y = np.random.rand(100)

# Scalar attribute per location; linkage() expects a 2-D observation matrix
t = np.random.randint(0, 20, size=(100, 1))

# Compute the pairwise haversine distance matrix (symmetric, zero diagonal)
n = len(x)
D = np.zeros((n, n))
for k in range(n):
    for j in range(k + 1, n):
        D[k, j] = D[j, k] = haversine.distance((x[k], y[k]), (x[j], y[j]))

# Compare clustering alternatives
Zt = linkage(t, 'complete')
Zd = linkage(ssd.squareform(D), method="complete")

# Cluster based on t (cut the tree at a temperature difference of 5)
clt = fcluster(Zt, 5, criterion='distance')  # already a 1-D label array
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clt)
plt.show()

# Cluster based on distance matrix (cut the tree at a distance of 10)
cld = fcluster(Zd, 10, criterion='distance')
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)
plt.show()
haversine.py is available here: https://gist.github.com/rochacbruno/2883505
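
One possible reading of the goal (both sources of information, with more weight on the attribute) is to blend the two dissimilarities into a single matrix before calling linkage. The sketch below is only that, a sketch: the function name `combined_clusters`, the `weight_t` parameter, the rescaling to [0, 1], and the use of planar Euclidean distance as a stand-in for haversine are all assumptions, not an established method for this problem.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


def combined_clusters(xy, t, weight_t=0.8, n_clusters=4):
    """Cluster on a weighted mix of attribute and spatial dissimilarity.

    weight_t (hypothetical parameter) is the share given to the attribute;
    the remaining 1 - weight_t goes to the spatial distance.
    """
    # Condensed pairwise |t_i - t_j| and pairwise Euclidean distances.
    d_t = pdist(np.asarray(t, dtype=float).reshape(-1, 1))
    d_s = pdist(np.asarray(xy, dtype=float))
    # Rescale each to [0, 1] so the two weights are comparable.
    if d_t.max() > 0:
        d_t = d_t / d_t.max()
    if d_s.max() > 0:
        d_s = d_s / d_s.max()
    d = weight_t * d_t + (1.0 - weight_t) * d_s
    Z = linkage(d, method="complete")
    return fcluster(Z, n_clusters, criterion="maxclust")


# Hypothetical toy data: two spatial pairs far apart, temperatures crossed.
xy = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
t = np.array([5.0, 15.0, 5.0, 15.0])

print(combined_clusters(xy, t, weight_t=0.8, n_clusters=2))  # attribute dominates
print(combined_clusters(xy, t, weight_t=0.2, n_clusters=2))  # distance dominates
```

With a high `weight_t` the labels follow temperature, and with a low one they follow location. Nothing in this blend forces clusters to be spatially contiguous, so the weight would have to be tuned against the real data.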

For full disclosure, I posted this question on Stack Overflow a couple of days ago, but so far I haven't received any feedback.

Thanks
Messages In This Thread
Clustering based on a variable and on a distance matrix - by flucoe - Dec-13-2018, 05:41 PM
