Appending DBSCAN clusters to original input?

Madraykin · Jan-02-2019, 04:53 PM

Hi everyone,

Feeling pretty proud of myself: as a Python newbie I've managed to reduce my massive dataset down using t-SNE and then cluster it using DBSCAN (it has taken a lot of blood, sweat and tears, but I've managed it!).

The only issue I have now is that I don't think I can see which of the 'clusters' my original data fits into. To try and explain: I originally imported a CSV via Pandas, the TSNE function within sklearn then reduced the data and produced a 2D NumPy array, and I fed that into the DBSCAN function. That gave me 12 distinct clusters, which I have been able to scatter plot, and I'm happy with the results.

What I would love to be able to do (but am not sure if it's possible) is add a column called 'Clusters' to my initial input data (from the CSV), holding a number between 1 and 12 that indicates which cluster that row of data belongs to. I've not really added a new column before, and I'm unsure how to go about it and what I need to specify to populate it.

The code is quite lengthy (by my standards) and I do it in bits to test things out, if you guys need to see any specific parts to help you to help me just let me know and I'll extract them. Or of course if you need to see everything let me know - I'm new to this!

Any help appreciated, and I'll try to answer any questions you might have

Happy New Year all!

Mads

stullis · Jan-03-2019, 05:04 AM

I'm sure we'd be glad to help you out, but we'll need to see your code. If I'm looking at the correct source code for DBSCAN, the function returns some data with labels. You could store those data in a pair of variables to view them. I'm not sure what those data structures look like so I cannot explain how to join the cluster identifier as a column to the base data.
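
For reference, here is a minimal sketch of what a fitted DBSCAN object exposes in scikit-learn. The four-point array and the eps and min_samples values are made up for illustration and are not taken from the thread. fit() returns the estimator itself, and the results live on attributes whose names end in an underscore.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy input: three points close together and one far away.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0]])

# Illustrative parameters only; real data needs its own eps/min_samples.
db = DBSCAN(eps=0.5, min_samples=3).fit(points)

# One integer per input row, in input order; -1 marks noise.
print(db.labels_)

# Row indices (into `points`) of the core samples.
print(db.core_sample_indices_)

# Coordinates of the core samples, shape (n_core_samples, n_features).
print(db.components_.shape)
```

Of the three, labels_ is the only one with exactly one entry per input row, which is what makes it a candidate for a per-row cluster column.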

Madraykin · (This post was last modified: Jan-03-2019, 08:50 AM by Madraykin.)

Will try my best - will extract the code now, thanks so much!

Right, here goes! Remember I'm a newbie - any criticism of the code welcomed so I can grow and learn!

import pandas as pd
data = pd.read_csv('MY CSV PATH HERE')
import numpy as np
persona = np.array(data)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
from matplotlib import pyplot as plt

persona_2d = tsne.fit_transform(persona)

import sys
import time
import argparse
import traceback
from sklearn.neighbors import NearestNeighbors
from scipy.stats import chisquare as sci_chisquare
from scipy.stats import poisson

persona3 = np.array(persona_2d, dtype='float64')

from sklearn.cluster import DBSCAN

dbscan.fit(persona3)

from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
persona3, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
persona3 = StandardScaler().fit_transform(persona3)

db = DBSCAN(eps=0.1, min_samples=10).fit(persona3)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(persona3, labels))

unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = persona3[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = persona3[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Thankful to this site for helping me work out how to do the code https://scikit-learn.org/stable/auto_exa...bscan.html

I've probably got some hideous mistakes up there, so any feedback welcomed - I'm really enjoying learning Python and am not averse to being ripped apart, as that's generally the way I learn (ie. getting things really wrong first!)

I hope this is what you guys need to help, thanks so much!

stullis · Jan-03-2019, 10:10 PM

I'm glad you're comfortable with critique.

Imports should always be at the top of the code and not intermixed with the rest of the code. It's also recommended that they be organized with standard library imports first, then third party imports, and finally local (personal) imports.

I see three instances of variables named variants of "persona" and only one of them truly gets used in the code. The first two are only used to build up to the third one. So, I consolidated those into a single line with nested functions. The variable persona3 is now just persona.

Likewise, I removed centers and used the associated list as an argument only. Generally, if a value is only going to be used once as an argument to a function, there isn't value in creating a variable for it. Variables are for storing values that will be used repeatedly or to obviate a function being performed repeatedly with no changes (as in a loop).

As for the data you want to append to your data set, that is currently stored in db, which line 32 of the code below assigns with db = DBSCAN(eps=0.1, min_samples=10).fit(persona). It's in one of the following (I'm not sure which one): db.components_, db.labels_, or db.core_sample_indices_. The source code for DBSCAN.fit() sets those three attributes.

import sys
import time
import argparse
import traceback

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
from scipy.stats import chisquare as sci_chisquare
from scipy.stats import poisson
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('MY CSV PATH HERE')
tsne = TSNE(n_components=2, random_state=0)
persona = np.array(tsne.fit_transform(np.array(data)), dtype='float64')
dbscan.fit(persona) # dbscan is not instantiated above, what is it?

persona, labels_true = make_blobs(
    n_samples=750,
    centers=[[1, 1], [-1, -1], [1, -1]],
    cluster_std=0.4,
    random_state=0
)
persona = StandardScaler().fit_transform(persona)

db = DBSCAN(eps=0.1, min_samples=10).fit(persona)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % list(db.labels_).count(-1))
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, db.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, db.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, db.labels_))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels_true, db.labels_))
print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(labels_true, db.labels_))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(persona, db.labels_))

unique_labels = set(db.labels_)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1] # Black used for noise.

    class_member_mask = (db.labels_ == k) # Element-wise comparison: boolean array, True for rows in cluster k.

    xy = persona[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = persona[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
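
On the original question of getting a 'Clusters' column back onto the input data, here is a rough sketch rather than a definitive answer. It assumes DBSCAN is fitted on an array that has one row per row of the DataFrame, in the same order (as the t-SNE output of the CSV would). Note that the code in this thread replaces persona with make_blobs output before fitting, so those labels describe 750 synthetic points and would not line up with the CSV rows. The small DataFrame, the embedding variable, and the eps/min_samples values below are all hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for the DataFrame loaded with pd.read_csv(...).
data = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 5.0, 5.1, 20.0],
    "y": [0.0, 0.1, 0.0, 5.0, 5.1, 20.0],
})

# Stand-in for the 2-D t-SNE output: one row per row of `data`, same order.
embedding = data.to_numpy(dtype="float64")

# Illustrative parameters only.
db = DBSCAN(eps=0.5, min_samples=2).fit(embedding)

# labels_ is aligned with the input rows, so it can be assigned directly.
# scikit-learn numbers clusters from 0 and uses -1 for noise; shift the
# cluster ids to start at 1 and leave noise as -1.
data["Clusters"] = np.where(db.labels_ == -1, -1, db.labels_ + 1)

print(data)
```

If the DataFrame has been filtered or re-sorted between loading and clustering, the positional assignment above would no longer match, so it is worth checking that len(data) equals len(db.labels_) before adding the column.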

Madraykin · (This post was last modified: Jan-04-2019, 09:29 AM by Madraykin.)

Thanks so much stullis - yep, my code is a mess, everything above is in separate run sections in my Jupyter notebook (as I tend to forget to do things, hence the imports here there and everywhere). I think a lot of the redundant things you point out are bits and pieces that I tried to see if they would work and when they didn't I didn't remove or adapt them successfully, but there are a great many things you point out that have been a good learning curve for me. I'm teaching myself Python, so it's great to be 'peer-reviewed' in such a way!

Thanks so much for taking a look, for the work to correct my mess, and for the critique, much appreciated, I'll work with this today and report back!

Madraykin · Jan-04-2019, 01:31 PM

Just a follow up - this is working like a dream! I'm playing around with parameters a lot more easily and hopefully will have something workable to scale up soon!

You're a star, very grateful!
