Python Forum

Below is some sample code for running a KMeans cluster analysis in Python:

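import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# The last column, cnt, is the number of times each record occurs
sample_dat = pd.DataFrame(np.array([[1,0,1,1,1,5],
                                    [0,0,0,0,1,3],
                                    [1,0,0,0,1,1],
                                    [1,0,0,1,1,1],
                                    [1,0,0,0,1,1],
                                    [1,1,0,0,1,1]]),
          columns=['var1','var2','var3','var4','var5','cnt'])
sample_dat
# Drop cnt so that it is not used as a clustering feature
sample_dat = sample_dat.drop(['cnt'],axis=1)
sample_dat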

sum_of_squared_distances = []
K = range(1,5)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(sample_dat, sample_weight = None)
    sum_of_squared_distances.append(km.inertia_)
print(sum_of_squared_distances)

The key step above is that I drop the last column, cnt, before fitting, and that sample_weight is still None. I would like to write the code so that cnt is NOT included in the cluster analysis as a feature, but instead represents the weight of each record.

For example, record #1 will be weighted 5 times as much as a standard record. Record #2 will be weighted 3 times as much. And so on...

Below is what the equivalent code looks like without weighting, where each record is instead repeated as many times as its cnt value. Notice that it has one fewer column, because the cnt column was removed:

sample_dat = pd.DataFrame(np.array([[1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [0,0,0,0,1],
                                    [0,0,0,0,1],
                                    [0,0,0,0,1],
                                    [1,0,0,0,1],
                                    [1,0,0,1,1],
                                    [1,0,0,0,1],
                                    [1,1,0,0,1]]),
              columns=['var1','var2','var3','var4','var5'])
sample_dat

sum_of_squared_distances = []
K = range(1,5)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(sample_dat, sample_weight = None)
    sum_of_squared_distances.append(km.inertia_)   
print(sum_of_squared_distances)

Notice that this has six more rows: 4 more for row #1 and 2 more for row #2.
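
As a minimal sketch, the repeated rows do not have to be typed out by hand. Assuming the original six-row frame still has its cnt column, pandas can repeat each row index cnt times (the name expanded is just illustrative):

```python
import numpy as np
import pandas as pd

sample_dat = pd.DataFrame(np.array([[1,0,1,1,1,5],
                                    [0,0,0,0,1,3],
                                    [1,0,0,0,1,1],
                                    [1,0,0,1,1,1],
                                    [1,0,0,0,1,1],
                                    [1,1,0,0,1,1]]),
                          columns=['var1','var2','var3','var4','var5','cnt'])

# Repeat each row label cnt times, select those rows, then drop cnt
repeated_index = sample_dat.index.repeat(sample_dat['cnt'].to_numpy())
expanded = (sample_dat.loc[repeated_index]
            .drop(columns='cnt')
            .reset_index(drop=True))

print(expanded.shape)  # (12, 5)
```

This should reproduce the twelve-row frame shown above, in the same row order.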

The sums of squared distances are clearly different between the two versions. In the second version sample_weight = None is appropriate, because the repeated rows already carry the weighting. How can I get the same result using only the first set of code above?
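
One possible direction, as a rough sketch only: KMeans.fit in scikit-learn takes a sample_weight argument, so cnt could be kept aside as the weights before it is dropped from the features. The names weights and features, and the n_init and random_state settings, are my own additions for reproducibility. I have not confirmed that this matches the repeated-row results for every k, since that can depend on initialization and on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

sample_dat = pd.DataFrame(np.array([[1,0,1,1,1,5],
                                    [0,0,0,0,1,3],
                                    [1,0,0,0,1,1],
                                    [1,0,0,1,1,1],
                                    [1,0,0,0,1,1],
                                    [1,1,0,0,1,1]]),
                          columns=['var1','var2','var3','var4','var5','cnt'])

# Keep cnt as per-record weights, and exclude it from the features
weights = sample_dat['cnt']
features = sample_dat.drop(columns='cnt')

sum_of_squared_distances = []
for k in range(1, 5):
    # n_init and random_state are assumptions, fixed here for repeatable runs
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(features, sample_weight=weights)
    sum_of_squared_distances.append(km.inertia_)
print(sum_of_squared_distances)
```

With this, record #1 should count 5 times as much as a record with cnt = 1 when the centers and the inertia are computed.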

zsfeinstein