Weighting The Python KMeans Procedure

zsfeinstein · (This post was last modified: Apr-29-2019, 10:28 PM by micseydel.)

Below is sample code for running a KMeans cluster analysis in Python:

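import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

sample_dat = pd.DataFrame(np.array([[1,0,1,1,1,5],
                                    [0,0,0,0,1,3],
                                    [1,0,0,0,1,1],
                                    [1,0,0,1,1,1],
                                    [1,0,0,0,1,1],
                                    [1,1,0,0,1,1]]),
                          columns=['var1','var2','var3','var4','var5','cnt'])
sample_dat
# Drop cnt so that it is not used as a clustering feature
sample_dat = sample_dat.drop(['cnt'],axis=1)
sample_dat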

sum_of_squared_distances = []
K = range(1,5)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(sample_dat, sample_weight = None)
    sum_of_squared_distances.append(km.inertia_)
print(sum_of_squared_distances)

The key point above is that I drop the last column, cnt, before fitting, and that sample_weight is None so far. I would like to write the code so that cnt is NOT included in the cluster analysis as a feature, but instead represents the weight of each record.

For example, record #1 will be weighted 5 times as much as a standard record. Record #2 will be weighted 3 times as much. And so on...
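To make the weighting described above concrete, here is a minimal sketch of one way it might be done: keep cnt in a separate array before dropping it, then pass that array through the sample_weight argument of KMeans.fit. The names features, weights, and weighted_inertias are made up for this illustration, and n_init=10 and random_state=0 are arbitrary choices to make runs repeatable. It assumes a scikit-learn release in which inertia_ is the weighted sum of squared distances.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

sample_dat = pd.DataFrame(
    np.array([[1, 0, 1, 1, 1, 5],
              [0, 0, 0, 0, 1, 3],
              [1, 0, 0, 0, 1, 1],
              [1, 0, 0, 1, 1, 1],
              [1, 0, 0, 0, 1, 1],
              [1, 1, 0, 0, 1, 1]]),
    columns=['var1', 'var2', 'var3', 'var4', 'var5', 'cnt'])

# Keep cnt as the per-record weight, then remove it from the features
weights = sample_dat['cnt'].to_numpy()
features = sample_dat.drop(columns=['cnt'])

weighted_inertias = []
for k in range(1, 5):
    # n_init and random_state are illustrative settings, not requirements
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    # cnt influences the fit only through sample_weight
    km.fit(features, sample_weight=weights)
    weighted_inertias.append(km.inertia_)
print(weighted_inertias)
```

With this layout cnt never enters the distance calculation; it only scales how much each record pulls on the cluster centres and how much it contributes to the inertia.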

Below is what the code looks like without weighting, where each record is instead repeated cnt times. Notice that it has one fewer column, because the cnt column is removed:

sample_dat = pd.DataFrame(np.array([[1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [0,0,0,0,1],
                                    [0,0,0,0,1],
                                    [0,0,0,0,1],
                                    [1,0,0,0,1],
                                    [1,0,0,1,1],
                                    [1,0,0,0,1],
                                    [1,1,0,0,1]]),
              columns=['var1','var2','var3','var4','var5'])
sample_dat

sum_of_squared_distances = []
K = range(1,5)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(sample_dat, sample_weight = None)
    sum_of_squared_distances.append(km.inertia_)   
print(sum_of_squared_distances)

Notice that this has six more rows: 4 more for row #1 and 2 more for row #2.

The sums of squared distances are clearly different from those of the first version. Here sample_weight = None is appropriate, because the repeated rows already carry the weighting. How can I get the same result using only the first set of code above?
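One way to check whether a weighted fit on the six-row data matches the repeated-row data is to build the expanded frame programmatically and compare inertias. The sketch below does this for a single cluster, where the solution is just the (weighted) mean and so does not depend on initialisation. The names features, weights, and expanded are hypothetical, and the comparison assumes a scikit-learn release that uses the weights unscaled when computing inertia_.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

features = pd.DataFrame(
    np.array([[1, 0, 1, 1, 1],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 0, 1],
              [1, 0, 0, 1, 1],
              [1, 0, 0, 0, 1],
              [1, 1, 0, 0, 1]]),
    columns=['var1', 'var2', 'var3', 'var4', 'var5'])
weights = np.array([5, 3, 1, 1, 1, 1])  # the cnt column

# Repeat each row cnt times to rebuild the 12-row unweighted data
expanded = features.loc[features.index.repeat(weights)].reset_index(drop=True)

# Weighted fit on 6 rows versus unweighted fit on 12 rows, one cluster each
km_weighted = KMeans(n_clusters=1, n_init=10, random_state=0)
km_weighted.fit(features, sample_weight=weights)

km_expanded = KMeans(n_clusters=1, n_init=10, random_state=0)
km_expanded.fit(expanded)

weighted_inertia = km_weighted.inertia_
expanded_inertia = km_expanded.inertia_
print(weighted_inertia, expanded_inertia)
print(np.isclose(weighted_inertia, expanded_inertia))
```

For more than one cluster the two fits can still land on different local optima, since the random initialisation sees different row counts, so small differences there would not necessarily mean the weighting is wrong.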
