Python Forum
Weighting The Python KMeans Procedure
Below is sample code for running a KMeans cluster analysis in Python:

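import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Five binary features plus a count column (cnt) holding each record's weight
sample_dat = pd.DataFrame(np.array([[1,0,1,1,1,5],
                                    [0,0,0,0,1,3],
                                    [1,0,0,0,1,1],
                                    [1,0,0,1,1,1],
                                    [1,0,0,0,1,1],
                                    [1,1,0,0,1,1]]),
                          columns=['var1','var2','var3','var4','var5','cnt'])
sample_dat
# Drop cnt so it is not used as a clustering feature
sample_dat = sample_dat.drop(['cnt'],axis=1)
sample_dat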

sum_of_squared_distances = []
K = range(1,5)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(sample_dat, sample_weight = None)
    sum_of_squared_distances.append(km.inertia_)
print(sum_of_squared_distances)
The key point above is that I drop the last column, cnt, before clustering, and that sample_weight = None so far. I want to write the code so that cnt is NOT included as a feature in the cluster analysis, but is instead used as the weight of each record.

For example, record #1 will be weighted 5 times as much as a standard record. Record #2 will be weighted 3 times as much. And so on...

Below is what the equivalent code looks like without weighting: each record is repeated cnt times instead. Notice that it has one column fewer, because the cnt column was removed:

sample_dat = pd.DataFrame(np.array([[1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [1,0,1,1,1],
                                    [0,0,0,0,1],
                                    [0,0,0,0,1],
                                    [0,0,0,0,1],
                                    [1,0,0,0,1],
                                    [1,0,0,1,1],
                                    [1,0,0,0,1],
                                    [1,1,0,0,1]]),
              columns=['var1','var2','var3','var4','var5'])
sample_dat

sum_of_squared_distances = []
K = range(1,5)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(sample_dat, sample_weight = None)
    sum_of_squared_distances.append(km.inertia_)   
print(sum_of_squared_distances)
Notice that this has six more rows: 4 more for row #1 and 2 more for row #2.
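
The expanded table above does not have to be typed by hand. A minimal sketch, assuming the counts live in the cnt column as in the first example (the names counted, features and expanded are only illustrative):

```python
import pandas as pd

# Same six records as the first example, with cnt as the repeat count
counted = pd.DataFrame(
    [[1, 0, 1, 1, 1, 5],
     [0, 0, 0, 0, 1, 3],
     [1, 0, 0, 0, 1, 1],
     [1, 0, 0, 1, 1, 1],
     [1, 0, 0, 0, 1, 1],
     [1, 1, 0, 0, 1, 1]],
    columns=['var1', 'var2', 'var3', 'var4', 'var5', 'cnt'],
)

# Keep only the clustering features
features = counted.drop(columns='cnt')

# Repeat each row label cnt times, then select those rows
expanded = features.loc[features.index.repeat(counted['cnt'])]
expanded = expanded.reset_index(drop=True)

print(expanded.shape)
```

This should give the same 12 rows as the hand-written table, in the same order.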

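The sums of squared distances are clearly different from those of the first version. In this expanded version sample_weight = None is appropriate, because the weighting is already expressed by the repeated rows. How can I get the same result using only the first set of code above?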
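
One possible approach, sketched under assumptions rather than offered as a definitive answer: keep cnt out of the feature matrix and pass it to fit through the sample_weight argument that the code above currently sets to None. The variable names (counted, X, weights, expanded) are illustrative. For k = 1 the weighted fit and the expanded fit should agree, since both reduce to a weighted mean. For larger k the two can differ from run to run because the initial centers are chosen randomly.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

counted = pd.DataFrame(
    [[1, 0, 1, 1, 1, 5],
     [0, 0, 0, 0, 1, 3],
     [1, 0, 0, 0, 1, 1],
     [1, 0, 0, 1, 1, 1],
     [1, 0, 0, 0, 1, 1],
     [1, 1, 0, 0, 1, 1]],
    columns=['var1', 'var2', 'var3', 'var4', 'var5', 'cnt'],
)

X = counted.drop(columns='cnt')      # features only, cnt excluded
weights = counted['cnt'].to_numpy()  # one weight per record

# Reference: the same data with rows physically repeated cnt times
expanded = X.loc[X.index.repeat(weights)]

# k = 1 is deterministic, so it is a clean check of the equivalence
km_weighted = KMeans(n_clusters=1, n_init=10, random_state=0)
km_weighted.fit(X, sample_weight=weights)
km_expanded = KMeans(n_clusters=1, n_init=10, random_state=0)
km_expanded.fit(expanded)
print(km_weighted.inertia_, km_expanded.inertia_)

# The original loop, with cnt supplied as the weights
weighted_inertias = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X, sample_weight=weights)
    weighted_inertias.append(km.inertia_)
print(weighted_inertias)
```

The random_state and n_init values are arbitrary choices to make repeated runs comparable. They are not part of the original code.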