Python Forum

Create homogeneous groups with KMeans?
Hello everyone,

I recently started studying automatic classification with the K-Means method, which interests me greatly. As an example, I have a dataset that lists cheeses along with their nutritional components (calories, lipids, etc.), in this form: https://zupimages.net/viewer.php?id=20/36/imce.png

I want to create 4 groups with the lowest possible homogeneity value (the average distance of observations from the center of their respective classes) and the highest possible dispersion (the average distance between classes). I know that statistical software such as Sphinx can report these numbers (example output here: https://zupimages.net/viewer.php?id=20/36/khlr.png).
My idea is to generate several candidate groupings with KMeans and then keep only the one that best meets these two conditions. Unfortunately, despite my research, I could not find how to extract the homogeneity and the dispersion.

My research did, however, let me put together the following reproducible example:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn import cluster, metrics

# Fix the seed so the random example data is the same on every run
np.random.seed(0)
data = pd.DataFrame({
    "fromage": [f"fromage{i}" for i in range(1, 22)],
    "calories": np.random.uniform(low=100, high=450, size=(21,)),
    "sodium": np.random.uniform(low=20, high=450, size=(21,)),
    "calcium": np.random.uniform(low=70, high=250, size=(21,)),
    "lipides": np.random.uniform(low=20, high=30, size=(21,)),
    "retinol": np.random.uniform(low=50, high=120, size=(21,)),
    "folates": np.random.uniform(low=1, high=30, size=(21,)),
    "proteines": np.random.uniform(low=7, high=20, size=(21,)),
    "cholesterol": np.random.uniform(low=100, high=450, size=(21,)),
})
# Use the cheese name as the index
data = data.set_index("fromage")
# Build the groups
kmeans = cluster.KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(data)
# Row positions sorted by group label
idk = np.argsort(kmeans.labels_)
# Overall mean of each variable
m = data.mean()
# TSS: total sum of squares, per variable
TSS = data.shape[0]*data.var(ddof=0)
# Data frame grouped by cluster label
gb = data.groupby(kmeans.labels_)
# Number of observations in each group
nk = gb.size()
# Mean of each variable within each group
mk = gb.mean()
# For each group, squared deviation of the group mean from the overall mean, per variable
EMk = (mk-m)**2
# Weighted by the group sizes
EM = EMk.multiply(nk,axis=0)
# Sum over groups => BSS (between-group sum of squares), per variable
BSS = np.sum(EM,axis=0)
# Share of variance explained by group membership, for each variable
R2 = BSS/TSS
Is it possible to extract the homogeneity and dispersion values with one of the libraries I used?
Thank you.
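
One possible way to get these two numbers is sketched below. It assumes that "homogeneity" means the mean Euclidean distance from each observation to the center of its own cluster, and that "dispersion" means the mean pairwise Euclidean distance between cluster centers. Both definitions are read from the description in the question, not from Sphinx's documentation, so the values may not match that software. The helper name `homogeneity_and_dispersion`, the stand-in random data, and the selection rule (largest dispersion-to-homogeneity ratio) are illustrative assumptions only.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans


def homogeneity_and_dispersion(X, model):
    """Return (mean distance to own center, mean distance between centers).

    Hypothetical helper: the two definitions are assumptions taken from
    the wording of the question.
    """
    X = np.asarray(X, dtype=float)
    # KMeans.transform() gives the Euclidean distance from every row to every center
    dist_to_centers = model.transform(X)
    # Keep only the distance to the center each row was assigned to
    own = dist_to_centers[np.arange(X.shape[0]), model.labels_]
    homogeneity = own.mean()
    # pdist() returns all pairwise Euclidean distances between the centers
    dispersion = pdist(model.cluster_centers_).mean()
    return homogeneity, dispersion


# Stand-in data (assumption): 21 rows, 8 numeric columns, seeded for repeatability
rng = np.random.default_rng(0)
X = rng.uniform(low=0, high=100, size=(21, 8))

# Fit several candidate groupings, one per seed, and record both numbers
results = []
for seed in range(10):
    model = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    h, d = homogeneity_and_dispersion(X, model)
    results.append((seed, h, d))

# Illustrative selection rule: keep the run with the largest dispersion / homogeneity ratio
best_seed, best_h, best_d = max(results, key=lambda r: r[2] / r[1])
print(best_seed, round(best_h, 2), round(best_d, 2))
```

Since the variables are on very different scales, it may be worth standardizing them (for example with `sklearn.preprocessing.StandardScaler`) before fitting, otherwise the largest-valued columns dominate the distances. For ready-made indices, `kmeans.inertia_` holds the within-cluster sum of squared distances, and `sklearn.metrics` provides `silhouette_score`, `calinski_harabasz_score` and `davies_bouldin_score`, which combine compactness and separation in related ways.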