Hello everyone,
I recently started studying automatic classification with the K-Means method, which interests me greatly. As an example, I have a dataset that lists cheeses along with several of their components (calories, lipids, etc.), in this form: https://zupimages.net/viewer.php?id=20/36/imce.png
I want to create 4 groups with the lowest possible homogeneity value (the average distance of the observations from the center of their respective class) and the highest possible dispersion (the average distance between classes). I know that statistical software such as Sphinx can report these numbers (example of its output here: https://zupimages.net/viewer.php?id=20/36/khlr.png).
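To be explicit about the two quantities, this is how I would write them. This assumes Euclidean distance and takes the "distance between classes" to mean the distance between class centers, which is my own interpretation; Sphinx may define it differently. The symbols are my own notation: x_i is an observation, k(i) the class it is assigned to, c_k the center of class k, n the number of observations and K the number of classes (4 here).

```latex
H = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - c_{k(i)} \rVert ,
\qquad
D = \frac{2}{K(K-1)} \sum_{1 \le k < l \le K} \lVert c_k - c_l \rVert
```

H is the homogeneity I would like to minimize and D the dispersion I would like to maximize.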
My idea is to generate a number of candidate groupings with KMeans and then keep only the one that best meets these two conditions. Unfortunately, despite my research, I could not find how to extract the homogeneity and the dispersion.
My research did, however, lead me to the following reproducible example:
    import pandas as pd
    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
    from sklearn import cluster, metrics

    data = pd.DataFrame({
        "fromage": [f"fromage{i}" for i in range(1, 22)],
        "calories": np.random.uniform(low=100, high=450, size=(21,)),
        "sodium": np.random.uniform(low=20, high=450, size=(21,)),
        "calcium": np.random.uniform(low=70, high=250, size=(21,)),
        "lipides": np.random.uniform(low=20, high=30, size=(21,)),
        "retinol": np.random.uniform(low=50, high=120, size=(21,)),
        "folates": np.random.uniform(low=1, high=30, size=(21,)),
        "proteines": np.random.uniform(low=7, high=20, size=(21,)),
        "cholesterol": np.random.uniform(low=100, high=450, size=(21,)),
    })

    # Use the cheese name as the index
    data = data.set_index("fromage")

    # Create the groups
    kmeans = cluster.KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data)

    # Sorted indices of the groups
    idk = np.argsort(kmeans.labels_)

    # Mean of each variable
    m = data.mean()

    # TSS: total sum of squares, per variable
    TSS = data.shape[0] * data.var(ddof=0)

    # Data frame grouped by cluster label
    gb = data.groupby(kmeans.labels_)

    # Number of observations in each group
    nk = gb.size()

    # Mean of each variable within each group
    mk = gb.mean()

    # Squared deviation of each group mean from the overall mean, per variable
    EMk = (mk - m) ** 2

    # Weighted by the group sizes
    EM = EMk.multiply(nk, axis=0)

    # Sum over groups => BSS (between-group sum of squares)
    BSS = np.sum(EM, axis=0)

    # Share of variance explained by group membership, per variable
    R2 = BSS / TSS

Is it possible to extract these numbers with one of the libraries that I used?
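To make the target concrete, here is a minimal sketch of how I imagine the two numbers could be computed by hand from a fitted model, as opposed to reading them directly from a library. It assumes Euclidean distance, and the helper names `homogeneity` and `dispersion` are my own, not part of scikit-learn. The random array `X` is only a stand-in for the cheese table, and picking the run with the largest dispersion/homogeneity ratio is just one assumed way of combining the two conditions.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans


def homogeneity(X, labels, centers):
    """Mean Euclidean distance of each observation to its own cluster center."""
    return float(np.linalg.norm(X - centers[labels], axis=1).mean())


def dispersion(centers):
    """Mean pairwise Euclidean distance between cluster centers."""
    return float(pdist(centers).mean())


# Hypothetical stand-in for the cheese table: 21 rows, 8 numeric columns.
rng = np.random.default_rng(0)
X = rng.uniform(low=0, high=100, size=(21, 8))

best = None
for seed in range(10):
    km = KMeans(n_clusters=4, n_init=10, random_state=seed).fit(X)
    h = homogeneity(X, km.labels_, km.cluster_centers_)
    d = dispersion(km.cluster_centers_)
    # Assumed selection rule: keep the run with the largest d / h ratio.
    # Note: km.inertia_ is the sum of *squared* distances to the closest
    # center, so it is related to h but is not the same number.
    if best is None or d / h > best[0]:
        best = (d / h, seed, h, d)

print("ratio, seed, homogeneity, dispersion:", best)
```

If something along these lines is already built into scikit-learn or SciPy, that is what I am looking for.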
Thank you.