zoukankan      html  css  js  c++  java
  • unsupervised learning -- K MEANS

    Altough it sounds quiet like KNN algorithm, however, KNN is a kind of classification algorithm of supervised learning while K MEANS is a kind of unsupervised learning algorithm. 

    K MEANS as a cluster method, can figure out k classes from the given dataset without labels, in which the class number k is given by user. 

    The procedure of K MEANS algorithm is:

    1. initial the centroids with radom points in dataset, which represent k classes
    2. calculate the others label based on these k classes through the minimum distence from the centroids
    3. recalcute the centroids based on the labels we calculated in the 2nd step
    4. repeat until the iterations ends

    And here is the procedure of the naive K MEANS algorithm:

     

    we can use K MEANS algorithm simply from sklearn:

    from sklearn.cluster import KMeans
    Kmean = KMeans(n_clusters=2)
    Kmean.fit(X)

    And here is a more explicit code

    import numpy as np
    from numpy.linalg import norm
    
    
    class Kmeans:
        '''Implementing Kmeans algorithm.'''
    
        def __init__(self, n_clusters, max_iter=100, random_state=123):
            self.n_clusters = n_clusters
            self.max_iter = max_iter
            self.random_state = random_state
    
        def initializ_centroids(self, X):
            np.random.RandomState(self.random_state)
            random_idx = np.random.permutation(X.shape[0])
            centroids = X[random_idx[:self.n_clusters]]
            return centroids
    
        def compute_centroids(self, X, labels):
            centroids = np.zeros((self.n_clusters, X.shape[1]))
            for k in range(self.n_clusters):
                centroids[k, :] = np.mean(X[labels == k, :], axis=0)
            return centroids
    
        def compute_distance(self, X, centroids):
            distance = np.zeros((X.shape[0], self.n_clusters))
            for k in range(self.n_clusters):
                row_norm = norm(X - centroids[k, :], axis=1)
                distance[:, k] = np.square(row_norm)
            return distance
    
        def find_closest_cluster(self, distance):
            return np.argmin(distance, axis=1)
    
        def compute_sse(self, X, labels, centroids):
            distance = np.zeros(X.shape[0])
            for k in range(self.n_clusters):
                distance[labels == k] = norm(X[labels == k] - centroids[k], axis=1)
            return np.sum(np.square(distance))
        
        def fit(self, X):
            self.centroids = self.initializ_centroids(X)
            for i in range(self.max_iter):
                old_centroids = self.centroids
                distance = self.compute_distance(X, old_centroids)
                self.labels = self.find_closest_cluster(distance)
                self.centroids = self.compute_centroids(X, self.labels)
                if np.all(old_centroids == self.centroids):
                    break
            self.error = self.compute_sse(X, self.labels, self.centroids)
        
        def predict(self, X):
            distance = self.compute_distance(X, old_centroids)
            return self.find_closest_cluster(distance)

    ref:https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a 

  • 相关阅读:
    Metasploit的使用命令_1
    Kali Linux -系统定制
    20200522随笔
    阿里大于接口的问题
    根据一篇文章学习逻辑漏洞
    flask注册蓝图报错
    python 生成验证码
    flask-mail 机制
    对巡风vulscan的理解
    “百度杯” YeSerCMS
  • 原文地址:https://www.cnblogs.com/yuelien/p/13883113.html
Copyright © 2011-2022 走看看