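  • k-means: principles and a Python implementation

    k-means is an unsupervised clustering algorithm.

    k is the number of clusters the data will be partitioned into; "means" refers to the mean, which is the rule used to update the cluster centers at each iteration.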

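    The k-means algorithm works as follows:

    (1) Randomly choose k cluster centers (usually drawn from the sample set, although arbitrary random points also work);

    (2) Compute the distance from each sample to each of the k centers and assign the sample to the cluster whose center is nearest;

    (3) Update the centers: for each of the k clusters, take the mean of the samples assigned to it as the new center;

    (4) Repeat (2) and (3) until the centers stop changing, or the change in center positions falls within a threshold, or a maximum number of iterations is reached.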
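
The four steps above can be sketched compactly with NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `kmeans_sketch` and the parameters `tol`, `max_iter`, and `seed` are hypothetical names chosen for illustration, and a cluster that ends up empty simply keeps its previous center.

```python
import numpy as np


def kmeans_sketch(data, k, tol=1e-4, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Step (1): pick k distinct samples as the initial centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):  # Step (4): cap on the number of iterations
        # Step (2): distance from every sample to every center, shape (n, k).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step (3): each center moves to the mean of its samples;
        # a center with no samples keeps its old position (assumption).
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step (4): stop once no center moved farther than tol.
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift <= tol:
            break
    # labels are the assignments made in the last iteration that ran.
    return centers, labels


if __name__ == "__main__":
    points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
    centers, labels = kmeans_sketch(points, k=2)
    print(centers)
    print(labels)
```

The tolerance test and the iteration cap cover the second and third stopping criteria from step (4); the first one (centers no longer changing) is the special case where the shift is exactly zero.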

    Python implementation:

    A small k-means example:

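    import numpy as np

    # 30 random 2-D samples, converted to float so the means are not truncated
    data = np.random.randint(1, 10, (30, 2)).astype(float)
    # number of clusters
    k = 4
    # initial centers: shuffle the samples and copy the first k
    # (a plain slice is a view, so updating it would overwrite data)
    np.random.shuffle(data)
    cent = data[0:k, :].copy()
    # distance[i, j] is the distance from sample i to center j
    distance = np.zeros((data.shape[0], k))
    # assignment from the previous iteration; -1 means "not assigned yet"
    last_near = np.full(data.shape[0], -1)
    n = 0
    while True:
        n = n + 1
        print(n)
        for i in range(data.shape[0]):
            for j in range(cent.shape[0]):
                dist = np.sqrt(np.sum((data[i] - cent[j]) ** 2))
                distance[i, j] = dist
        nearst = np.argmin(distance, axis=1)
        # stop when no sample changes cluster
        # (alternative: stop after a fixed number of iterations, e.g. n >= 1000)
        if (last_near == nearst).all():
            break
        # update each center to the mean of its samples; skip empty clusters
        for ele_cen in range(k):
            members = data[nearst == ele_cen]
            if len(members) > 0:
                cent[ele_cen] = np.mean(members, axis=0)
        last_near = nearst
    print(cent)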
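
The double loop that fills the distance matrix can also be written with NumPy broadcasting. The snippet below is a small sketch that checks the two forms agree; the fixed seed and the names `loop_dist` and `vec_dist` are assumptions introduced only for this comparison.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
data = rng.integers(1, 10, size=(30, 2)).astype(float)
cent = data[:4].copy()  # copy, so later updates to cent cannot touch data

# Loop version: one Euclidean distance per (sample, center) pair.
loop_dist = np.zeros((data.shape[0], cent.shape[0]))
for i in range(data.shape[0]):
    for j in range(cent.shape[0]):
        loop_dist[i, j] = np.sqrt(np.sum((data[i] - cent[j]) ** 2))

# Broadcast version: (30, 1, 2) - (1, 4, 2) -> (30, 4, 2),
# then the norm over the last axis gives a (30, 4) matrix.
vec_dist = np.linalg.norm(data[:, None, :] - cent[None, :, :], axis=2)

print(np.allclose(loop_dist, vec_dist))  # True
```

For 30 samples the difference is negligible, but the broadcast form avoids Python-level loops when the sample count grows.
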
    The next example replaces the distance measure with one suited to choosing anchor boxes for YOLOv3:

    import numpy as np
    
    
    def iou(box, clusters):
        """
        Calculates the Intersection over Union (IoU) between a box and k clusters.
        :param box: tuple or array, shifted to the origin (i. e. width and height)
        :param clusters: numpy array of shape (k, 2) where k is the number of clusters
        :return: numpy array of shape (k,) where k is the number of clusters
        """
        x = np.minimum(clusters[:, 0], box[0])
        y = np.minimum(clusters[:, 1], box[1])
        if np.count_nonzero(x == 0) > 0 or np.count_nonzero(y == 0) > 0:
            raise ValueError("Box has no area")
        intersection = x * y
        box_area = box[0] * box[1]
        cluster_area = clusters[:, 0] * clusters[:, 1]
        iou_ = intersection / (box_area + cluster_area - intersection)
        return iou_
    
    def kmeans(boxes, k, dist=np.median):
        """
        Calculates k-means clustering with the Intersection over Union (IoU) metric.
        :param boxes: numpy array of shape (r, 2), where r is the number of rows
        :param k: number of clusters
        :param dist: function used to recompute each cluster center from its member boxes (np.median by default)
        :return: numpy array of shape (k, 2)
        """
        rows = boxes.shape[0]
    
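        # distance matrix: rows is the number of samples, k the number of clusters;
        # distances[i, j] holds the distance from sample i to cluster center j
        distances = np.empty((rows, k))
        # cluster index assigned to each sample in the previous iteration
        last_clusters = np.zeros((rows,))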
    
        np.random.seed()
    
        # the Forgy method will fail if the whole array contains the same rows
        # pick k distinct samples at random as the initial cluster centers
        clusters = boxes[np.random.choice(rows, k, replace=False)]
    
        while True:
            for row in range(rows):
                # distance is 1 - IoU, the measure suited to YOLOv3 anchor box selection
                distances[row] = 1 - iou(boxes[row], clusters)
            # index of the nearest cluster for every sample
            nearest_clusters = np.argmin(distances, axis=1)
            # stop when no sample changes cluster
            if (last_clusters == nearest_clusters).all():
                break
            # update each center with dist (the median by default, not the mean) of its member boxes
            for cluster in range(k):
                clusters[cluster] = dist(boxes[nearest_clusters == cluster], axis=0)
            last_clusters = nearest_clusters
        return clusters
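
To make the 1 - IoU distance concrete, here is a small worked check. It restates the `iou` formula from the listing above (without the zero-area guard, for brevity) so that the snippet runs on its own; `avg_iou` is a hypothetical helper, not part of the original code, that scores a set of anchors by the mean best IoU over all boxes.

```python
import numpy as np


def iou(box, clusters):
    # Same formula as the listing above: every box is anchored at the origin,
    # so the overlap is min(width) * min(height).
    x = np.minimum(clusters[:, 0], box[0])
    y = np.minimum(clusters[:, 1], box[1])
    intersection = x * y
    box_area = box[0] * box[1]
    cluster_area = clusters[:, 0] * clusters[:, 1]
    return intersection / (box_area + cluster_area - intersection)


def avg_iou(boxes, clusters):
    # Hypothetical helper: mean over boxes of the best IoU with any cluster.
    return np.mean([np.max(iou(box, clusters)) for box in boxes])


# One 2x2 box against three candidate anchors (width, height).
anchors = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]])
print(iou((2.0, 2.0), anchors))      # IoU values: 0.25, 1.0, 0.25
print(1 - iou((2.0, 2.0), anchors))  # distances:  0.75, 0.0, 0.75

boxes = np.array([[2.0, 2.0], [1.0, 2.0]])
print(avg_iou(boxes, anchors))       # (1.0 + 0.5) / 2 = 0.75
```

A box identical to an anchor has distance 0, while boxes that are much smaller or much larger than an anchor approach distance 1, regardless of their absolute size. That scale independence is what plain Euclidean distance on width and height lacks.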

    Hopefully this gives you a few ideas if you have been puzzling over this!
  • Original post: https://www.cnblogs.com/zhibei/p/12053554.html