  • K-Means Clustering

    Clustering differs from classification in that a classifier is trained on samples whose classes are already known, making it supervised learning. A clustering algorithm needs no labeled training data; it groups samples into clusters by the similarity of their features, which is why clustering is also called unsupervised classification.

    K-means clustering is one of the simpler clustering algorithms:

    1. Pick k centroids in the feature space; each centroid represents one cluster.

    2. For each sample, compute its distance to every centroid and assign it to the cluster of the nearest centroid.

    3. For each cluster, compute the mean of all its samples and take it as the new centroid.

    4. Repeat steps 2 and 3 until no sample changes its cluster assignment.

    The distance in the algorithm above can be defined in different ways; the most common choice is the Euclidean distance:

    from numpy import *

    def distEclud(vecA, vecB):
        # Euclidean distance: square root of the sum of squared differences
        return sqrt(sum(power(vecA - vecB, 2)))
    
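    The distance function is pluggable through the distMethod parameter of kMeans below. As a sketch of an alternative, the Manhattan (L1) distance only needs the same two-vectors-in, scalar-out signature (distManhattan is an illustrative name, not part of the original):

    def distManhattan(vecA, vecB):
        # L1 distance: sum of absolute coordinate differences
        return sum(abs(vecA - vecB))
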

    The initial centroids can be picked at random within the bounds of the dataset:

    def randCent(dataSet, k):
        # draw k random points inside the bounding box of the dataset
        n = shape(dataSet)[1]
        centers = mat(zeros((k, n)))
        for j in range(n):
            minJ = min(dataSet[:, j])
            rangeJ = float(max(dataSet[:, j]) - minJ)
            # k values drawn uniformly from [minJ, minJ + rangeJ)
            centers[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
        return centers
    
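    A common alternative initialization (not used in the original code) is to pick k actual samples as the starting centroids, so every centroid begins on a real data point. A minimal sketch, assuming the numpy star-import above; randSampleCent is an illustrative name:

    def randSampleCent(dataSet, k):
        # choose k distinct rows of the dataset as the initial centroids
        idx = random.permutation(shape(dataSet)[0])[:k]
        return dataSet[idx, :].copy()

    Either function plugs into kMeans through its createCent parameter.
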

    The kMeans implementation:

    def kMeans(dataSet, k, distMethod=distEclud, createCent=randCent):
        m = shape(dataSet)[0]
        clusterAssess = mat(zeros((m, 2)))
        centers = createCent(dataSet, k)
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):  # for each sample
                # get closest center
                minDist = inf
                minIndex = -1
                for j in range(k):  # for each class
                    dist = distMethod(centers[j, :], dataSet[i, :])
                    if dist < minDist:
                        minDist = dist
                        minIndex = j
                if clusterAssess[i, 0] != minIndex:
                    clusterChanged = True
                clusterAssess[i, :] = minIndex, minDist ** 2
            # update centers (note: if a cluster ends up empty, mean() over
            # zero rows yields nan; production code should guard against this)
            for cent in range(k):
                ptsInClust = dataSet[nonzero(clusterAssess[:, 0].A == cent)[0]]
                centers[cent, :] = mean(ptsInClust, axis=0)
        return centers, clusterAssess
    

    centers holds the coordinates of all centroids; clusterAssess records, for each point, the index of its cluster and the squared distance to its centroid.
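
    A quick smoke test on a toy 2-D dataset (the data here is made up for illustration and assumes the numpy star-import above):

    random.seed(42)
    blob1 = random.randn(50, 2)          # points around (0, 0)
    blob2 = random.randn(50, 2) + 5      # points around (5, 5)
    data = mat(vstack((blob1, blob2)))
    centers, assess = kMeans(data, 2)
    print(centers)  # two rows, close to (0, 0) and (5, 5)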

    Define the sum of squared errors (SSE) as the sum over all samples of the squared distance to their centroid; the smaller the SSE, the better the clustering.
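
    Since kMeans already stores each point's squared distance in the second column of its assessment matrix, the SSE of the run above is just a column sum:

    sse = sum(assess[:, 1])  # total squared error over all points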

    K-means is easy to implement, but the number of clusters k has to be specified by hand, which is quite inconvenient in practice.

    Bisecting K-means is one way around this problem: it only needs a maximum number of clusters, and at each step it chooses the best split by itself:

    1. Treat the entire dataset as a single cluster.

    2. Use the kMeans algorithm to split one cluster into two.

    3. Pick the cluster whose split lowers the total error the most, split it, and repeat until enough clusters are obtained.

    Implementation:

    def binKMeans(dataSet, k, distMethod=distEclud):
        m = shape(dataSet)[0]
        clusterAssess = mat(zeros((m, 2)))
        originCenters = mean(dataSet, axis=0).tolist()[0]
        centers = [originCenters]
        # get origin error
        for j in range(m):
            clusterAssess[j, 1] = distMethod(mat(originCenters), dataSet[j, :]) ** 2
        # try to cluster
        while len(centers) < k:
            # try splitting each cluster; keep the split with the lowest total SSE
            minError = inf
            for i in range(len(centers)):
                ptsInCurrCluster = dataSet[nonzero(clusterAssess[:, 0].A == i)[0], :]
                splitCenter, splitAssess = kMeans(ptsInCurrCluster, 2, distMethod)
                splitError = sum(splitAssess[:, 1])
                formerError = sum(clusterAssess[nonzero(clusterAssess[:, 0].A != i)[0], 1])
                if (splitError + formerError) < minError:
                    bestCentToSplit = i
                    bestNewCents = splitCenter
                    bestClustAss = splitAssess.copy()
                    minError = splitError + formerError
            # update assessment: relabel the two halves; the '1' half gets a brand-new
            # cluster index, the '0' half keeps the index of the cluster that was split
            # (relabel the '1' half first so its points are not confused with cluster 0)
            bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centers)
            bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
            # update global centers and assessment
            centers[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
            centers.append(bestNewCents[1, :].tolist()[0])
            clusterAssess[nonzero(clusterAssess[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
        return mat(centers), clusterAssess  # return a matrix, consistent with kMeans
    
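    Usage mirrors kMeans; reusing the toy data from the smoke test above:

    centers, assess = binKMeans(data, 2)
    print(centers)            # centroids found by the best successive splits
    print(sum(assess[:, 1]))  # total SSE after splitting
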

    Complete source code
