  • Machine Learning in Action 5: k-means clustering: bisecting k-means + a geographic-location clustering example

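      k-means clustering is an unsupervised learning method; the number of cluster centers k must be supplied as input. It clusters by similarity: unlabeled instances that lie close to one another are grouped into the same cluster.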

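      I Classic k-means clustering

      Approach:

      1 Randomly create k centroids (k must be specified; for two-dimensional data it is easy to choose by plotting the data and reading k off the distribution);

      2 Go through every instance in the data set and compute its similarity to each centroid, which here is the Euclidean distance; assign each instance to the cluster of its nearest centroid, recording the result in a two-column array whose first column is the index of the nearest centroid and whose second column is the squared distance to it (the code stores minDist**2);

      3 Recompute the centroid of each cluster from the assignments stored in that array;

      4 Repeat steps 2 and 3 until convergence, i.e. until the centroids no longer change;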

    from numpy import *
    
    def loadDataSet(fileName):      # parse a tab-delimited file of floats
        dataMat = []                # one list of floats per input line
        with open(fileName) as fr:
            for line in fr:
                curLine = line.strip().split('\t')
                fltLine = list(map(float, curLine))  # list(...) so this also works on Python 3
                dataMat.append(fltLine)
        return dataMat
    
    def distEclud(vecA, vecB):
        return sqrt(sum(power(vecA - vecB, 2))) #la.norm(vecA-vecB)
    
    def randCent(dataSet, k):
        n = shape(dataSet)[1]
        centroids = mat(zeros((k,n)))#create centroid mat
        for j in range(n):#create random cluster centers, within bounds of each dimension
            minJ = min(dataSet[:,j]) 
            rangeJ = float(max(dataSet[:,j]) - minJ)
            centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
        return centroids
        
    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = shape(dataSet)[0]
        clusterAssment = mat(zeros((m,2)))#create mat to assign data points 
                                          #to a centroid, also holds SE of each point
        centroids = createCent(dataSet, k)
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):#for each data point assign it to the closest centroid
                minDist = inf; minIndex = -1
                for j in range(k):
                    distJI = distMeas(centroids[j,:],dataSet[i,:])
                    if distJI < minDist:
                        minDist = distJI; minIndex = j
                if clusterAssment[i,0] != minIndex: clusterChanged = True
                clusterAssment[i,:] = minIndex,minDist**2
            print(centroids)
            for cent in range(k):#recalculate centroids
                ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster
                if len(ptsInClust) > 0:#an empty cluster keeps its old centroid (its mean would be nan)
                    centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean
        return centroids, clusterAssment
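
      The second column of clusterAssment holds each point's squared distance to its assigned centroid, so summing that column gives the clustering's sum of squared errors (SSE). As a sketch, in notation assumed here rather than taken from the original (x_i is the i-th instance, c(i) the index of its assigned centroid, mu_j the j-th centroid, m the number of instances):

```latex
\mathrm{SSE} = \sum_{i=1}^{m} \left\lVert x_i - \mu_{c(i)} \right\rVert^{2}
```

      With the Euclidean distance, neither the assignment step nor the mean update can increase this quantity, which is why the loop settles down; it is also the value that biKmeans compares when choosing a split.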

      The major drawback of classic k-means is that it easily converges to a local optimum. Bisecting k-means is introduced to make this less likely.
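
      To make the local-optimum problem concrete, here is a minimal sketch that runs plain k-means twice on the same four points, changing only the starting centroids. It is an illustration under stated assumptions, not the code from this post: the function name lloyd, the toy points and both starting positions are made up for the example, and it uses only the standard library rather than NumPy.

```python
def lloyd(points, centroids, max_iter=100):
    """Plain k-means from a fixed start; returns (centroids, sse)."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:      # no assignment changed: converged
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(sqdist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, sse

# Four corners of a 4 x 1 rectangle: the natural clusters are the left and right pairs.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

_, sse_good = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])   # start near the left/right pairs
_, sse_bad = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])    # start on the bottom/top edges

print(sse_good, sse_bad)   # prints: 1.0 16.0
```

      Both runs stop because no assignment changes any more, yet the second one is stuck with the top/bottom split and an SSE sixteen times larger. Nothing in the loop itself can escape such a configuration, which is the motivation for the bisecting variant.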

      II Bisecting k-means clustering algorithm

      Bisecting k-means is built on top of classic k-means: it calls classic k-means with k=2 to split one cluster into two, and repeats until there are k clusters;

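      Detailed approach:

      1 Treat the whole data set as a single cluster and compute its centroid; store each instance's squared distance to that centroid in the same two-column array as before;

      2 Try a 2-means split on every existing cluster;

      3 Compute the error after each trial split and keep the split that gives the lowest total error (the SSE of the two new clusters plus the SSE of all clusters left unsplit);

      4 Repeat steps 2 and 3 until the number of clusters reaches k;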

    def biKmeans(dataSet, k, distMeas=distEclud):
        m = shape(dataSet)[0]
        clusterAssment = mat(zeros((m,2)))
        centroid0 = mean(dataSet, axis=0).tolist()[0]
        centList =[centroid0] #create a list with one centroid
        for j in range(m):#calc initial Error
            clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2
        while (len(centList) < k):
            lowestSSE = inf
            for i in range(len(centList)):
                ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]#get the data points currently in cluster i
                centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
                sseSplit = sum(splitClustAss[:,1])#compare the SSE to the currrent minimum
                sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])
                print("sseSplit, and notSplit: ", sseSplit, '--', sseNotSplit)
                if (sseSplit + sseNotSplit) < lowestSSE:
                    bestCentToSplit = i
                    bestNewCents = centroidMat
                    bestClustAss = splitClustAss.copy()
                    lowestSSE = sseSplit + sseNotSplit
            bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) #change 1 to 3,4, or whatever
            bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit
            print('the bestCentToSplit is: ', bestCentToSplit)
            print('the len of bestClustAss is: ', len(bestClustAss))
            centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0]#replace a centroid with two best centroids 
            centList.append(bestNewCents[1,:].tolist()[0])
            clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAss#reassign new clusters, and SSE
        return mat(centList), clusterAssment
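
      The rule that biKmeans applies inside its while loop (sseSplit + sseNotSplit, lowest total wins) can be looked at in isolation. The sketch below is only an illustration: choose_split is a hypothetical helper that does not exist in the post, and the SSE numbers are invented, not measured on any data set.

```python
def choose_split(cluster_sse, split_sse):
    """Pick the cluster whose 2-means split gives the lowest total SSE.

    cluster_sse[i]: current SSE of cluster i (left untouched if i is not split)
    split_sse[i]:   SSE of the two sub-clusters obtained by splitting cluster i
    """
    total_now = sum(cluster_sse)
    # Total after splitting i = SSE of the split part + SSE of everything else.
    totals = [split_sse[i] + (total_now - cluster_sse[i])
              for i in range(len(cluster_sse))]
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals

# Hypothetical values for three existing clusters.
cluster_sse = [10.0, 40.0, 5.0]
split_sse = [6.0, 12.0, 4.5]

best, totals = choose_split(cluster_sse, split_sse)
print(best, totals)   # prints: 1 [51.0, 27.0, 54.5]
```

      Cluster 1 is chosen because splitting it removes the most error overall, even though cluster 2 has the smallest split SSE on its own. That is why the comparison has to include sseNotSplit and not just sseSplit.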

      III Geographic-location clustering example

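      Latitude and longitude are conveniently two-dimensional, so the data can be plotted, which makes it easy to pick the number of centroids k by eye. Note that distances on a sphere cannot simply be computed with the Euclidean distance; a spherical distance formula is needed instead, see the function distSLC;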

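      The code takes the addresses of n clubs, uses the urllib package to call the Yahoo maps API for each address's latitude and longitude, runs the k-means implementation above (via biKmeans) to find the cluster centers, and finally plots the result with matplotlib;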

    import urllib.parse
    import urllib.request
    import json
    def geoGrab(stAddress, city):
        apiStem = 'http://where.yahooapis.com/geocode?'  #create a dict and constants for the geocoder
        params = {}
        params['flags'] = 'J'#JSON return type
        params['appid'] = 'aaa0VN6k'
        params['location'] = '%s %s' % (stAddress, city)
        url_params = urllib.parse.urlencode(params)
        yahooApi = apiStem + url_params      #print url_params
        print(yahooApi)
        c=urllib.request.urlopen(yahooApi)
        return json.loads(c.read())
    
    from time import sleep
    def massPlaceFind(fileName):
        fw = open('places.txt', 'w')
        for line in open(fileName).readlines():
            line = line.strip()
            lineArr = line.split('\t')
            retDict = geoGrab(lineArr[1], lineArr[2])
            if retDict['ResultSet']['Error'] == 0:
                lat = float(retDict['ResultSet']['Results'][0]['latitude'])
                lng = float(retDict['ResultSet']['Results'][0]['longitude'])
                print("%s\t%f\t%f" % (lineArr[0], lat, lng))
                fw.write('%s\t%f\t%f\n' % (line, lat, lng))
            else: print("error fetching")
            sleep(1)
        fw.close()
        
    def distSLC(vecA, vecB):#Spherical Law of Cosines
        a = sin(vecA[0,1]*pi/180) * sin(vecB[0,1]*pi/180)
        b = cos(vecA[0,1]*pi/180) * cos(vecB[0,1]*pi/180) * \
                          cos(pi * (vecB[0,0]-vecA[0,0]) /180)
        return arccos(a + b)*6371.0 #pi is imported with numpy
    
    import matplotlib
    import matplotlib.pyplot as plt
    def clusterClubs(numClust=5):
        datList = []
        for line in open('places.txt').readlines():
            lineArr = line.split('\t')
            datList.append([float(lineArr[4]), float(lineArr[3])])
        datMat = mat(datList)
        myCentroids, clustAssing = biKmeans(datMat, numClust, distMeas=distSLC)
        fig = plt.figure()
        rect=[0.1,0.1,0.8,0.8]
        scatterMarkers=['s', 'o', '^', '8', 'p', 
                        'd', 'v', 'h', '>', '<']
        axprops = dict(xticks=[], yticks=[])
        ax0=fig.add_axes(rect, label='ax0', **axprops)
        imgP = plt.imread('Portland.png')
        ax0.imshow(imgP)
        ax1=fig.add_axes(rect, label='ax1', frameon=False)
        for i in range(numClust):
            ptsInCurrCluster = datMat[nonzero(clustAssing[:,0].A==i)[0],:]
            markerStyle = scatterMarkers[i % len(scatterMarkers)]
            ax1.scatter(ptsInCurrCluster[:,0].flatten().A[0], ptsInCurrCluster[:,1].flatten().A[0], marker=markerStyle, s=90)
        ax1.scatter(myCentroids[:,0].flatten().A[0], myCentroids[:,1].flatten().A[0], marker='+', s=300)
        plt.show()
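
      The formula used by distSLC can be checked on its own with the standard library. The following is a sketch under assumptions: slc_distance_km is a hypothetical stand-alone helper (not part of the post), it takes plain floats in degrees instead of NumPy matrices, and the clamp before acos is an addition that the original distSLC does not have.

```python
from math import acos, cos, radians, sin

EARTH_RADIUS_KM = 6371.0  # same mean Earth radius as in distSLC

def slc_distance_km(lng_a, lat_a, lng_b, lat_b):
    """Great-circle distance via the spherical law of cosines (degrees in, km out)."""
    phi_a, phi_b = radians(lat_a), radians(lat_b)
    dlng = radians(lng_b - lng_a)
    c = sin(phi_a) * sin(phi_b) + cos(phi_a) * cos(phi_b) * cos(dlng)
    c = max(-1.0, min(1.0, c))  # guard: rounding can push c slightly outside [-1, 1]
    return acos(c) * EARTH_RADIUS_KM

# One degree of longitude, measured at two different latitudes.
equator = slc_distance_km(0.0, 0.0, 1.0, 0.0)
at_60n = slc_distance_km(0.0, 60.0, 1.0, 60.0)
print(round(equator, 1), round(at_60n, 1))
```

      The same one-degree step in longitude covers roughly 111 km at the equator but only about half of that at 60 degrees north. A Euclidean distance on raw degrees would rate both pairs as equally far apart, which is why the clustering passes distMeas=distSLC.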

      IV Summary

      Pros: easy to implement;

      Cons: may converge to a local minimum, and converges slowly on large data sets;

      Works with: numeric data

  • Original article: https://www.cnblogs.com/rongyux/p/5641825.html