  • Simple Hierarchical clustering in Python 2.7 using SciPy

    I've found that there isn't much useful information on how to do hierarchical clustering in SciPy, even though it is rather easy. First, organise your data as an array in which each row is an observation and each column is a dimension. Here's an example with nine observations, each with three dimensions.

    data = [[0.1, 0.1, 0.1],
            [0.1, 0.1, 0.1],
            [0.1, 0.1, 0.1],
            [0.2, 0.2, 0.2],
            [0.2, 0.2, 0.2],
            [0.2, 0.2, 0.2],
            [0.3, 0.3, 0.3],
            [0.3, 0.3, 0.3],
            [0.3, 0.3, 0.3]]
    

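    We need to create a distance matrix, that is, calculate the distance between each pair of observations. I'm using the default (Euclidean) distance metric; the SciPy documentation for spatial.distance.pdist gives more information on the different distance metrics you can use. Note that pdist returns the distances in condensed form: a flat array with one entry per pair of observations rather than a square matrix.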

    from scipy import spatial
    distance = spatial.distance.pdist(data)
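
    To make the condensed form concrete, here is a minimal pure-Python sketch of the same calculation. The helper names condensed_euclidean and condensed_index are made up for this illustration and are not part of SciPy; the index formula is the one given in the pdist documentation. It is written to run under both Python 2.7 and Python 3.

```python
import math
from itertools import combinations


def condensed_euclidean(observations):
    """Pairwise Euclidean distances in condensed order:
    (0,1), (0,2), ..., (0,n-1), (1,2), ..., (n-2,n-1)."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        for u, v in combinations(observations, 2)
    ]


def condensed_index(i, j, n):
    """Position of the (i, j) distance, with i < j, in the condensed list."""
    return n * i + j - ((i + 2) * (i + 1)) // 2


# Same nine observations as in the article.
data = [[v] * 3 for v in (0.1, 0.2, 0.3) for _ in range(3)]
dists = condensed_euclidean(data)
n = len(data)

print(len(dists))                       # one entry per pair: n * (n - 1) / 2
print(dists[condensed_index(0, 3, n)])  # distance between observations 0 and 3
```

    For real data pdist is the better choice, since it does the same work in compiled code.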
    

    Next, we need to calculate the linkage. I'm using the "complete" method here; the SciPy documentation has information on the other built-in methods. I'm using the fastcluster package to speed things up (its linkage function is a drop-in replacement for the one in SciPy's cluster.hierarchy module).

    import fastcluster
    linkage = fastcluster.linkage(distance,method="complete")
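
    If it helps to see what complete linkage actually does, the following is a deliberately naive sketch that produces rows in the same layout as a SciPy linkage matrix: the ids of the two clusters merged, the distance between them, and the size of the new cluster. The function name naive_complete_linkage and the four one-dimensional points are invented for this illustration. This is not how fastcluster or SciPy implement it, and ties may be broken differently.

```python
import math


def naive_complete_linkage(points):
    """Naive complete-linkage clustering, for illustration only.

    Returns rows [id_a, id_b, distance, size]. Original observations have
    ids 0..n-1, and the cluster created by row i gets id n + i.
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    n = len(points)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices
    rows = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Complete linkage: the distance between two clusters is the
                # largest distance between any pair of their members.
                d = max(dist(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        new_id = n + len(rows)
        clusters[new_id] = clusters.pop(a) + clusters.pop(b)
        rows.append([a, b, d, len(clusters[new_id])])
    return rows


# Four made-up one-dimensional observations forming two obvious groups.
points = [[0.0], [0.1], [1.0], [1.2]]
for row in naive_complete_linkage(points):
    print(row)
```

    The two close pairs are merged first, and the final row joins the two resulting clusters (ids 4 and 5) at the largest pairwise distance between them.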
    

    linkage is an array containing the instructions to merge clusters together, starting with each observation being its own cluster and ending with everything in one cluster. Each row records one merge, and its first two entries are the ids of the clusters being joined. SciPy's cluster.hierarchy.dendrogram function will plot this for you, but if you want the members when there are n clusters (say 3 in this case), you have to do the following.

    # Replay the merges recorded in linkage until clusternum clusters are left.
    clusternum = 3
    n = len(linkage) + 1  # number of original observations
    # Start with every observation in its own cluster, keyed by its index.
    clustdict = {i: [i] for i in xrange(n)}
    for i in xrange(n - clusternum):
        clust1 = int(linkage[i][0])
        clust2 = int(linkage[i][1])
        # The cluster created by row i of the linkage has id n + i.
        clustdict[n + i] = clustdict[clust1] + clustdict[clust2]
        del clustdict[clust1], clustdict[clust2]
    

    If we print clustdict, the keys are the cluster ids and the values are the members of each cluster (as indices into the initial data array).

    print clustdict
    # {10: [2, 0, 1], 12: [5, 3, 4], 14: [8, 6, 7]}
    

    Ta da! As we can see from the very synthetic data I supplied, the clustering works wonderfully. I've been doing this with 10,000 observations of 100-dimensional data, and it does the entire thing in about 10 seconds on a 2.3 GHz Intel Core i5.
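
    The merge loop can also be wrapped up as a small reusable function. The sketch below is one way to do that; the name clusters_at and the example rows are assumptions made for illustration. The rows are hand-written to mimic what a linkage for the nine-observation data set might look like (distances rounded) and are not the output of a real run. It runs under both Python 2.7 and Python 3. SciPy also ships cluster.hierarchy.fcluster, which with criterion='maxclust' returns a flat cluster label per observation and may be all you need.

```python
def clusters_at(linkage_rows, k):
    """Replay merges from a SciPy-style linkage until k clusters remain.

    linkage_rows: sequence of rows whose first two entries are the ids of
    the clusters merged at that step.
    Returns a dict mapping cluster id -> list of member indices.
    """
    n = len(linkage_rows) + 1  # a full linkage has n - 1 rows
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and %d" % n)
    members = {i: [i] for i in range(n)}
    for step in range(n - k):
        a = int(linkage_rows[step][0])
        b = int(linkage_rows[step][1])
        # The cluster formed at this step always gets id n + step.
        members[n + step] = members.pop(a) + members.pop(b)
    return members


# Hand-written rows for illustration: [id_a, id_b, distance, size].
example = [
    [0, 1, 0.0, 2], [2, 9, 0.0, 3],
    [3, 4, 0.0, 2], [5, 11, 0.0, 3],
    [6, 7, 0.0, 2], [8, 13, 0.0, 3],
    [10, 12, 0.1732, 6], [14, 15, 0.3464, 9],
]
print(clusters_at(example, 3))
```

    Using n + step for the new id avoids scanning the dictionary for its largest key on every merge, which starts to matter once there are thousands of observations.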

  • Original article: https://www.cnblogs.com/lexus/p/2815777.html