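  • Linkage criteria for hierarchical clustering

    Motivation

    I wrote this post after reading a blog article that introduced clustering. Its section on hierarchical clustering mentioned the linkage criterion, which that article rendered in Chinese as 连接标准. I had rarely used hierarchical clustering, so the concept was unfamiliar. I looked it up and summarize the key points here; most of the material comes from Wikipedia, Quora, and the scikit-learn documentation.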

    Linkage criteria

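    Wikipedia defines it as follows: The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

    In other words, the linkage criterion determines how the distance between two clusters is measured and computed, as a function of the pairwise distances between their members.

    Wikipedia lists 10 ways to measure this distance: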

    1. Maximum or complete-linkage clustering
    2. Minimum or single-linkage clustering
    3. Mean or average linkage clustering, or UPGMA
    4. Centroid linkage clustering, or UPGMC
    5. Minimum energy clustering
    6. The sum of all intra-cluster variance.
    7. The decrease in variance for the cluster being merged (Ward's criterion).
    8. The probability that candidate clusters spawn from the same distribution function (V-linkage).
    9. The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).
    10. The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.
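
    To make the first three items concrete, here is a minimal sketch of complete, single, and average linkage written directly as functions of the pairwise distances. The function names and the toy points are made up for illustration, and Euclidean distance is an assumption; the definition itself allows any pairwise distance.

```python
from itertools import product
from math import dist  # Euclidean distance, available in Python 3.8+


def pairwise_distances(a, b):
    """All distances d(x, y) with x taken from cluster a and y from cluster b."""
    return [dist(x, y) for x, y in product(a, b)]


def complete_linkage(a, b):
    """Maximum pairwise distance between the two clusters."""
    return max(pairwise_distances(a, b))


def single_linkage(a, b):
    """Minimum pairwise distance between the two clusters."""
    return min(pairwise_distances(a, b))


def average_linkage(a, b):
    """Mean of all pairwise distances between the two clusters."""
    d = pairwise_distances(a, b)
    return sum(d) / len(d)


# Two toy clusters on a line (illustrative data only).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 2.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(average_linkage(cluster_a, cluster_b))   # 3.5
```

    The same two clusters are 2.0, 3.5, or 5.0 apart depending on the criterion, which is why the choice of linkage changes which clusters get merged first.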

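    That is too many criteria to discuss one by one: several of them involve fairly complicated mathematics, and they are rarely used in practice.

    Which linkage criterion to use

    Someone asked on Quora: What is the best linkage criterion for hierarchical cluster analysis?

    At the time of writing, one answer, from an MIT PhD, says that many people have run experiments on this question and that there are many papers on it. The conclusion reported in that answer is that average linkage is the most effective and should be the first choice for hierarchical clustering, while single linkage performs worst.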

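    Linkage criteria in sklearn

    This section focuses on the three criteria that sklearn provided at the time of writing: ward, complete, and average (see the documentation of sklearn.cluster.AgglomerativeClustering for details). sklearn defines them as follows: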

    • ward minimizes the variance of the clusters being merged.
    • average uses the average of the distances of each observation of the two sets.
    • complete or maximum linkage uses the maximum distances between all observations of the two sets.
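
    As a rough illustration, the three criteria can be selected through the linkage parameter of sklearn.cluster.AgglomerativeClustering. The data below is made up, and the sketch relies on the default Euclidean distance, which ward requires.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated toy groups of points (illustrative data only).
X = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

labels_by_linkage = {}
for linkage in ("ward", "complete", "average"):
    # Only the linkage criterion changes between runs.
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels_by_linkage[linkage] = model.fit_predict(X)
    print(linkage, labels_by_linkage[linkage])
```

    On data this cleanly separated all three criteria recover the same two groups; differences between them show up on noisier or elongated clusters.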

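    The second and third are fairly easy to understand; they correspond to the third and first items in the Wikipedia list. The definition of ward mentions variance, which makes it harder to grasp.

    The Wikipedia article on Ward's method says: Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters that leads to minimum increase in total within-cluster variance after merging.

    My understanding is this: at the start every point is a cluster of its own, so each cluster's variance is 0 and the total variance is 0 as well. Every merge increases the total variance, so at each step we choose the pair of clusters whose merge leaves the total variance smallest, that is, increases it the least.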
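
    One step of this selection can be sketched as follows. The sketch assumes that "variance" means the within-cluster sum of squared distances to the centroid, which is the usual formulation of Ward's criterion; the helper names and the data are made up for illustration, and this is not how sklearn implements it internally.

```python
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def merge_cost(a, b):
    """Increase in total within-cluster SSE if clusters a and b are merged."""
    return sse(a + b) - sse(a) - sse(b)


def best_merge(clusters):
    """Indices (i, j) of the pair whose merge increases total SSE the least."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
    )


# Every point starts as its own cluster, so the total SSE is 0 (toy data).
clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(4.0, 0.0)], [(9.0, 0.0)]]
print(sum(sse(c) for c in clusters))  # 0.0

i, j = best_merge(clusters)
print(i, j, merge_cost(clusters[i], clusters[j]))  # 0 1 0.5
```

    The two closest points are merged first because that merge raises the total by only 0.5; repeating the step on the updated list of clusters gives the full hierarchy.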

  • Original post: https://www.cnblogs.com/-Sai-/p/6666523.html