Linkage Criteria in Hierarchical Clustering

Background

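I wrote this post after reading a blog article that introduced clustering. In its section on hierarchical clustering it mentioned the linkage criterion, which it translated into Chinese as 连接标准. I have rarely used hierarchical clustering, so the concept was unfamiliar to me. I looked it up and am summarizing what I found here; most of it comes from Wikipedia, Quora, and the scikit-learn documentation.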

Linkage criteria

Wikipedia defines it as follows: The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

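Put differently, the linkage criterion defines the distance function between two clusters. How the distance between two clusters is measured and computed, starting from the pairwise distances between their members, is decided by the linkage criterion.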
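
As a minimal sketch of that definition, the snippet below computes the distance between two small clusters under three common criteria. The helper names (pairwise_distances, cluster_distance) and the toy points are made up for illustration, and Euclidean distance is assumed as the pairwise distance.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distance between every point in a and every point in b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def cluster_distance(a, b, criterion="average"):
    """Distance between clusters a and b under the given linkage criterion."""
    d = pairwise_distances(a, b)
    if criterion == "single":      # minimum pairwise distance
        return d.min()
    if criterion == "complete":    # maximum pairwise distance
        return d.max()
    if criterion == "average":     # mean of all pairwise distances
        return d.mean()
    raise ValueError(f"unknown criterion: {criterion}")


# Hypothetical toy clusters on a line, chosen so the results are easy to check.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

for name in ("single", "complete", "average"):
    print(name, cluster_distance(cluster_a, cluster_b, name))
# single 2.0, complete 5.0, average 3.5
```

The same two clusters are 2.0, 5.0, or 3.5 apart depending on the criterion, which is why the choice can change which clusters get merged first.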

Wikipedia lists 10 ways to measure this distance:
1. Maximum or complete-linkage clustering
2. Minimum or single-linkage clustering
3. Mean or average linkage clustering, or UPGMA
4. Centroid linkage clustering, or UPGMC
5. Minimum energy clustering
6. The sum of all intra-cluster variance.
7. The decrease in variance for the cluster being merged (Ward's criterion).
8. The probability that candidate clusters spawn from the same distribution function (V-linkage).
9. The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).
10. The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.
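That is too many criteria to discuss one by one: several of them involve fairly complicated mathematics, and they are rarely used in practice anyway.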
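
For the more common criteria in the list, SciPy's scipy.cluster.hierarchy.linkage can build the whole hierarchy directly. The sketch below is not from the original post; it runs three of the methods on four made-up one-dimensional points and records the height of the final merge, which is the distance between the last two clusters under each criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: four points on a line, one feature each.
points = np.array([[0.0], [1.0], [4.0], [6.0]])

final_heights = {}
for method in ("single", "complete", "average"):
    # Each row of the result describes one merge:
    # [cluster index, cluster index, merge distance, size of new cluster].
    merges = linkage(points, method=method)
    final_heights[method] = merges[-1, 2]

print(final_heights)
# single: 3.0, complete: 6.0, average: 4.5
```

In all three runs the points merge as {0, 1} and {4, 6} first; only the reported distance between those two groups differs.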

Which linkage criterion to use

Someone asked on Quora: What is the best linkage criterion for hierarchical cluster analysis?

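The current answer, from an MIT PhD, says that many people have run experiments on this question and that there are many related papers. According to that answer, the conclusion is that average linkage is the most effective and should be the first choice for hierarchical clustering, while single linkage performs worst.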

Linkage criteria in sklearn

This section focuses on the three criteria that sklearn provides: ward, complete, and average (see the documentation of sklearn.cluster.AgglomerativeClustering for details). sklearn defines them as follows:
- ward minimizes the variance of the clusters being merged.
- average uses the average of the distances of each observation of the two sets.
- complete or maximum linkage uses the maximum distances between all observations of the two sets.
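The second and third are fairly easy to understand; they correspond to items 3 and 1 in the Wikipedia list. The definition of ward mentions variance, which makes it harder to grasp.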
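
A short usage sketch of how the criterion is selected in sklearn: the linkage parameter of AgglomerativeClustering takes the criterion name as a string. The data below is made up (two well-separated groups of three points), so all three criteria are expected to find the same split. Newer scikit-learn releases also accept "single".

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two tight, well-separated groups.
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

results = {}
for criterion in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=criterion)
    # fit_predict returns one cluster label per row of X.
    results[criterion] = model.fit_predict(X)

for criterion, labels in results.items():
    print(criterion, labels)
```

The label numbers themselves (0 or 1) are arbitrary; what matters is which points share a label.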

The Wikipedia article on Ward's method contains this passage: Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters that leads to minimum increase in total within-cluster variance after merging.

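My understanding is this: at the start every point is a cluster of its own, so every cluster has variance 0 and the total variance is 0 as well. Every merge makes the total variance larger, so at each step we pick the pair of clusters whose merge leaves the total variance smallest, that is, the merge with the smallest increase.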
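
The reasoning above can be sketched for a single merge step. This is an illustration only, not how sklearn implements ward: it assumes "variance" is measured as the within-cluster sum of squared distances to the centroid, and the helper names (within_ss, ward_increase) and the three starting points are made up.

```python
from itertools import combinations

import numpy as np


def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()


def ward_increase(a, b):
    """How much the total within-cluster sum of squares grows if a and b merge."""
    merged = np.vstack([a, b])
    return within_ss(merged) - within_ss(a) - within_ss(b)


# Every point starts as its own cluster, so each within_ss is 0.
clusters = [np.array([[0.0]]), np.array([[1.0]]), np.array([[5.0]])]

costs = {
    (i, j): ward_increase(clusters[i], clusters[j])
    for i, j in combinations(range(len(clusters)), 2)
}
best_pair = min(costs, key=costs.get)

print(costs)      # (0, 1): 0.5, (0, 2): 12.5, (1, 2): 8.0
print(best_pair)  # (0, 1)
```

Merging the points at 0 and 1 raises the total by only 0.5, so that pair is merged first; repeating the same step on the updated cluster list would give the full hierarchy.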

Original post: https://www.cnblogs.com/-Sai-/p/6666523.html
