Choosing the Value of K in K-means

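The main methods for choosing the number of cluster centers in clustering algorithms such as K-means are listed below (based on the Wikipedia article):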

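1. The simplest rule of thumb: K≈sqrt(N/2)

2. Elbow method: plot the F-test value of the clustering result (the ratio of between-group variance to total variance) against the number of clusters, and pick the elbow of the curve

3. Information criterion methods: if the model has a likelihood function (as a GMM does), decide with a criterion such as BIC or DIC; even when there is no likelihood function, as with K-means, a surrogate likelihood can be constructed, for example by substituting a GMM

4. Information-theoretic method (jump method): compute a distortion function as a curve over K and pick the K at which the curve jumps

5. Silhouette method

6. Cross-validation

7. Specifically for text: if the term-frequency matrix has dimensions m*n, of which t entries are non-zero, then K≈m*n/t

8. Kernel method: build the kernel matrix, take its eigenvalue decomposition, compute a compactness statistic from the result to obtain a compactness-versus-K curve, and pick the elbow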
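
Methods 1, 2, 5 and 7 above are simple enough to try on toy data. The following is a minimal sketch in plain Python, not a reference implementation: the helper names (`rule_of_thumb_k`, `text_rule_k`, `farthest_first`, `kmeans`, `silhouette`), the synthetic three-blob data set, and the deterministic farthest-first seeding are all assumptions made for illustration. It computes the between-group/total variance ratio and the mean silhouette width for K = 2..6.

```python
import math
import random

def rule_of_thumb_k(n_points):
    """Method 1: K ≈ sqrt(N / 2), rounded to an integer."""
    return max(1, round(math.sqrt(n_points / 2)))

def text_rule_k(m, n, t):
    """Method 7: K ≈ m * n / t for an m x n term matrix with t non-zero entries."""
    return max(1, round(m * n / t))

def farthest_first(points, k):
    """Deterministic seeding (an assumption): start at points[0], then keep
    adding the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns (labels, within-cluster sum of squares)."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
    return labels, sse

def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # Mean distance to the other members of p's own cluster (p adds 0 to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Toy data: three well-separated Gaussian blobs of 20 points each.
rng = random.Random(0)
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(20)]

_, total_ss = kmeans(points, 1)  # with K = 1 the within-cluster SS is the total SS
explained, sil = {}, {}
for k in range(2, 7):
    labels, sse = kmeans(points, k)
    explained[k] = 1 - sse / total_ss  # between-group / total variance (method 2)
    sil[k] = silhouette(points, labels)  # method 5

best_k = max(sil, key=sil.get)  # K with the largest mean silhouette width
for k in sorted(sil):
    print(k, round(explained[k], 3), round(sil[k], 3))
print("rule of thumb:", rule_of_thumb_k(len(points)), "silhouette choice:", best_k)
```

For the elbow method the `explained` values would normally be plotted against K and read by eye; the silhouette criterion gives a single number per K, so the largest value can be picked automatically.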

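As for choosing the initial points, the usual approach is to pick points in the data cloud that lie far from one another. SPSS, for example, defines two rules for finding such points: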

First pick K initial points at random, then for each remaining point (case):
- a) If the case is farther from its closest centre than the distance between the two centres that are closest to each other, the case replaces whichever of those two centres it is closer to.
- b) If the case is farther from its second-closest centre than the distance between its closest centre and the centre nearest to that one, the case replaces its closest centre.
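
The two rules can be sketched roughly as follows. This is an illustrative reading of the rules as stated here, not SPSS's actual code: the function name `spss_style_init` is hypothetical, the first K points stand in for the random initial pick so that the run is repeatable, and rule b) is assumed to be tried only when rule a) did not fire.

```python
import math
from itertools import combinations

def spss_style_init(points, k):
    """Pick k initial centres by scanning the data once (requires k >= 2)."""
    # Assumption: use the first k points instead of a random sample.
    centers = list(points[:k])
    for x in points[k:]:
        # Indices of the centres ordered by distance from the case x.
        order = sorted(range(k), key=lambda j: math.dist(x, centers[j]))
        nearest, second = order[0], order[1]
        # The pair of centres that are closest to each other.
        i, j = min(combinations(range(k), 2),
                   key=lambda ij: math.dist(centers[ij[0]], centers[ij[1]]))
        if math.dist(x, centers[nearest]) > math.dist(centers[i], centers[j]):
            # Rule a): x replaces whichever centre of that pair it is closer to.
            drop = i if math.dist(x, centers[i]) < math.dist(x, centers[j]) else j
            centers[drop] = x
        else:
            # Distance from x's closest centre to the centre nearest to it.
            gap = min(math.dist(centers[nearest], centers[m])
                      for m in range(k) if m != nearest)
            if math.dist(x, centers[second]) > gap:
                # Rule b): x replaces its own closest centre.
                centers[nearest] = x
    return centers

# Three starting points bunched together, followed by two distant cases.
points = [(0, 0), (1, 0), (3, 0), (10, 0), (0, 10)]
centers = spss_style_init(points, 3)
print(centers)
```

Each replacement swaps out one member of a tightly packed pair of centres for a more distant case, which is how the rules push the initial centres apart.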

Original article: https://www.cnblogs.com/washa/p/4027284.html