zoukankan      html  css  js  c++  java
  • ISLR第10章 Unsupervised Learning

    Unsupervised learning, a set of statistical tools intended for the setting in which we have only a set of features
    X1,X2, . . . , Xp measured on n observations. We are not interested in prediction, because we do not have an

    associated response variable Y .

    下面主要讲了10.1主成分分析和10.2聚类分析。

    10.1 Principal Components Analysis(主成分分析PCA)

    PCA is an unsupervised approach, since it involves only a set of features X1,X2, . . . , Xp,  and no associated response Y .

    Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization

    (visualization of the observations or visualization of the variables).


    10.1.1 What Are Principal Components?

    The idea is that each of the n observations lives in p-dimensional space, but not all of these dimensions are equally interesting.

    PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by

    the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination

    of the p features.

    The  second principal component scores z12, z22, . . . , zn2 take the form

             

    biplot(双标图):

            

              

    主成分荷载向量定义为变量空间上数据变异程度最大的方向,而将主成分得分定义为在这些方向上的投影。

    主成分的另一种解释:主成分提供了一个与观测最为接近的低维线性空间。

    变量的标准化:在进行PCA之前,变量应该标准化使均值为零,标准差为1。

    主成分的唯一性:在不考虑符号变化的情况下,每个主成分载荷向量都是唯一的。

    方差的解释比例(PVE): 

    In total, there are min(n − 1, p) principal components, and their PVEs sum to one.

    Deciding How Many Principal Components to Use:

    方法:scree plot(碎石图)

    10.2聚类分析

    数据聚类则是将本没有类别参考的数据进行分析并划分为不同的组,即从这些数据导出类标号。

    分为K均值聚类和系统聚类法

    10.2.1 K均值聚类

     K均值聚类是一种把数据集分成K个不重复类的简单快捷的方法。在进行K均值聚类时,必须首先确定K的值。

     

    The idea behind K-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.

    The within-cluster variation   for cluster Ck is a measure W(Ck) of the amount by which the observations within a cluster differ

    from each other. Hence we want to solve the problem:

                                   

    类内差异(W(Ck)):对第Ck类中观测互不相同程度的度量

    定义为:

                               

    结合10.9和10.10,K-means聚类的最优化问题可以定义为:

                            

  • 相关阅读:
    img标签中alt属性与title属性在seo的作用-摘自网友
    C# 从补码中获取有符号数的实际数值
    you need to load the kernel first
    桌面远程访问
    供应商通过向日葵访问公司外网办公电脑,通过办公电脑访问内网内生产用电脑
    配置交换机口可以上外网
    抠图和不失真的改变图形大小
    用机房现有双网口电脑添加监控
    服务器配置IP
    在DELL服务器上安装windows2012 r2服务器系统
  • 原文地址:https://www.cnblogs.com/weiququ/p/8647228.html
Copyright © 2011-2022 走看看