  • PCA in MLlib

    SVD decomposition: \(A = U \Sigma V^T\); transform: \(\hat{A} = A \cdot V = U \Sigma\)
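    The code later in this post runs the SVD on the small n-by-n matrix rather than on \(A\) itself. The short derivation below sketches why that gives the same \(V\). It assumes \(A\) stands for the (centered) data matrix and uses only the definitions above; sign and ordering ambiguities of singular vectors are ignored.

```latex
\begin{align*}
A &= U \Sigma V^T, \qquad U^T U = I,\quad V^T V = I \\
A^T A &= V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T \\
\hat{A} &= A \cdot V = U \Sigma V^T V = U \Sigma
\end{align*}
```

    So the columns of \(V\) are eigenvectors of \(A^T A\), with eigenvalues \(\sigma_i^2\). Because \(A^T A\) is symmetric and positive semi-definite, its own SVD has left singular vectors equal to \(V\) and singular values equal to \(\sigma_i^2\), which is what the variables `u` and `s` hold in the code.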


      /**
       * Computes the top k principal components and a vector of proportions of
       * variance explained by each principal component.
       * Rows correspond to observations and columns correspond to variables.
       * The principal components are stored as a local matrix of size n-by-k.
       * Each column corresponds to one principal component,
       * and the columns are in descending order of component variance.
       * The row data do not need to be "centered" first; it is not necessary for
       * the mean of each column to be 0.
       *
       * @param k number of top principal components.
       * @return a matrix of size n-by-k, whose columns are principal components, and
       * a vector of values which indicate how much variance each principal component
       * explains
       * @note This cannot be computed on matrices with more than 65535 columns.
       */
      def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
        val n = numCols().toInt
        require(k > 0 && k <= n, s"k = $k out of range (0, n = $n]")
        // Spark computes A^T A (here, the n-by-n covariance matrix) in a distributed fashion.
        val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]]
        // Breeze computes the SVD of the small covariance matrix locally.
        val brzSvd.SVD(u: BDM[Double], s: BDV[Double], _) = brzSvd(Cov)
        // Normalize the explained variance into ratios.
        val eigenSum = s.data.sum
        val explainedVariance = s.data.map(_ / eigenSum)
        // Return the first k columns of U and the first k ratios.
        if (k == n) {
          (Matrices.dense(n, k, u.data), Vectors.dense(explainedVariance))
        } else {
          (Matrices.dense(n, k, Arrays.copyOfRange(u.data, 0, n * k)),
            Vectors.dense(Arrays.copyOfRange(explainedVariance, 0, k)))
        }
      }
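    The doc comment says the rows do not need to be centered. One way to see why is that a sample covariance matrix can be obtained from the uncentered Gramian and the column means: \(\mathrm{Cov}_{ij} = (r_{ij} - m \mu_i \mu_j) / (m - 1)\). The plain-Scala sketch below illustrates that identity on a local array. `CovarianceSketch` and its method are hypothetical names for illustration; this is not Spark's `computeCovariance`, whose implementation may differ between versions.

```scala
object CovarianceSketch {
  // Sample covariance from the uncentered Gramian and the column means.
  // Assumes `a` has at least two rows and all rows have the same length.
  def covariance(a: Array[Array[Double]]): Array[Array[Double]] = {
    val m = a.length
    val n = a.head.length
    // Column means mu_j.
    val mean = Array.tabulate(n)(j => a.map(_(j)).sum / m)
    // Uncentered Gramian r_ij = sum_k a_ki * a_kj.
    val gram = Array.tabulate(n, n)((i, j) => a.map(r => r(i) * r(j)).sum)
    // Cov_ij = (r_ij - m * mu_i * mu_j) / (m - 1).
    Array.tabulate(n, n)((i, j) => (gram(i)(j) - m * mean(i) * mean(j)) / (m - 1))
  }

  def main(args: Array[String]): Unit = {
    val a = Array(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
    println(covariance(a).map(_.mkString(" ")).mkString("\n"))
  }
}
```

    The data is read once to accumulate sums, and no centered copy of the rows is ever materialized, which matters when the number of rows is large.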


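    Here \(\dim(A) = m \times n\). In big-data settings \(m\) is very large, while \(n\) is usually modest. The result \(R = A^T A\) is only \(n \times n\), so it is not very large either; the cost of running the PCA decomposition on \(R\) is manageable, and a single-threaded local computation suffices.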

    \[
    \begin{align*}
    \text{Compute } A^T A &: \\
    & r_{ij} = \sum_{k=1}^{m} a_{ki} \cdot a_{kj}, \text{ where } i, j \in 1, \dots, n \\
    \text{So, } & R = \sum_{k=1}^{m} \vec{a}_k^T \vec{a}_k, \text{ where } \vec{a}_k = [a_{k1}, \dots, a_{kn}] \text{ is the } k^{th} \text{ row}
    \end{align*}
    \]
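    To make the row-wise form concrete, here is a minimal plain-Scala sketch that computes \(R\) both entry by entry and as a sum of per-row outer products. `GramianSketch` and its methods are hypothetical names for illustration and run on a local array, not on a distributed matrix.

```scala
object GramianSketch {
  // Entry-wise definition: r_ij = sum over rows k of a_ki * a_kj.
  def gramianByEntries(a: Array[Array[Double]]): Array[Array[Double]] = {
    val n = a.head.length
    Array.tabulate(n, n)((i, j) => a.map(row => row(i) * row(j)).sum)
  }

  // Row-wise form: R = sum over rows k of the outer product a_k^T a_k.
  // Each row contributes independently, so rows can be processed in any order.
  def gramianByRows(a: Array[Array[Double]]): Array[Array[Double]] = {
    val n = a.head.length
    val r = Array.ofDim[Double](n, n)
    for (row <- a; i <- 0 until n; j <- 0 until n) {
      r(i)(j) += row(i) * row(j)
    }
    r
  }

  def main(args: Array[String]): Unit = {
    val a = Array(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
    println(gramianByRows(a).map(_.mkString(" ")).mkString("\n"))
  }
}
```

    The row-wise form is the one that distributes well: each partition can sum the outer products of its own rows, and the partial n-by-n sums are then added together.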


    /**
     * Computes the Gramian matrix `A^T A`.
     *
     * @note This cannot be computed on matrices with more than 65535 columns.
     */
    def computeGramianMatrix(): Matrix = {
      val n = numCols().toInt
      checkNumColumns(n)
      // Computes n*(n+1)/2, avoiding overflow in the multiplication.
      // This succeeds when n <= 65535, which is checked above
      val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
      // Compute the upper triangular part of the gram matrix.
      val GU = rows.treeAggregate(new BDV[Double](nt))(
        seqOp = (U, v) => {
          // Rank-1 update of the packed upper triangle: U += v * v^T
          BLAS.spr(1.0, v, U.data)
          U
        }, combOp = (U1, U2) => U1 += U2)
      RowMatrix.triuToFull(n, GU.data)
    }
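    The excerpt stores only the upper triangle of the symmetric Gramian, in a packed array of length n(n+1)/2, and expands it at the end. The plain-Scala sketch below mimics that idea locally. `PackedGramianSketch` and its methods are hypothetical stand-ins written for this illustration (a `foldLeft` replaces `treeAggregate`, and the local `spr` and `triuToFull` assume column-major packed storage); they are not the Spark or BLAS implementations.

```scala
object PackedGramianSketch {
  // n*(n+1)/2 computed by halving the even factor first, so the
  // intermediate product stays within Int range for n <= 65535.
  def packedSize(n: Int): Int =
    if (n % 2 == 0) (n / 2) * (n + 1) else n * ((n + 1) / 2)

  // Rank-1 update u += alpha * v * v^T on the packed upper triangle,
  // stored column by column: (0,0), (0,1), (1,1), (0,2), ...
  def spr(alpha: Double, v: Array[Double], u: Array[Double]): Unit = {
    var idx = 0
    for (j <- v.indices; i <- 0 to j) {
      u(idx) += alpha * v(i) * v(j)
      idx += 1
    }
  }

  // Expand the packed upper triangle into a full symmetric matrix.
  def triuToFull(n: Int, u: Array[Double]): Array[Array[Double]] = {
    val g = Array.ofDim[Double](n, n)
    var idx = 0
    for (j <- 0 until n; i <- 0 to j) {
      g(i)(j) = u(idx)
      g(j)(i) = u(idx)
      idx += 1
    }
    g
  }

  // Local stand-in for the aggregation: fold every row into one packed buffer.
  def gramian(rows: Seq[Array[Double]], n: Int): Array[Array[Double]] = {
    val packed = rows.foldLeft(new Array[Double](packedSize(n))) { (u, v) =>
      spr(1.0, v, u)
      u // the fold function must return the accumulator
    }
    triuToFull(n, packed)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
    println(gramian(rows, 2).map(_.mkString(" ")).mkString("\n"))
  }
}
```

    Packing roughly halves the size of the buffer that each partition has to ship during aggregation, and the 65535-column limit follows from the packed length having to fit in an `Int`.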



        val brzSvd.SVD(u: BDM[Double], s: BDV[Double], _) = brzSvd(Cov)
        // Normalize the explained variance into ratios.
        val eigenSum = s.data.sum
        val explainedVariance = s.data.map(_ / eigenSum)
        if (k == n) {
          (Matrices.dense(n, k, u.data), Vectors.dense(explainedVariance))
        } else {
          (Matrices.dense(n, k, Arrays.copyOfRange(u.data, 0, n * k)),
            Vectors.dense(Arrays.copyOfRange(explainedVariance, 0, k)))
        }

    Explained Variance Ratio

    The explained variance ratio of a principal component is the proportion of the dataset's
    total variance that lies along the axis of that component.
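    As a small illustration of the normalization, the plain-Scala sketch below divides each eigenvalue by the sum of all of them and then keeps the first k. `ExplainedVarianceSketch` is a hypothetical name, and the input values are made up for the example, not taken from any dataset.

```scala
object ExplainedVarianceSketch {
  // Ratios are normalized over ALL eigenvalues, then truncated to the top k.
  // Assumes `eigenvalues` is sorted in descending order and sums to a positive value.
  def ratios(eigenvalues: Array[Double], k: Int): Array[Double] = {
    require(k > 0 && k <= eigenvalues.length, s"k = $k out of range")
    val total = eigenvalues.sum
    eigenvalues.map(_ / total).take(k)
  }

  def main(args: Array[String]): Unit = {
    println(ratios(Array(4.0, 3.0, 1.0), 2).mkString(", "))
  }
}
```

    Because the sum runs over all n eigenvalues, the k returned ratios add up to 1 only when k equals n; otherwise their sum is the share of variance retained by the top k components.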

  • Original post: https://www.cnblogs.com/luweiseu/p/7825826.html