  • MLlib documentation notes 1

    spark.mllib contains the original API built on top of RDDs.

    spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

    MLlib supports local vectors and matrices stored on a single machine.

    1. Data types

    1) Local vector

    Sparse vector: backed by two parallel arrays holding the indices and values of the non-zero entries

    Dense vector: backed by a double array holding every entry value

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    // Create a dense vector (1.0, 0.0, 3.0).
    Vector dv = Vectors.dense(1.0, 0.0, 3.0);
    // Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
    Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});

    2) Labeled point

    A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification.
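
    For example, labeled points can be created from either a dense or a sparse feature vector; here is a brief sketch reusing the Vectors factory shown above:

    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    // Create a labeled point with a positive label and a dense feature vector.
    LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));

    // Create a labeled point with a negative label and a sparse feature vector.
    LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));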

    Sparse data

    label index1:value1 index2:value2 ...

    where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
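
    For instance, a line such as the following (a made-up record, not taken from the bundled sample file) encodes label 1.0 with non-zero features 29.0 and 48.0 at one-based indices 1 and 3, which become indices 0 and 2 after loading:

    1.0 1:29.0 3:48.0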

    Example:

    MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.

    Refer to the MLUtils Java docs for details on the API.

    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.util.MLUtils;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<LabeledPoint> examples =
      MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();
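
    As a quick sanity check, here is a minimal sketch, assuming jsc is an existing JavaSparkContext and reusing the examples RDD from above:

    // Count the parsed records and inspect the first one.
    long numExamples = examples.count();
    LabeledPoint first = examples.first();
    System.out.println("count = " + numExamples
        + ", label = " + first.label()
        + ", numFeatures = " + first.features().size());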

    3) Local matrix

    Dense matrix: the entry values are stored in a single double array in column-major order, together with the matrix size (number of rows and columns).

    Sparse matrix: the non-zero entries are stored in Compressed Sparse Column (CSC) format, also in column-major order.

    Example:

    import org.apache.spark.mllib.linalg.Matrix;
    import org.apache.spark.mllib.linalg.Matrices;

    // Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
    // Matrices.dense(numRows, numCols, values); the values are given in column-major order.
    Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});

    // Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
    // Matrices.sparse(numRows, numCols, colPtrs, rowIndices, values) in CSC format:
    // colPtrs marks where each column's entries start in rowIndices/values.
    Matrix sm = Matrices.sparse(3, 2, new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});
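
    To make the CSC layout concrete, here is a small sketch that reads back the two matrices built above; toArray() returns every entry in column-major order, including the zeros the sparse format does not store explicitly:

    // colPtrs = {0, 1, 3}: column 0 owns entries [0, 1) and column 1 owns entries [1, 3)
    // of rowIndices/values, i.e. 9.0 at (0, 0), 6.0 at (2, 1) and 8.0 at (1, 1).
    double[] smEntries = sm.toArray(); // {9.0, 0.0, 0.0, 0.0, 8.0, 6.0}
    double[] dmEntries = dm.toArray(); // {1.0, 3.0, 5.0, 2.0, 4.0, 6.0}, same order as passed in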
