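1: What is the difference between Spark ML and Spark MLlib?
Spark MLlib (package org.apache.spark.mllib) is the library built on the RDD abstraction. It has been in maintenance mode since Spark 2.0, and the Spark team has shifted development to Spark ML (package org.apache.spark.ml), which is built on the DataFrame API.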
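2: How do the matrix and vector definitions differ between Spark ML and Spark MLlib?
Comparing the matrix and vector classes in the two libraries (org.apache.spark.ml.linalg and org.apache.spark.mllib.linalg) shows that they are almost identical. They were duplicated so that Spark ML is easier to maintain going forward.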
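3: What is the difference between sparse and dense vectors in Spark ML?
Dense vector: backed by a complete Array[Double] that holds every element, zeros included.
Sparse vector: stores only the non-zero values, the index of each of those values, and the size of the vector. That is three pieces of information, which is far more storage-efficient when most elements are 0.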
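To make the two storage schemes concrete, here is a minimal plain-Scala sketch that needs no Spark dependency. The names `Sparse`, `toSparse`, `toDense`, `denseBytes` and `sparseBytes` are hypothetical helpers written for illustration and are not part of Spark. The byte counts are a rough payload estimate (8 bytes per Double, 4 bytes per Int index) that ignores object overhead.

```scala
object SparseVsDenseSketch {
  // Hypothetical stand-in for a sparse vector: size, indices of stored entries, their values.
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

  // Keep only the non-zero entries of a dense array, remembering where they were.
  def toSparse(dense: Array[Double]): Sparse = {
    val nonZero = dense.zipWithIndex.filter { case (v, _) => v != 0.0 }
    Sparse(dense.length, nonZero.map(_._2), nonZero.map(_._1))
  }

  // Rebuild the full array; positions not listed in indices stay 0.0.
  def toDense(s: Sparse): Array[Double] = {
    val out = new Array[Double](s.size)
    s.indices.zip(s.values).foreach { case (i, v) => out(i) = v }
    out
  }

  // Rough payload estimates (assumption: object headers and the size field are ignored).
  def denseBytes(size: Int): Long = 8L * size
  def sparseBytes(nonZeros: Int): Long = 12L * nonZeros

  def main(args: Array[String]): Unit = {
    val s = toSparse(Array(1.0, 0.0, 0.0, 2.0, 0.0))
    println(s"size=${s.size} indices=${s.indices.mkString(",")} values=${s.values.mkString(",")}")
    println(toDense(s).mkString("[", ",", "]"))
    println(s"dense=${denseBytes(1000000)} bytes, sparse=${sparseBytes(10)} bytes")
  }
}
```

With one million elements of which only ten are non-zero, the estimate is 8,000,000 bytes for the dense form against 120 bytes for the sparse form, which is the storage advantage described above.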
4: Dense vector example:
import org.apache.spark.ml.linalg.{DenseVector => MLDenseVector}

val mlDv = new MLDenseVector(Array[Double](1, 1, 1, 1, 1))
println(mlDv.argmax)
// compressed returns the vector in either sparse or dense form, chosen from the number of non-zero elements.
// Sparse storage keeps the non-zero values, their indices, and the vector size (three pieces of information).
println(mlDv.compressed)
val copy = mlDv.copy // deep copy
copy.foreachActive { (x, y) => println("index = " + x + " , value = " + y) }
// Number of active entries. An "active entry" is an element which is explicitly stored,
// regardless of its value. Note that inactive entries have value 0.
println(copy.numActives)
println(copy.numNonzeros)
println(copy.size)
// values is an Array[Double]; format it, otherwise println shows only the array reference
println(copy.values.mkString("[", ",", "]"))
println(copy.toSparse)
5: Sparse vector
import org.apache.spark.ml.linalg.{DenseVector => MLDenseVector, SparseVector => MLSparseVector}

val mlDv = new MLDenseVector(Array[Double](1, 0, 0, 0, 0))
println(mlDv.toSparse) // (5,[0],[1.0])
// SparseVector constructor: vector size, indices of the non-zero entries, values at those indices
val mlSv = new MLSparseVector(5, Array[Int](0, 3), Array[Double](1, 2))
println(mlSv) // (5,[0,3],[1.0,2.0])
println(mlSv.toDense) // [1.0,0.0,0.0,2.0,0.0]
println(mlSv.indices.toBuffer) // indices of the sparse vector's stored entries
A vector from mllib can be converted directly into an ML vector with asML.
// mllib dense vector
import org.apache.spark.mllib.linalg.{DenseVector => MLLIBDenseVector}

val mlDv = new MLLIBDenseVector(Array[Double](1, 0, 0, 0, 0))
mlDv.asML // converts directly to a Spark ML vector
6: Matrices in ML
import org.apache.spark.ml.linalg.{DenseMatrix => MLDenseMatrix}
import org.apache.spark.ml.linalg.{SparseMatrix => MLSparseMatrix}

// Dense matrix, stored column-major by default.
val notTranspose = new MLDenseMatrix(3, 2, Array[Double](1, 3, 5, 2, 4, 6))
// The fourth argument is isTransposed (default false); when true, the values are read row-major.
val mlDMtx = new MLDenseMatrix(3, 2, Array[Double](1, 2, 3, 4, 5, 6), true)
println(notTranspose)
println("-------------------------------------------------")
println(notTranspose.isTransposed)
println(notTranspose.transpose)
println(mlDMtx.isTransposed)
println("-------------------------------------------------")
println(mlDMtx)
println(mlDMtx.compressed)
println("-------------------------------------------------")
// Convert to a dense matrix stored column-major
println(mlDMtx.toDenseColMajor)
// Convert to a dense matrix stored row-major
println(notTranspose.toDenseRowMajor)
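The two constructor calls above describe the same 3 x 2 matrix, differing only in how the flat array is laid out. The following plain-Scala sketch (no Spark dependency) shows the index arithmetic for each layout. `MatrixLayoutSketch`, `at` and `rows` are hypothetical names introduced for illustration, and the formulas are the standard column-major and row-major ones rather than a copy of Spark's source.

```scala
object MatrixLayoutSketch {
  // Look up element (i, j) in a flat array, following the isTransposed flag.
  def at(numRows: Int, numCols: Int, values: Array[Double], isTransposed: Boolean)(i: Int, j: Int): Double =
    if (isTransposed) values(i * numCols + j) // row-major: each row is contiguous
    else values(i + numRows * j)              // column-major: each column is contiguous

  // Materialize the matrix as a list of rows for easy comparison.
  def rows(numRows: Int, numCols: Int, values: Array[Double], isTransposed: Boolean): List[List[Double]] =
    List.tabulate(numRows, numCols)((i, j) => at(numRows, numCols, values, isTransposed)(i, j))

  def main(args: Array[String]): Unit = {
    val colMajor = rows(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0), isTransposed = false)
    val rowMajor = rows(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), isTransposed = true)
    colMajor.foreach(r => println(r.mkString(" ")))
    println(colMajor == rowMajor) // both layouts describe the same matrix
  }
}
```

Both arrays decode to the rows (1, 2), (3, 4), (5, 6), so the choice of layout affects storage order only, not the matrix itself.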
7: Sparse matrix:
import org.apache.spark.ml.linalg.{SparseMatrix => MLSparseMatrix}

println("--------------------MLSparseMatrix-----------------------------")
// numRows - number of rows
// numCols - number of columns
// colPtrs - the index corresponding to the start of a new column
// rowIndices - the row index of the entry. They must be in strictly increasing order for each column
// values - non-zero matrix entries in column major
// rowIndices: (0, 2, 1, 0, 1, 2)
// colPtrs:    (0, 2, 3, 6) => (2-0, 3-2, 6-3) gives the number of non-zero entries in each column
// values:     (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
val mlSM = new MLSparseMatrix(3, 3, Array[Int](0, 2, 3, 6), Array[Int](0, 2, 1, 0, 1, 2), Array[Double](1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
println(mlSM.toDense)
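The colPtrs / rowIndices / values triple is the compressed sparse column (CSC) layout. As a minimal sketch of how it decodes, the plain-Scala code below (no Spark dependency) expands the same three arrays into a dense matrix by hand. `CscSketch` and its `toDense` are hypothetical helpers for illustration, not Spark's implementation.

```scala
object CscSketch {
  // Expand CSC arrays into a dense numRows x numCols matrix.
  def toDense(numRows: Int, numCols: Int,
              colPtrs: Array[Int], rowIndices: Array[Int],
              values: Array[Double]): Array[Array[Double]] = {
    val out = Array.ofDim[Double](numRows, numCols)
    // Column j owns the entries at positions colPtrs(j) until colPtrs(j + 1).
    for (j <- 0 until numCols; k <- colPtrs(j) until colPtrs(j + 1))
      out(rowIndices(k))(j) = values(k)
    out
  }

  def main(args: Array[String]): Unit = {
    val dense = toDense(3, 3,
      Array(0, 2, 3, 6),
      Array(0, 2, 1, 0, 1, 2),
      Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    dense.foreach(row => println(row.mkString(" ")))
  }
}
```

Column 0 holds 1.0 and 2.0 at rows 0 and 2, column 1 holds 3.0 at row 1, and column 2 holds 4.0, 5.0, 6.0 at rows 0, 1, 2, so the decoded rows are (1, 0, 4), (0, 3, 5), (2, 0, 6).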