How One-Hot Encoding Works in Spark

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

import spark.implicits._
case class Person(id: Long, category: String, age: Long)
val df = spark.createDataFrame(
    Seq(Person(0, "a", 10),
        Person(1, "b", 5),
        Person(2, "c", 4),
        Person(3, "a", 11),
        Person(4, "a", 20),
        Person(5, "c", 1)
    ))
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
/** Use OneHotEncoder to convert the indexed categorical variable into a sparse binary vector */
val encoder = new OneHotEncoder().setInputCol(indexer.getOutputCol).setOutputCol("categoryClassVec")
val assembler = new VectorAssembler().setInputCols(Array("categoryClassVec","age")).setOutputCol("features")
val pipeline = new Pipeline()
  .setStages(Array(indexer,encoder,assembler))
val featureDF = pipeline.fit(df).transform(df)
featureDF.show()

+---+--------+---+-------------+----------------+--------------+
| id|category|age|categoryIndex|categoryClassVec|      features|
+---+--------+---+-------------+----------------+--------------+
|  0|       a| 10|          0.0|   (2,[0],[1.0])|[1.0,0.0,10.0]|
|  1|       b|  5|          2.0|       (2,[],[])| [0.0,0.0,5.0]|
|  2|       c|  4|          1.0|   (2,[1],[1.0])| [0.0,1.0,4.0]|
|  3|       a| 11|          0.0|   (2,[0],[1.0])|[1.0,0.0,11.0]|
|  4|       a| 20|          0.0|   (2,[0],[1.0])|[1.0,0.0,20.0]|
|  5|       c|  1|          1.0|   (2,[1],[1.0])| [0.0,1.0,1.0]|
+---+--------+---+-------------+----------------+--------------+
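
Reading the output above: a sparse vector such as (2,[0],[1.0]) has size 2 and holds the value 1.0 at position 0. StringIndexer assigns indices by descending label frequency by default, so a (3 rows) gets 0.0, c (2 rows) gets 1.0, and b (1 row) gets 2.0. OneHotEncoder drops the last category by default (dropLast = true), so three categories yield vectors of length 2 and b is encoded as the all-zero vector (2,[],[]). The following is a minimal plain-Scala sketch of that logic, without any Spark dependency. The object OneHotSketch and its methods indexLabels and oneHot are hypothetical names introduced for illustration, not Spark APIs, and the alphabetical tie-breaking is an assumption of this sketch.

```scala
// Plain-Scala illustration of the index-then-encode steps; not Spark's implementation.
object OneHotSketch {
  // Assign indices by descending frequency, mirroring StringIndexer's default ordering.
  // Ties are broken alphabetically here (an assumption of this sketch).
  def indexLabels(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // One-hot encode an index. With dropLast = true the vector has numCategories - 1
  // slots, and the last category maps to the all-zero vector.
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Array[Double] = {
    val size = if (dropLast) numCategories - 1 else numCategories
    val vec = Array.fill(size)(0.0)
    if (index < size) vec(index) = 1.0
    vec
  }

  def main(args: Array[String]): Unit = {
    val categories = Seq("a", "b", "c", "a", "a", "c")
    val index = indexLabels(categories)
    // Print each category with its index and dense one-hot encoding.
    categories.foreach { c =>
      println(s"$c -> ${index(c)} -> ${oneHot(index(c), index.size).mkString("[", ",", "]")}")
    }
  }
}
```

Dropping the last category keeps the encoded columns from summing to 1 in every row, which avoids a redundant column in linear models. In Spark itself, calling setDropLast(false) on the encoder keeps one slot per category.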

Reference: How to interpret results of Spark OneHotEncoder (Stack Overflow)

Original article: https://www.cnblogs.com/swordspoet/p/14673972.html