  • How Spark OneHot Encoding Works

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
    
    import spark.implicits._
    case class Person(id: Long, category: String, age: Long)
    val df = spark.createDataFrame(
        Seq(Person(0, "a", 10),
            Person(1, "b", 5),
            Person(2, "c", 4),
            Person(3, "a", 11),
            Person(4, "a", 20),
            Person(5, "c", 1)
        ))
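    // Map each category string to a numeric index; the most frequent label gets 0.0
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
    // Use OneHotEncoder to convert the category index into a sparse binary vector;
    // the last category is dropped by default, so it encodes as an all-zero vector
    val encoder = new OneHotEncoder().setInputCol(indexer.getOutputCol).setOutputCol("categoryClassVec")
    // Concatenate the one-hot vector and the age column into a single feature vector
    val assembler = new VectorAssembler().setInputCols(Array("categoryClassVec","age")).setOutputCol("features")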
    val pipeline = new Pipeline()
      .setStages(Array(indexer,encoder,assembler))
    val featureDF = pipeline.fit(df).transform(df)
    featureDF.show()
    
    +---+--------+---+-------------+----------------+--------------+
    | id|category|age|categoryIndex|categoryClassVec|      features|
    +---+--------+---+-------------+----------------+--------------+
    |  0|       a| 10|          0.0|   (2,[0],[1.0])|[1.0,0.0,10.0]|
    |  1|       b|  5|          2.0|       (2,[],[])| [0.0,0.0,5.0]|
    |  2|       c|  4|          1.0|   (2,[1],[1.0])| [0.0,1.0,4.0]|
    |  3|       a| 11|          0.0|   (2,[0],[1.0])|[1.0,0.0,11.0]|
    |  4|       a| 20|          0.0|   (2,[0],[1.0])|[1.0,0.0,20.0]|
    |  5|       c|  1|          1.0|   (2,[1],[1.0])| [0.0,1.0,1.0]|
    +---+--------+---+-------------+----------------+--------------+
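
To read the table above: `(2,[0],[1.0])` is Spark's sparse-vector notation, meaning a vector of size 2 whose non-zero entries sit at indices `[0]` with values `[1.0]`. With three categories and the encoder's default of dropping the last one, `b` (index 2.0) becomes the all-zero vector `(2,[],[])`. The plain-Scala sketch below reproduces the `categoryIndex` and `categoryClassVec` columns without Spark. It is only an illustration, not Spark's implementation: `OneHotSketch`, `indexLabels` and `oneHot` are hypothetical names, and breaking frequency ties alphabetically is an assumption of this sketch.

```scala
object OneHotSketch {
  // Step 1 (mimics StringIndexer's default ordering): the most frequent label gets index 0.
  // Assumption of this sketch: ties in frequency are broken alphabetically.
  def indexLabels(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // Step 2 (mimics OneHotEncoder): put a 1.0 at position `index`.
  // With dropLast = true the vector has numCategories - 1 slots,
  // so the last category has no slot and becomes all zeros.
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Vector[Double] = {
    val size = if (dropLast) numCategories - 1 else numCategories
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    // Same category column as the DataFrame above
    val categories = Seq("a", "b", "c", "a", "a", "c")
    val index = indexLabels(categories)
    categories.foreach { c =>
      val vec = oneHot(index(c), index.size)
      println(s"$c -> ${index(c)} -> ${vec.mkString("[", ",", "]")}")
    }
    // Prints:
    // a -> 0 -> [1.0,0.0]
    // b -> 2 -> [0.0,0.0]
    // c -> 1 -> [0.0,1.0]
    // a -> 0 -> [1.0,0.0]
    // a -> 0 -> [1.0,0.0]
    // c -> 1 -> [0.0,1.0]
  }
}
```

Dropping the last category keeps the encoded columns from always summing to one, which is why the dense `features` column shows two one-hot slots followed by `age`. Calling `oneHot(index, 3, dropLast = false)` in the sketch gives the full three-slot encoding instead.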
    
    1. python - How to interpret results of Spark OneHotEncoder - Stack Overflow
  • Original article: https://www.cnblogs.com/swordspoet/p/14673972.html