  • Feature Selection ---> Chi-Square Selector

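    Feature selection is the process of picking the "good" features out of a feature vector to form a new, more compact feature vector. It is widely used in high-dimensional data analysis: it removes redundant and irrelevant features, which can improve the learner's performance.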

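    Like machine learning methods in general, feature selection methods fall mainly into two groups: supervised and unsupervised. Chi-square selection is a supervised feature selection method commonly used in statistics. It runs a chi-square test between each feature and the true label to judge how strongly the two are associated, and uses that result to decide whether to keep the feature.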
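
    To make the test concrete, here is a minimal sketch in plain Scala (no Spark) of the Pearson chi-square statistic for a single feature against the label: the sum over every (feature value, label) cell of (observed - expected)^2 / expected, where expected = row total * column total / N. The object name ChiSquareSketch and the function chiSquare are made up for illustration. The sample values mirror the first feature column and the labels of the Spark example further down, and the feature is treated as categorical.

    ```scala
    object ChiSquareSketch {
      // Pearson chi-square statistic for one categorical feature against a categorical label.
      def chiSquare(feature: Seq[Double], label: Seq[Int]): Double = {
        require(feature.length == label.length, "feature and label must have the same length")
        val n = feature.length.toDouble

        // Observed count for each (feature value, label) pair that actually occurs.
        val observed: Map[(Double, Int), Int] =
          feature.zip(label).groupBy(identity).map { case (k, v) => k -> v.size }

        // Marginal counts per feature value and per label.
        val featureTotals = feature.groupBy(identity).map { case (k, v) => k -> v.size }
        val labelTotals = label.groupBy(identity).map { case (k, v) => k -> v.size }

        // Sum (observed - expected)^2 / expected over every cell, including empty cells.
        (for {
          (f, fCount) <- featureTotals.toSeq
          (l, lCount) <- labelTotals.toSeq
        } yield {
          val expected = fCount * lCount / n
          val obs = observed.getOrElse((f, l), 0).toDouble
          math.pow(obs - expected, 2) / expected
        }).sum
      }

      def main(args: Array[String]): Unit = {
        val feature = Seq(0.0, 0.0, 1.0, 0.0, 1.0) // first feature of the five sample rows
        val label = Seq(1, 0, 0, 1, 0)             // their labels
        println(f"chi-square = ${chiSquare(feature, label)}%.4f")
      }
    }
    ```

    For these five rows the statistic works out to 20/9, about 2.22. A larger statistic (equivalently, a smaller p-value for the given degrees of freedom) indicates a stronger association between the feature and the label.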

    package Spark_MLlib
    
    import org.apache.spark.ml.feature.ChiSqSelector
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession
    
    
    object 特征选择_卡方选择器 {
         val spark= SparkSession.builder().master("local").appName("Chi-square feature selection").getOrCreate()
         import spark.implicits._
      def main(args: Array[String]): Unit = {
        val df=spark.createDataFrame(Seq(
          (1,Vectors.dense(0,0,30,1),1),
          (2,Vectors.dense(0,1,20,0),0),
          (3,Vectors.dense(1,0,15,2),0),
          (4,Vectors.dense(0,1,28,0),1),  // changing the first 0 on this row to 1 gives a different result when 2 features are selected
          (5,Vectors.dense(1,0,27,0),0)
    
        )).toDF("id","features","label")
         df.show()
        // setNumTopFeatures(2): keep only the 2 features most strongly associated with the label
        val selector=new ChiSqSelector().setNumTopFeatures(2).setFeaturesCol("features").setLabelCol("label").setOutputCol("selectedFeatures")
        val selector_model=selector.fit(df)
        val result=selector_model.transform(df)
        result.show(false)
    
      }
    }

    Result:

    +---+------------------+-----+
    | id|          features|label|
    +---+------------------+-----+
    |  1|[0.0,0.0,30.0,1.0]|    1|
    |  2|[0.0,1.0,20.0,0.0]|    0|
    |  3|[1.0,0.0,15.0,2.0]|    0|
    |  4|[0.0,1.0,28.0,0.0]|    1|
    |  5|[1.0,0.0,27.0,0.0]|    0|
    +---+------------------+-----+

    +---+------------------+-----+----------------+
    |id |features          |label|selectedFeatures|
    +---+------------------+-----+----------------+
    |1  |[0.0,0.0,30.0,1.0]|1    |[0.0,30.0]      |
    |2  |[0.0,1.0,20.0,0.0]|0    |[0.0,20.0]      |
    |3  |[1.0,0.0,15.0,2.0]|0    |[1.0,15.0]      |
    |4  |[0.0,1.0,28.0,0.0]|1    |[0.0,28.0]      |
    |5  |[1.0,0.0,27.0,0.0]|0    |[1.0,27.0]      |
    +---+------------------+-----+----------------+
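
    In the output above, selectedFeatures keeps the first and third entries of features. To see why, one option is to print the per-feature test results and the indices the fitted model chose. The sketch below assumes a Spark version that ships org.apache.spark.ml.stat.ChiSquareTest (2.2 or later, as far as I know). The object name ChiSquareInspection and the helper sampleData are made up for illustration, and the data is the same five rows as above.

    ```scala
    import org.apache.spark.ml.feature.ChiSqSelector
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.ml.stat.ChiSquareTest
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ChiSquareInspection {
      // Hypothetical helper: rebuilds the five sample rows used in the example above.
      def sampleData(spark: SparkSession): DataFrame =
        spark.createDataFrame(Seq(
          (1, Vectors.dense(0, 0, 30, 1), 1),
          (2, Vectors.dense(0, 1, 20, 0), 0),
          (3, Vectors.dense(1, 0, 15, 2), 0),
          (4, Vectors.dense(0, 1, 28, 0), 1),
          (5, Vectors.dense(1, 0, 27, 0), 0)
        )).toDF("id", "features", "label")

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local").appName("ChiSquareInspection").getOrCreate()
        val df = sampleData(spark)

        // One row holding, per feature: p-value, degrees of freedom, chi-square statistic.
        val row = ChiSquareTest.test(df, "features", "label").head()
        println(s"pValues = ${row.getAs[Vector](0)}")
        println(s"degreesOfFreedom = ${row.getSeq[Int](1).mkString("[", ",", "]")}")
        println(s"statistics = ${row.getAs[Vector](2)}")

        // Indices of the feature columns kept by the fitted selector.
        val model = new ChiSqSelector()
          .setNumTopFeatures(2)
          .setFeaturesCol("features")
          .setLabelCol("label")
          .setOutputCol("selectedFeatures")
          .fit(df)
        println(s"selected indices = ${model.selectedFeatures.sorted.mkString(", ")}")

        spark.stop()
      }
    }
    ```

    To my understanding, with setNumTopFeatures the selector in Spark 2.1 and later ranks features by the p-value of this test, so the printed indices should correspond to the smallest p-values. Note that every distinct value of a feature is treated as its own category, which is why a column such as the third one (five distinct values) gets four degrees of freedom.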

  • Original article: https://www.cnblogs.com/soyo/p/7766197.html