  • Spark Machine Learning: CountVectorizer

    Text feature extraction: CountVectorizer, document vectors based on term counts

    package Spark_MLlib

    import org.apache.spark.ml.feature.CountVectorizer
    import org.apache.spark.sql.SparkSession

    object 特征抽取_CountVectorizer {
      val spark = SparkSession.builder().master("local").appName("CountVectorizer").getOrCreate()
      import spark.implicits._

      def main(args: Array[String]): Unit = {
        // Five small documents, each already tokenized into an array of words.
        val df = spark.createDataFrame(Seq(
          (0, Array("soyo", "spark", "soyo2", "soyo", "8")),
          (1, Array("soyo", "hadoop", "soyo", "hadoop", "xiaozhou", "soyo2", "spark", "8", "8")),
          (2, Array("soyo", "spark", "soyo2", "hadoop", "soyo3", "8")),
          (3, Array("soyo", "spark", "soyo20", "hadoop", "soyo2", "8", "8")),
          (4, Array("soyo", "8", "spark", "8", "spark", "spark", "8"))
        )).toDF("id", "words")

        // Cap the vocabulary at 3 terms; a term must appear in at least 5 documents.
        // Terms are selected by term frequency across the corpus (all documents), highest first.
        val CountVectorizer_Model = new CountVectorizer()
          .setInputCol("words")
          .setOutputCol("features")
          .setVocabSize(3)
          .setMinDF(5)
          .fit(df)

        CountVectorizer_Model.vocabulary.foreach(println)
        CountVectorizer_Model.transform(df).show(false)
      }
    }

    Output:

    8
    spark
    soyo
    +---+----------------------------------------------------------+-------------------------+
    |id |words                                                     |features                 |
    +---+----------------------------------------------------------+-------------------------+
    |0  |[soyo, spark, soyo2, soyo, 8]                             |(3,[0,1,2],[1.0,1.0,2.0])|
    |1  |[soyo, hadoop, soyo, hadoop, xiaozhou, soyo2, spark, 8, 8]|(3,[0,1,2],[2.0,1.0,2.0])|
    |2  |[soyo, spark, soyo2, hadoop, soyo3, 8]                    |(3,[0,1,2],[1.0,1.0,1.0])|
    |3  |[soyo, spark, soyo20, hadoop, soyo2, 8, 8]                |(3,[0,1,2],[2.0,1.0,1.0])|
    |4  |[soyo, 8, spark, 8, spark, spark, 8]                      |(3,[0,1,2],[3.0,3.0,1.0])|
    +---+----------------------------------------------------------+-------------------------+
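    Deduplicating the words across the 5 documents gives the candidate terms. Because setMinDF(5) keeps only terms that occur in all 5 documents and setVocabSize(3) caps the vocabulary size, the resulting dictionary has 3 words: 8, spark, soyo, which are assigned the indices 0, 1, 2 respectively.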
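
    The selection rule described above can be sketched in plain Scala without Spark. This is only an illustration of the idea, not Spark's implementation: `VocabularySketch` and `buildVocabulary` are hypothetical names, and breaking ties alphabetically is an assumption of this sketch (the Spark output above happens to list spark before soyo, which both occur 7 times).

```scala
object VocabularySketch {
  // The same five token lists as in the example.
  val docs: Seq[Seq[String]] = Seq(
    Seq("soyo", "spark", "soyo2", "soyo", "8"),
    Seq("soyo", "hadoop", "soyo", "hadoop", "xiaozhou", "soyo2", "spark", "8", "8"),
    Seq("soyo", "spark", "soyo2", "hadoop", "soyo3", "8"),
    Seq("soyo", "spark", "soyo20", "hadoop", "soyo2", "8", "8"),
    Seq("soyo", "8", "spark", "8", "spark", "spark", "8")
  )

  // Hypothetical helper: keep terms found in at least minDF documents,
  // rank them by total count over the corpus, and keep the top vocabSize.
  def buildVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Seq[String] = {
    // Document frequency: number of documents that contain the term.
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => t -> xs.size }
    // Corpus term frequency: total occurrences over all documents.
    val termFreq: Map[String, Int] =
      docs.flatten.groupBy(identity).map { case (t, xs) => t -> xs.size }

    termFreq.toSeq
      .filter { case (t, _) => docFreq(t) >= minDF }
      .sortBy { case (t, n) => (-n, t) } // highest count first; ties alphabetical (assumption)
      .take(vocabSize)
      .map(_._1)
  }

  def main(args: Array[String]): Unit = {
    // Prints 8, soyo, spark (one per line).
    buildVocabulary(docs, vocabSize = 3, minDF = 5).foreach(println)
  }
}
```

    Note that soyo2 occurs in only 4 of the 5 documents, so minDF = 5 already excludes it before the size cap is applied.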
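    The document vector in the third column is printed as (vocabulary size, [term indices], [term counts]): the index array refers to positions in the dictionary, and the value array holds how often the term at each index occurs in that document. For example, document 0 contains 8 once, spark once and soyo twice, so its vector is (3,[0,1,2],[1.0,1.0,2.0]).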
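
    As a rough illustration of how such a vector could be assembled, the plain-Scala sketch below maps one document onto the dictionary from the output above. `SparseCountSketch` and `countVector` are hypothetical names introduced here, not part of Spark; tokens that are not in the dictionary are simply dropped.

```scala
object SparseCountSketch {
  // Hypothetical helper: returns (size, indices, values) for one document.
  def countVector(vocab: Seq[String], tokens: Seq[String]): (Int, Array[Int], Array[Double]) = {
    // Dictionary position of each term, e.g. "8" -> 0, "spark" -> 1, "soyo" -> 2.
    val index: Map[String, Int] = vocab.zipWithIndex.toMap
    // Count occurrences per dictionary index, ignoring out-of-vocabulary tokens.
    val counts: Map[Int, Double] =
      tokens.flatMap(index.get).groupBy(identity).map { case (i, xs) => i -> xs.size.toDouble }
    // Sort by index so that indices and values line up in ascending order.
    val sorted = counts.toSeq.sortBy(_._1)
    (vocab.size, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }

  def main(args: Array[String]): Unit = {
    val vocab = Seq("8", "spark", "soyo")
    // Document 4 from the example.
    val (size, idx, vals) = countVector(vocab, Seq("soyo", "8", "spark", "8", "spark", "spark", "8"))
    // Prints (3,[0,1,2],[3.0,3.0,1.0]), matching row 4 of the table.
    println(s"($size,[${idx.mkString(",")}],[${vals.mkString(",")}])")
  }
}
```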
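    The document vector is a sparse representation. With only 3 words this example does not show the benefit, but in real workloads the dictionary can exceed ten thousand entries while a single article contains only a few hundred or a few thousand of those words, so the count at most indices is 0.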
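
    To make the saving concrete, the sketch below expands one sparse vector into a dense array and counts the zeros. The vocabulary size of 10000 and the three index/count pairs are made-up illustration values, and `SparsitySketch` and `toDense` are hypothetical names rather than Spark APIs.

```scala
object SparsitySketch {
  // Hypothetical helper: expand (size, indices, values) into a dense array.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    val size = 10000                  // assumed dictionary size
    val indices = Array(3, 42, 977)   // assumed positions of the terms present
    val values = Array(2.0, 1.0, 5.0) // assumed counts of those terms
    val dense = toDense(size, indices, values)
    val zeros = dense.count(_ == 0.0)
    // The sparse form stores 3 entries; the dense form stores 10000, of which 9997 are zero.
    println(s"sparse entries: ${indices.length}, dense length: ${dense.length}, zeros: $zeros")
  }
}
```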

  • Original article: https://www.cnblogs.com/soyo/p/7748019.html