  • CountVectorizer in the Spark Java API


    A CountVectorizer converts a collection of text documents into a vector representing the word count of text documents.

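    Two parameters matter when building the vectors: VocabSize and MinDF. VocabSize caps the size of the dictionary. MinDF is the minimum number of documents a term must appear in: a term that appears in fewer than MinDF documents is left out of the dictionary (it is not treated as a dictionary word).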

    For example, suppose there are two documents: ["w1", "w2", "w4", "w5", "w2"] and ["w1", "w2", "w3"].

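    // vocabSize(3) and minDF(2) are assumed values; they yield the two-term dictionary discussed below
    CountVectorizer cv = new CountVectorizer().setInputCol("text").setOutputCol("feature").setVocabSize(3).setMinDF(2);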
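    To make the effect of the two parameters concrete, here is a minimal plain-Java sketch of how a dictionary could be chosen for the two documents above. It is an illustration only, not Spark's implementation: the class name VocabularySketch, the method buildVocabulary, and the alphabetical tie-break are all assumptions made for this example.

```java
import java.util.*;

final class VocabularySketch {
    /** Picks up to vocabSize terms that occur in at least minDF documents, most frequent first. */
    static List<String> buildVocabulary(List<List<String>> docs, int vocabSize, int minDF) {
        Map<String, Integer> termCount = new HashMap<>(); // total occurrences across all documents
        Map<String, Integer> docCount = new HashMap<>();  // number of documents containing the term
        for (List<String> doc : docs) {
            for (String term : doc) {
                termCount.merge(term, 1, Integer::sum);
            }
            for (String term : new HashSet<>(doc)) {      // count each term once per document
                docCount.merge(term, 1, Integer::sum);
            }
        }
        List<String> vocab = new ArrayList<>();
        for (String term : termCount.keySet()) {
            if (docCount.get(term) >= minDF) {            // drop terms seen in too few documents
                vocab.add(term);
            }
        }
        // Most frequent first; ties broken alphabetically (a choice made for this sketch).
        vocab.sort((a, b) -> {
            int byCount = Integer.compare(termCount.get(b), termCount.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(vocab.subList(0, Math.min(vocabSize, vocab.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("w1", "w2", "w4", "w5", "w2"),
                Arrays.asList("w1", "w2", "w3"));
        System.out.println(buildVocabulary(docs, 3, 2)); // prints [w2, w1]
    }
}
```

    With minDF = 2 only w1 and w2 survive, because w3, w4 and w5 each appear in a single document; w2 comes first because it occurs three times in total.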


    When the dictionary is not defined CountVectorizer iterates over the dataset twice to prepare
    the dictionary based on frequency and size.

    CountVectorizer first scans the Dataset (the text data) to build the dictionary, then scans it again to produce the vector model (CountVectorizerModel).

    When constructing the Dataset, a schema must be specified. The schema tells Spark how to interpret the data in each row of the Dataset.

            StructType schema = new StructType(new StructField[]{
                    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
            });

    A field inside a StructType. param: name The name of this field. param: dataType The data type of this field. param: nullable Indicates if values of this field can be null values. param: metadata The metadata of this field. The metadata should be preserved during transformation if the content of the column is not modified

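    The first argument is the field name; the second is dataType, the field's data type; the third indicates whether the field's value may be null; the fourth is the field's metadata.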


    import org.apache.spark.ml.feature.CountVectorizer;
    import org.apache.spark.ml.feature.CountVectorizerModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.*;
    import java.util.Arrays;
    import java.util.List;

    public class CounterVectorExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("CountVectorizer").master("spark://").getOrCreate();
            // Each Row holds one tokenized document.
            List<Row> data = Arrays.asList(
    //                RowFactory.create(Arrays.asList("a", "b", "c")),
    //                RowFactory.create(Arrays.asList("a", "b", "b", "c", "a")),
    //                RowFactory.create(Arrays.asList("a", "b", "a", "b"))
                    RowFactory.create(Arrays.asList("w1", "w2", "w3")),
                    RowFactory.create(Arrays.asList("w1", "w2", "w4", "w5", "w2"))
            );
            // A single non-nullable column "text" of type array<string>.
            StructType schema = new StructType(new StructField[]{
                    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
            });
            Dataset<Row> df = spark.createDataFrame(data, schema);
            // vocabSize(3) and minDF(2) are assumed values; they yield the two-term dictionary discussed below
            CountVectorizer cv = new CountVectorizer().setInputCol("text").setOutputCol("feature")
                    .setVocabSize(3).setMinDF(2);
            // fit() scans the data and learns the dictionary.
            CountVectorizerModel cvModel = cv.fit(df);
            // prior dictionary (matches the commented-out a/b/c rows above)
            CountVectorizerModel cvm = new CountVectorizerModel(new String[]{"a", "b", "c"}).setInputCol("text").setOutputCol("feature");
            // Print each document next to its sparse count vector.
            cvModel.transform(df).show(false);
            spark.stop();
        }
    }


    A sparse vector represented by an index array and a value array.

    param: size size of the vector. param: indices index array, assumed to be strictly increasing. param: values value array, must have the same length as the index array.

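    The first field is the vector length: the dictionary here contains only 2 terms, so the resulting vectors have length 2. The second field holds the indices, and the third holds the values at those indices. In the learned dictionary, position 0 holds w2 and position 1 holds w1, so [w1, w2, w3] becomes (2,[0,1],[1.0,1.0]) and [w1, w2, w4, w5, w2] becomes (2,[0,1],[2.0,1.0]).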
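    The mapping from a document to a (size, indices, values) triple can be sketched in plain Java as follows. This is a hedged illustration of the counting logic only, not Spark's code: the class SparseCountSketch and its encode method are hypothetical names, and the dictionary [w2, w1] is passed in by hand.

```java
import java.util.*;

final class SparseCountSketch {
    /** Formats the term counts of doc over vocab as "(size,[indices],[values])". */
    static String encode(List<String> vocab, List<String> doc) {
        double[] counts = new double[vocab.size()];
        for (String term : doc) {
            int index = vocab.indexOf(term);
            if (index >= 0) {              // terms outside the dictionary are ignored
                counts[index] += 1.0;
            }
        }
        List<Integer> indices = new ArrayList<>();
        List<Double> values = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {           // only non-zero entries are stored
                indices.add(i);
                values.add(counts[i]);
            }
        }
        return "(" + vocab.size() + "," + join(indices) + "," + join(values) + ")";
    }

    private static String join(List<?> items) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (Object item : items) {
            joiner.add(String.valueOf(item));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("w2", "w1");
        System.out.println(encode(vocab, Arrays.asList("w1", "w2", "w3")));             // (2,[0,1],[1.0,1.0])
        System.out.println(encode(vocab, Arrays.asList("w1", "w2", "w4", "w5", "w2"))); // (2,[0,1],[2.0,1.0])
    }
}
```

    Because the loop walks the dictionary positions in order, the index array comes out strictly increasing, as the SparseVector description requires.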

    Of course, the dictionary can also be defined in advance by passing it in when constructing the CountVectorizerModel: ["w1", "w2", "w3"]

            //prior dictionary
            CountVectorizerModel cvm = new CountVectorizerModel(new String[]{"w1", "w2", "w3"}).setInputCol("text").setOutputCol("feature");

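    For the text [w1, w2, w3], every term is in the dictionary and occurs once, so the sparse feature vector is (3,[0,1,2],[1.0,1.0,1.0]). Here 3 is the vector length (three dimensions), [0,1,2] are the indices, and [1.0,1.0,1.0] are the values at those indices (each term occurs exactly once). For the text [w1, w2, w4, w5, w2], w4 and w5 are not in the dictionary, w1 occurs once and w2 occurs twice, so its feature vector is (3,[0,1],[1.0,2.0]).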
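    One way to read such a triple is to expand it into a dense array, where every position not listed in the index array is 0.0. The sketch below does this in plain Java; SparseToDenseSketch and toDense are hypothetical names for this illustration and are not part of the Spark API. The second call uses the triple that follows from the counts described above for [w1, w2, w4, w5, w2] (w1 once, w2 twice, w3 absent).

```java
import java.util.Arrays;

final class SparseToDenseSketch {
    /** Expands (size, indices, values) into a dense array; unlisted positions stay 0.0. */
    static double[] toDense(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        double[] dense = new double[size];
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] = values[i];
        }
        return dense;
    }

    public static void main(String[] args) {
        // [w1, w2, w3] against the dictionary [w1, w2, w3]
        System.out.println(Arrays.toString(
                toDense(3, new int[]{0, 1, 2}, new double[]{1.0, 1.0, 1.0}))); // [1.0, 1.0, 1.0]
        // [w1, w2, w4, w5, w2]: w3 is absent, so position 2 stays 0.0
        System.out.println(Arrays.toString(
                toDense(3, new int[]{0, 1}, new double[]{1.0, 2.0})));         // [1.0, 2.0, 0.0]
    }
}
```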




  • Original article: https://www.cnblogs.com/hapjin/p/9899164.html