  • Spark Machine Learning: Word2Vec

    package Spark_MLlib
    
    import org.apache.spark.ml.feature.Word2Vec
    import org.apache.spark.sql.SparkSession
    
    
    object 特征抽取_Word2Vec {
      val spark = SparkSession.builder().master("local").appName("Word2Vec").getOrCreate()
      import spark.implicits._
    
      def main(args: Array[String]): Unit = {
        // Each row is one document, represented as an array of words
        val documentDF = spark.createDataFrame(Seq(
          "soyo like spark and hadoop".split(" "),
          "scala is good tool to study".split(" "),
          "but java i want to study and spark".split(" "),
          "soyo like spark and hadoop ".split(" ")
        ).map(Tuple1.apply)).toDF("text")
    
        // Set the feature vector dimension to 5; a minCount of 0 keeps every word
        val word2Vec = new Word2Vec()
          .setInputCol("text")
          .setOutputCol("result")
          .setVectorSize(5)
          .setMinCount(0)
        val word2Vec_model = word2Vec.fit(documentDF)      // train the model
        val result = word2Vec_model.transform(documentDF)  // convert each document into a feature vector
        result.show(false)
      }
    }
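
For context, `transform` in Spark ML's Word2Vec model represents each document by the average of the vectors of its words. The following is a minimal sketch of that averaging step in plain Scala, not Spark's actual implementation. The object name `AverageWordVectors`, the helper `documentVector`, and the two-dimensional word vectors are all made up for illustration, and the sketch assumes every word in a non-empty document has a vector.

```scala
object AverageWordVectors {
  // Hypothetical 2-dimensional word vectors, invented for this sketch;
  // a trained model would learn these values from the corpus.
  val wordVectors: Map[String, Array[Double]] = Map(
    "soyo"  -> Array(0.5, -0.25),
    "like"  -> Array(0.25, 0.75),
    "spark" -> Array(-0.75, 0.25)
  )

  // Average the word vectors of a document, one dimension at a time.
  // Assumes the document is non-empty and every word is in wordVectors.
  def documentVector(words: Seq[String]): Array[Double] = {
    val vectors = words.map(wordVectors)
    val dim = vectors.head.length
    Array.tabulate(dim)(i => vectors.map(_(i)).sum / words.size)
  }

  def main(args: Array[String]): Unit = {
    val a = documentVector(Seq("soyo", "like", "spark"))
    val b = documentVector(Seq("soyo", "like", "spark"))
    // Identical word sequences always average to the same vector
    println(a.sameElements(b))
  }
}
```

Because the document vector depends only on the words it contains, two documents with the same words necessarily map to the same point in the feature space.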
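    Result: identical documents get identical feature vectors, and the more similar two documents are, the closer their vectors lie in the feature space.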
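    +-------------------------------------------+-------------------------------------------------------------------------------------------------------------+
    |text                                       |result                                                                                                       |
    +-------------------------------------------+-------------------------------------------------------------------------------------------------------------+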
    |[soyo, like, spark, and, hadoop]           |[0.010919421538710596,-0.013777335733175279,0.02715198565274477,-0.010085364431142808,0.019428260042332113]  |
    |[scala, is, good, tool, to, study]         |[-0.048216115372876324,-0.00931493720660607,0.0237591746263206,0.04614267808695634,0.018560086687405903]     |
    |[but, java, i, want, to, study, and, spark]|[0.025922087021172047,-0.027650322022964247,0.029493116540834308,-0.029830976389348507,-0.025802675168961287]|
    |[soyo, like, spark, and, hadoop]           |[0.010919421538710596,-0.013777335733175279,0.02715198565274477,-0.010085364431142808,0.019428260042332113]  |
    +-------------------------------------------+-------------------------------------------------------------------------------------------------------------+

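    The two documents highlighted in red in the original post (rows 1 and 4) are identical, so their feature vectors are identical too.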
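
One way to check how close two of these vectors are is cosine similarity. The sketch below is plain Scala with no Spark dependency; the object name `CosineSimilarityCheck` and the helper `cosine` are names chosen for this example, and the vectors are copied from the output table above.

```scala
object CosineSimilarityCheck {
  // Cosine similarity: dot product divided by the product of the norms.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Document vectors copied from the result table (rows 1 to 4)
  val doc1 = Array(0.010919421538710596, -0.013777335733175279, 0.02715198565274477, -0.010085364431142808, 0.019428260042332113)
  val doc2 = Array(-0.048216115372876324, -0.00931493720660607, 0.0237591746263206, 0.04614267808695634, 0.018560086687405903)
  val doc3 = Array(0.025922087021172047, -0.027650322022964247, 0.029493116540834308, -0.029830976389348507, -0.025802675168961287)
  val doc4 = Array(0.010919421538710596, -0.013777335733175279, 0.02715198565274477, -0.010085364431142808, 0.019428260042332113)

  def main(args: Array[String]): Unit = {
    println(f"doc1 vs doc4: ${cosine(doc1, doc4)}%.4f")
    println(f"doc1 vs doc2: ${cosine(doc1, doc2)}%.4f")
    println(f"doc1 vs doc3: ${cosine(doc1, doc3)}%.4f")
  }
}
```

A value near 1 means the two vectors point in almost the same direction, which is what the identical rows 1 and 4 produce. With a corpus this small, the scores between the other rows say little about real semantic similarity.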
  • Original post: https://www.cnblogs.com/soyo/p/7746957.html