  • Junior-Year Winter Break Study Progress Notes (20): Model Improvement and 11 Ways to Implement WordCount in Spark

    Preface

    Today I mainly studied the fourth lecture of the ten-lecture machine learning course, and then finished working through the common Spark Core operators, wrapping them up with a small WordCount summary.

    Machine Learning

    Today's lecture began with a systematic analysis of where model error comes from:

    In my own words: the model space limits the model's expressive power, so there is an inherent gap between the model and the true data, called the approximation error.
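    In more standard notation (my own addition, following the usual excess-risk decomposition), let $f^*$ be the best possible predictor, $f_{\mathcal{F}}^*$ the best predictor inside the model space $\mathcal{F}$, and $\hat{f}$ the model actually learned from data. With $R$ denoting the risk,

      $$R(\hat{f}) - R(f^*) = \underbrace{R(f_{\mathcal{F}}^*) - R(f^*)}_{\text{approximation error}} + \underbrace{R(\hat{f}) - R(f_{\mathcal{F}}^*)}_{\text{estimation error}}$$

    Enlarging $\mathcal{F}$ shrinks the approximation error, which is exactly what model improvement aims at.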
    Once we understand why this error exists, we can discuss how to improve the model's expressive power, i.e., model improvement. Today's lecture covered model ensembling and deep learning, with a detailed walkthrough of ensembling: the specific algorithms were decision trees, random forests, and AdaBoost. I won't restate the algorithms in detail here (with my limited powers of explanation I could hardly do them justice). That's all for today's content.
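    As a tiny illustration of the voting idea behind these ensembles (my own sketch, not code from the lecture; the Model type and ensemblePredict helper are invented for illustration):

      object EnsembleSketch {
        // Each "model" is just a function from a feature vector to a class label
        type Model = Vector[Double] => Int

        // Majority vote: collect every model's prediction and return the most
        // frequent label. Random forests vote uniformly like this; AdaBoost
        // uses a weighted variant of the same idea.
        def ensemblePredict(models: Seq[Model], x: Vector[Double]): Int =
          models.map(m => m(x)).groupBy(identity).maxBy(_._2.size)._1
      }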

    Spark

    Straight to the code, no rambling:

      import org.apache.spark.SparkContext
      import org.apache.spark.rdd.RDD
      import scala.collection.mutable

      // groupBy: group the words by themselves, then count each group's size
      def wordCount1(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val group: RDD[(String, Iterable[String])] = words.groupBy(word => word)
        val wordCount: RDD[(String, Int)] = group.mapValues(iter => iter.size)
      }
    
      // groupByKey: shuffle all the 1s for each key, then count them (no pre-aggregation)
      def wordCount2(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val wordOne: RDD[(String, Int)] = words.map((_, 1))
        val group: RDD[(String, Iterable[Int])] = wordOne.groupByKey()
        val wordCount: RDD[(String, Int)] = group.mapValues(iter => iter.size)
      }
    
      // reduceByKey: merge values per key, pre-aggregating within each partition first
      def wordCount3(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val wordOne: RDD[(String, Int)] = words.map((_, 1))
        val wordCount: RDD[(String, Int)] = wordOne.reduceByKey(_ + _)
      }
    
      // aggregateByKey: zero value 0; here the intra- and inter-partition functions are both _ + _
      def wordCount4(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val wordOne: RDD[(String, Int)] = words.map((_, 1))
        val wordCount: RDD[(String, Int)] = wordOne.aggregateByKey(0)(_ + _, _ + _)
      }
    
      // foldByKey: shorthand for aggregateByKey when both functions coincide
      def wordCount5(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val wordOne: RDD[(String, Int)] = words.map((_, 1))
        val wordCount: RDD[(String, Int)] = wordOne.foldByKey(0)(_ + _)
      }
    
      // combineByKey: the most general form (createCombiner, mergeValue, mergeCombiners)
      def wordCount6(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val wordOne: RDD[(String, Int)] = words.map((_, 1))
        val wordCount: RDD[(String, Int)] = wordOne.combineByKey(
          v => v,                   // createCombiner: the first value seen for a key
          (x: Int, y) => x + y,     // mergeValue: fold another value into the combiner
          (x: Int, y: Int) => x + y // mergeCombiners: merge combiners across partitions
        )
      }
    
      // countByKey: an action that returns the per-key counts as a local Map on the driver
      def wordCount7(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val wordOne: RDD[(String, Int)] = words.map((_, 1))
        val wordCount: collection.Map[String, Long] = wordOne.countByKey()
      }
    
      // countByValue: counts distinct elements directly, no (word, 1) mapping needed
      def wordCount8(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val wordCount: collection.Map[String, Long] = words.countByValue()
      }
    
    
      // reduce (aggregate and fold are analogous; see the sketch after this function)
      def wordCount9(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        // Turn each word into a single-entry Map, then merge all the maps with the reduce action
        val mapWord: RDD[mutable.Map[String, Long]] = words.map(word => mutable.Map[String, Long]((word, 1L)))
        val wordCount: mutable.Map[String, Long] = mapWord.reduce((map1, map2) => {
          map2.foreach {
            case (word, count) =>
              val newCount = map1.getOrElse(word, 0L) + count
              map1.update(word, newCount)
          }
          map1
        })
      }
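
    The comment above groups reduce with aggregate and fold but only writes out the reduce version; those two account for methods 10 and 11 in the title. Here is a sketch of how the aggregate and fold counterparts would look, assuming the same map-merging approach (the wordCount10/wordCount11 names and the factored-out merge function are my own):

      // aggregate: same map-merging idea, with an empty mutable.Map as the zero value
      def wordCount10(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val mapWord: RDD[mutable.Map[String, Long]] = words.map(word => mutable.Map[String, Long]((word, 1L)))
        // One merge function serves as both the intra- and inter-partition operation
        val merge = (map1: mutable.Map[String, Long], map2: mutable.Map[String, Long]) => {
          map2.foreach { case (word, count) =>
            map1.update(word, map1.getOrElse(word, 0L) + count)
          }
          map1
        }
        val wordCount: mutable.Map[String, Long] = mapWord.aggregate(mutable.Map[String, Long]())(merge, merge)
      }

      // fold: aggregate with a single function for both merge steps
      def wordCount11(sc: SparkContext): Unit = {
        val rdd: RDD[String] = sc.makeRDD(List("Hello Scala", "Hello Spark"))
        val words: RDD[String] = rdd.flatMap(_.split(" "))
        val mapWord: RDD[mutable.Map[String, Long]] = words.map(word => mutable.Map[String, Long]((word, 1L)))
        val merge = (map1: mutable.Map[String, Long], map2: mutable.Map[String, Long]) => {
          map2.foreach { case (word, count) =>
            map1.update(word, map1.getOrElse(word, 0L) + count)
          }
          map1
        }
        val wordCount: mutable.Map[String, Long] = mapWord.fold(mutable.Map[String, Long]())(merge)
      }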
    

    The code isn't difficult; all of it should be easy to follow.
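    For completeness, a minimal driver sketch (my own addition, assuming local mode) for actually running these. Note that wordCount1 through wordCount6 only build up lazy transformations, so nothing executes until an action such as collect() is called:

      import org.apache.spark.{SparkConf, SparkContext}

      object WordCountDemo {
        def main(args: Array[String]): Unit = {
          // local[*] for quick testing; point the master at a real cluster otherwise
          val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
          val sc = new SparkContext(conf)

          // Transformations are lazy: the collect() action triggers the job
          val counts = sc.makeRDD(List("Hello Scala", "Hello Spark"))
            .flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .collect()
          counts.foreach(println) // (Hello,2), (Scala,1), (Spark,1)

          sc.stop()
        }
      }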

    Summary

    For once I actually understood the machine learning lecture, which gave me quite a sense of accomplishment. And with that, the Spark Core portion comes to a close.

  • Original post: https://www.cnblogs.com/wushenjiang/p/14347301.html