  • Scala Tips

    One-line word count

    In Scala, a word count can be done with a single expression.

    Suppose we have the following text:

    Hello mr apache spark

    Hello world apache spark

    Hello we want study spark

    Hello we want study apache

    Hello apache and hadoop


    In the Scala REPL, store the data in a List to simulate the input:

    scala> var txt = List("Hello mr apache spark", "Hello world apache spark", "Hello we want study spark", "Hello we want study apache", "Hello apache and hadoop")
    txt: List[String] = List(Hello mr apache spark, Hello world apache spark, Hello we want study spark, Hello we want study apache, Hello apache and hadoop)

    What is the approach?

    1. Tokenize, producing a list that holds every word that occurs

    2. Count how many times each word occurs

    3. Sort
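The three steps above can be sketched as one small function before walking through them individually. This is only a minimal sketch: the name `wordCount`, the sample input, and the choice to split on a single space are assumptions made for illustration.

```scala
// Hypothetical helper; the name wordCount is an assumption, not from the article.
def wordCount(lines: List[String]): List[(String, Int)] = {
  // Step 1: tokenize every line into one flat list of words.
  val words: List[String] = lines.flatMap(_.split(" "))
  // Step 2: count how many times each word occurs.
  val counts: Map[String, Int] =
    words.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }
  // Step 3: sort by count, most frequent first.
  counts.toList.sortBy { case (_, count) => -count }
}

// Small made-up input with distinct counts, so the order is unambiguous.
val sample = List("Hello apache spark", "Hello spark", "Hello")
val result = wordCount(sample)
```

Naming each intermediate value makes the shape of the data at every step explicit, which is what the one-line version below compresses.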

    1. Tokenize

    First we map over the list and split each line of text on spaces. This yields a list of arrays, where each array holds the words of one line.

    scala> txt.map(_.split(" "))
    res7: List[Array[String]] = List(Array(Hello, mr, apache, spark), Array(Hello, world, apache, spark), Array(Hello, we, want, study, spark), Array(Hello, we, want, study, apache), Array(Hello, apache, and, hadoop))
    

      

    As shown, the result is a List[Array[String]]: each Array inside the list holds the individual words of one line.

    But we do not want a list of arrays; we want the list to hold the words directly. So run:

    scala> txt.map(_.split(" ")).flatten
    res8: List[String] = List(Hello, mr, apache, spark, Hello, world, apache, spark, Hello, we, want, study, spark, Hello, we, want, study, apache, Hello, apache, and, hadoop)
    

      

    This gives the shape we want.

    In fact, the two operations above, (1) map and (2) flatten, can be merged into a single operation, as follows:

    scala> txt.flatMap(_.split(" "))
    res9: List[String] = List(Hello, mr, apache, spark, Hello, world, apache, spark, Hello, we, want, study, spark, Hello, we, want, study, apache, Hello, apache, and, hadoop)
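As a quick check of the claim that `flatMap` is `map` followed by `flatten`, the sketch below compares both forms on a shortened input. The `messy` input and the `\s+` split are assumptions added for illustration; the article's data only ever uses single spaces.

```scala
val txt = List("Hello mr apache spark", "Hello world apache spark")

// map keeps one array per line; flatten then concatenates those arrays.
val viaMapFlatten: List[String] = txt.map(_.split(" ")).flatten
// flatMap does both in a single call.
val viaFlatMap: List[String] = txt.flatMap(_.split(" "))

// Assumed input with repeated spaces and a tab. Trimming and splitting on the
// regex \s+ avoids the empty strings that split(" ") would produce here.
val messy = List("  Hello   spark ", "Hello\tapache")
val cleaned: List[String] = messy.flatMap(_.trim.split("\\s+"))
```

Splitting on a single space is fine for the sample text; the regex form is worth considering only when the input is less tidy.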
    

      

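    2. Count

    With that result in hand, we need to count how many times each word occurs. How?

    The idea borrows the form of a Hadoop mapper function: turn each word into a key-value pair, with the word as the key and the number 1 as the value: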

    scala> txt.flatMap(_.split(" ")).map((_, 1))
    res10: List[(String, Int)] = List((Hello,1), (mr,1), (apache,1), (spark,1), (Hello,1), (world,1), (apache,1), (spark,1), (Hello,1), (we,1), (want,1), (study,1), (spark,1), (Hello,1), (we,1), (want,1), (study,1), (apache,1), (Hello,1), (apache,1), (and,1), (hadoop,1))
    

      

    This gives a list whose elements are individual key-value pairs (two-element tuples), not a Map.
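Since the later steps rely on `_._1` and `_._2`, here is a small sketch of how a pair behaves and why a list of pairs is not the same thing as a Map. All values are made up for illustration.

```scala
// A pair is a Tuple2; its fields are read with _1 and _2.
val pair: (String, Int) = ("Hello", 1)
val word: String = pair._1
val one: Int = pair._2

// The arrow syntax builds the same tuple.
val samePair: (String, Int) = "Hello" -> 1

// A list of pairs may repeat a key; a Map keeps only one value per key.
val pairs = List(("Hello", 1), ("spark", 1), ("Hello", 1))
val asMap: Map[String, Int] = pairs.toMap
```

This is why the pairs stay in a List at this stage: converting to a Map too early would collapse the repeated words that still need to be counted.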

    Then group by key:

    scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1)
    res21: scala.collection.immutable.Map[String,List[(String, Int)]] = Map(want -> List((want,1), (want,1)), world -> List((world,1)), hadoop -> List((hadoop,1)), spark -> List((spark,1), (spark,1), (spark,1)), apache -> List((apache,1), (apache,1), (apache,1), (apache,1)), Hello -> List((Hello,1), (Hello,1), (Hello,1), (Hello,1), (Hello,1)), mr -> List((mr,1)), we -> List((we,1), (we,1)), study -> List((study,1), (study,1)), and -> List((and,1)))
    

      

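    groupBy returns a Map: the key is the word, and the value is a list holding the (word, 1) tuples produced by the previous step.

    In this Map, the size of each value list is exactly the number of times the word occurs, so we convert each value into that count: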

    scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size))
    res2: scala.collection.immutable.Map[String,Int] = Map(want -> 2, world -> 1, hadoop -> 1, spark -> 3, apache -> 4, Hello -> 5, mr -> 1, we -> 2, study -> 2, and -> 1)
    

      

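    Here map takes an anonymous function in which t stands for one entry of the Map above (key: the word, value: the list). The function builds a tuple whose first element is the word and whose second element is the size of that list.

    The result is a Map from each word to its count.

    There is also a more concise way, using mapValues. (On Scala 2.13 and later, mapValues on a Map is deprecated and returns a lazy view; .view.mapValues(_.size).toMap is the replacement there.)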

    scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).mapValues(_.size)
    res14: scala.collection.immutable.Map[String,Int] = Map(want -> 2, world -> 1, hadoop -> 1, spark -> 3, apache -> 4, Hello -> 5, mr -> 1, we -> 2, study -> 2, and -> 1)
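Because only the size of each group is used, the intermediate (word, 1) pairs are not strictly required. The sketch below shows two alternative ways to reach the same Map; neither is from the original article, and both stick to methods that exist across Scala 2.x and 3.

```scala
val txt = List("Hello mr apache spark", "Hello world apache spark", "Hello we want study spark", "Hello we want study apache", "Hello apache and hadoop")
val words: List[String] = txt.flatMap(_.split(" "))

// Variant 1: group the words by themselves and take each group's size.
val byGroup: Map[String, Int] =
  words.groupBy(identity).map { case (word, group) => (word, group.size) }

// Variant 2: a single pass with foldLeft, incrementing a running count per word.
val byFold: Map[String, Int] =
  words.foldLeft(Map.empty[String, Int]) { (acc, word) =>
    acc.updated(word, acc.getOrElse(word, 0) + 1)
  }
```

The foldLeft variant avoids building a list per word, which may matter for large inputs; for the five sample lines the difference is negligible.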
    

      

    3. Sort

    scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size)).toList.sortBy(_._2)
    res5: List[(String, Int)] = List((world,1), (hadoop,1), (mr,1), (and,1), (want,2), (we,2), (study,2), (spark,3), (apache,4), (Hello,5))
    

      

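    This sorts the Map (key: word, value: count) by count.

    Because Map has no sortBy method, we convert the Map to a List first and then call sortBy.

    This gives a sorted result, but sortBy sorts in ascending order and we want descending order, so one last step is needed:

    scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size)).toList.sortBy(_._2).reverse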
    res8: List[(String, Int)] = List((Hello,5), (apache,4), (spark,3), (study,2), (we,2), (want,2), (and,1), (mr,1), (hadoop,1), (world,1))
    

      

    Calling the reverse method flips the order.
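Descending order can also be requested directly from sortBy, without a separate reverse. The sketch below uses a made-up `counts` Map with a subset of the sample words; the tie-break on the word itself is an addition for illustration, not something the article does.

```scala
val counts = Map("Hello" -> 5, "apache" -> 4, "spark" -> 3, "we" -> 2, "want" -> 2, "study" -> 2)

// Negating the count sorts from largest to smallest in one step.
val byNegation: List[(String, Int)] =
  counts.toList.sortBy { case (_, count) => -count }

// Equivalent: keep the key as it is and reverse the Int ordering.
val byReverseOrdering: List[(String, Int)] =
  counts.toList.sortBy(_._2)(Ordering[Int].reverse)

// Sorting on (-count, word) breaks ties alphabetically, so the result
// does not depend on the Map's iteration order.
val withTieBreak: List[(String, Int)] =
  counts.toList.sortBy { case (word, count) => (-count, word) }
```

Words with equal counts otherwise appear in whatever order the Map happens to iterate, which is why the tied entries in the transcript above come out in no particular order.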

    This gives the result we wanted.

    So, to summarize:

    txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size)).toList.sortBy(_._2).reverse
    

      

    scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).mapValues(_.size).toList.sortBy(_._2).reverse
    res18: List[(String, Int)] = List((Hello,5), (apache,4), (spark,3), (study,2), (we,2), (want,2), (and,1), (mr,1), (hadoop,1), (world,1))
    

      

    Print it to take a look:

    scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size)).toList.sortBy(_._2).reverse.map(println)
    (Hello,5)
    (apache,4)
    (spark,3)
    (study,2)
    (we,2)
    (want,2)
    (and,1)
    (mr,1)
    (hadoop,1)
    (world,1)
    res10: List[Unit] = List((), (), (), (), (), (), (), (), (), ())
    

      

    Or:

    scala> for( i <- txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).mapValues(_.size).toList.sortBy(_._2).reverse) println(i)
    (Hello,5)
    (apache,4)
    (spark,3)
    (study,2)
    (we,2)
    (want,2)
    (and,1)
    (mr,1)
    (hadoop,1)
    (world,1)
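`map(println)` works, but it also builds a list of Unit values, as the `res10` line above shows. For output, `foreach` is the usual choice. The sketch below uses a shortened, hard-coded `sorted` list and an arbitrary "word: count" format, both assumptions for illustration.

```scala
val sorted = List(("Hello", 5), ("apache", 4), ("spark", 3))

// foreach runs println only for its side effect and returns Unit,
// so no List[Unit] is built.
sorted.foreach { case (word, count) => println(s"$word: $count") }

// Building the text first keeps formatting separate from printing.
val report: String =
  sorted.map { case (word, count) => s"$word: $count" }.mkString("\n")
```

The for loop in the last transcript is equivalent to the foreach form; which one to use is a matter of taste.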
    

      

    Feel free to repost, and feedback is welcome.

    If this article helped you, please click Recommend. Thanks♪(・ω・)ノ

    https://www.cnblogs.com/bigdatacaoyu

  • Original article: https://www.cnblogs.com/bigdatacaoyu/p/10925404.html