[Original] Big Data Basics: Word Count
Counting word frequencies in a file is the "hello world" of big data. Here is how simple the implementation is with several different tools:

1 Linux, single machine

grep -E -o "[[:alpha:]]+" test_word.log | sort | uniq -c | sort -rn | head -10
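
The pipeline above can be read stage by stage. The following is a minimal sketch, assuming a POSIX shell and a grep that supports `-E -o` (GNU and BSD grep both do). The function name `top_words` and the trailing `awk` step are additions for illustration and are not part of the original post.

```shell
# top_words FILE [N]: print the N most frequent words in FILE (default 10).
# top_words is a hypothetical helper name used only for this sketch.
top_words() {
  grep -E -o '[[:alpha:]]+' "$1" |   # emit one word per line (letters only)
    sort |                           # bring identical words next to each other
    uniq -c |                        # collapse each run into "count word"
    sort -rn |                       # highest count first
    head -n "${2:-10}" |             # keep only the top N lines
    awk '{print $1, $2}'             # strip the padding uniq adds before the count
}
```

Calling `top_words test_word.log 3` would list the three most frequent words in the test file.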

2 Scala, single machine (Array)
line.split(" ").map((_, 1)).groupBy(_._1).map(_._2.reduce((v1, v2) => (v1._1, v1._2 + v2._2))).toArray.sortWith(_._2 > _._2).foreach(println)
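3 Spark, distributed (Scala)
import org.apache.spark.{SparkConf, SparkContext}
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
// Split each line on whitespace, sum the counts per word, and print the 10 most frequent words.
sc.textFile("test_word.log").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false).take(10).foreach(println)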
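4 Flink, distributed (Scala)
import org.apache.flink.api.scala._
import org.apache.flink.api.common.operators.Order
val env = ExecutionEnvironment.getExecutionEnvironment
// Note: sortPartition sorts within each partition, not globally.
env.readTextFile("test_word.log").flatMap(_.toLowerCase.split("\\s+")).map((_, 1)).groupBy(0).sum(1).sortPartition(1, Order.DESCENDING).first(10).print()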
5 MongoDB

>db.table_name.mapReduce(function(){ emit(this.column,1);}, function(key, values){return Array.sum(values);}, {out:"post_total"})

6 Hadoop example

hadoop jar /path/hadoop-2.6.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.1.jar wordcount /tmp/wordcount/input /tmp/wordcount/output

Appendix: the test file test_word.log contains:

hello world
hello www

Output (the relative order of words with equal counts may vary by tool):

2 hello
1 world
1 www
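
Because `world` and `www` both occur once, different tools may list them in either order. The sketch below, assuming a POSIX shell with `grep -E -o`, sorts by count descending and then by word ascending so that the listing is deterministic. The name `count_words_stable` is hypothetical, and the function reads its input from standard input rather than from a file.

```shell
# count_words_stable: read text on stdin and print "count word" lines,
# highest count first, with ties listed in alphabetical order.
# count_words_stable is a hypothetical name used only for this sketch.
count_words_stable() {
  grep -E -o '[[:alpha:]]+' |
    sort |
    uniq -c |
    awk '{print $1, $2}' |     # drop the padding uniq puts before the count
    sort -k1,1nr -k2,2         # count descending, then word ascending
}

# Feed the two-line test file from the appendix through the function.
printf 'hello world\nhello www\n' | count_words_stable
# prints:
# 2 hello
# 1 world
# 1 www
```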

Original post: https://www.cnblogs.com/barneywill/p/10115301.html
