zoukankan html css js c++ java

Hive实现wordCount

a. 创建一个数据库

create database word;

b. 建表

create external table word_data(line string) row format delimited fields terminated by '
' stored as textfile location '/home/hadoop/worddata';
这里假设我们的数据存放在hadoop下，路径为：/home/hadoop/worddata，里面主要是一些单词文件，内容大概为：

+-------------------------+--+
|     word_data.line      |
+-------------------------+--+
|                         |
| hello man               |
| what are you doing now  |
| my running              |
| hello                   |
| kevin                   |
| hi man                  |
| hadoop hive es          |
| storm hive es           |
|                         |
|                         |
+-------------------------+--+

执行了上述hql就会创建一张表src_data，内容是这些文件的每行数据，每行数据存在字段line中，

	select * from word_data;
	#就可以看到这些数据:
+-------------------------+--+
|     word_data.line      |
+-------------------------+--+
|                         |
| hello man               |
| what are you doing now  |
| my running              |
| hello                   |
| kevin                   |
| hi man                  |
| hadoop hive es          |
| storm hive es           |
|                         |
|                         |
+-------------------------+--+

c. 根据MapReduce的规则，需要进行拆分

把每行数据拆分成单词，这里需要用到一个hive的内置表生成函数（UDTF）：explode(array)，参数是array，
其实就是行变多列：

create table words(word string);
insert into table words select explode(split(line, " ")) as word from word_data;

0: jdbc:hive2://bd004:10000> select * from words;
+-------------+--+
| words.word  |
+-------------+--+
|             |
| hello       |
| man         |
| what        |
| are         |
| you         |
| doing       |
| now         |
| my          |
| running     |
| hello       |
| kevin       |
| hi          |
| man         |
| hadoop      |
| hive        |
| es          |
| storm       |
| hive        |
| es          |
|             |
|             |
+-------------+--+

split是拆分函数，跟java的split功能一样，这里是按照空格拆分，所以执行完hql语句，words表里面就全部保存的单个单词

d. 基本实现

因为hql可以group by，所以最后统计语句为：

select word, count(*) from word.words group by word;
#word.words 库名称.表名称，group by word这个word是create table words(word string) 命令创建的word string

+----------+------+--+
|   word   | _c1  |
+----------+------+--+
|          | 3    |
| are      | 1    |
| doing    | 1    |
| es       | 2    |
| hadoop   | 1    |
| hello    | 2    |
| hi       | 1    |
| hive     | 2    |
| kevin    | 1    |
| man      | 2    |
| my       | 1    |
| now      | 1    |
| running  | 1    |
| storm    | 1    |
| what     | 1    |
| you      | 1    |
+----------+------+--+

总结：对比写MR和使用hive，还是hive比较简便，对于比较复杂的统计操作可以建一些中间表，或者一些视图之类的。

查看全文

相关阅读:
java注解说明
 paypal
eclispe查看jdk源码后特别卡顿导致未响应解决
 ubuntu+tomcat，多环境、自动化部署脚本，git+maven+tomcat+ubuntu
ubuntu+let's encrypt生成永久免费https证书 ubuntu+tomcat+nginx+let's encrypt
MySQL创建数据库与创建用户以及授权
 linux查找并杀死进程shell
redmine安装笔记
 C#动态获取本机可用串口的两种方式
 C# 控件缩写规范

原文地址：https://www.cnblogs.com/ernst/p/12819169.html