  • Implementing wordCount in Hive

    a. Create a database

    create database word;
    use word;
    

    b. Create the table

    create external table word_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/home/hadoop/worddata';
    This assumes the data is stored in Hadoop under the path /home/hadoop/worddata, which holds a few text files of words with content roughly like this:
    
    
    +-------------------------+--+
    |     word_data.line      |
    +-------------------------+--+
    |                         |
    | hello man               |
    | what are you doing now  |
    | my running              |
    | hello                   |
    | kevin                   |
    | hi man                  |
    | hadoop hive es          |
    | storm hive es           |
    |                         |
    |                         |
    +-------------------------+--+
    

    Running the HQL above creates the table word_data. Its rows are the lines of those files, with each line stored in the column line.

    select * from word_data;
    -- this shows the data:
    +-------------------------+--+
    |     word_data.line      |
    +-------------------------+--+
    |                         |
    | hello man               |
    | what are you doing now  |
    | my running              |
    | hello                   |
    | kevin                   |
    | hi man                  |
    | hadoop hive es          |
    | storm hive es           |
    |                         |
    |                         |
    +-------------------------+--+
    

    c. Following the MapReduce pattern, split the lines into words

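    To split each line into words we use one of Hive's built-in table-generating functions (UDTF), explode(array). It takes an array as its argument
    and turns one row into multiple rows, one per array element: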

    create table words(word string);
    insert into table words select explode(split(line, " ")) as word from word_data;
    
    0: jdbc:hive2://bd004:10000> select * from words;
    +-------------+--+
    | words.word  |
    +-------------+--+
    |             |
    | hello       |
    | man         |
    | what        |
    | are         |
    | you         |
    | doing       |
    | now         |
    | my          |
    | running     |
    | hello       |
    | kevin       |
    | hi          |
    | man         |
    | hadoop      |
    | hive        |
    | es          |
    | storm       |
    | hive        |
    | es          |
    |             |
    |             |
    +-------------+--+
    

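    split is a string-splitting function that works like Java's split (the separator is treated as a regular expression). Here it splits on a space, so after the HQL runs the words table holds one word per row. The blank lines in the input come through as empty strings, as the output above shows.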
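
    explode can also be used through lateral view, which keeps the columns of the source row available next to the generated rows. The following is a minimal sketch, assuming the word_data table from step b is in the current database; the alias w is a hypothetical name chosen for illustration.

    ```sql
    -- Assumption: word_data(line string) from step b exists in the current database.
    -- "w" is an illustrative alias for the virtual table produced by explode.
    select line, w.word
    from word_data
    lateral view explode(split(line, ' ')) w as word;
    ```

    Each input line is repeated once per word it contains, so it is easy to see which line a word came from.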

    d. Basic implementation

    Since HQL supports group by, the final counting statement is:

    select word, count(*) from word.words group by word;
    -- word.words is database_name.table_name; the word in "group by word" is the column defined by create table words(word string)
    
    
    +----------+------+--+
    |   word   | _c1  |
    +----------+------+--+
    |          | 3    |
    | are      | 1    |
    | doing    | 1    |
    | es       | 2    |
    | hadoop   | 1    |
    | hello    | 2    |
    | hi       | 1    |
    | hive     | 2    |
    | kevin    | 1    |
    | man      | 2    |
    | my       | 1    |
    | now      | 1    |
    | running  | 1    |
    | storm    | 1    |
    | what     | 1    |
    | you      | 1    |
    +----------+------+--+
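
    The first row of the result, with a count of 3, is the empty string produced by the blank input lines, and _c1 is the column name Hive generates when the count has no alias. A possible variant, sketched here with the assumed alias cnt, filters out the empty strings and sorts the most frequent words first:

    ```sql
    -- Assumption: the words table from step c lives in the database "word".
    -- "cnt" is an illustrative alias; any valid column name would do.
    select word, count(*) as cnt
    from word.words
    where word <> ''
    group by word
    order by cnt desc, word;
    ```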
    

    Summary: compared with writing MapReduce by hand, Hive is the simpler option. For more complex statistics you can build intermediate tables or views.
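
    As an illustration of the view approach, the split and count steps could be wrapped in a single view instead of the intermediate words table. This is only a sketch: the view name word_counts and the alias cnt are hypothetical, and it assumes word_data was created in the database word.

    ```sql
    -- Assumption: word.word_data(line string) exists, as created in step b.
    -- "word_counts", "t" and "cnt" are illustrative names, not from the original post.
    create view if not exists word.word_counts as
    select t.word, count(*) as cnt
    from (
        select explode(split(line, ' ')) as word
        from word.word_data
    ) t
    group by t.word;

    -- Querying the view recomputes the counts from the current files.
    select * from word.word_counts;
    ```

    Because a view stores only the query, files added to the table's location later are picked up without re-running an insert.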

  • Original article: https://www.cnblogs.com/ernst/p/12819169.html