  • Implementing word count in Hive

    a. Create a database

    create database word;
    

    b. Create the table

    create external table word_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/home/hadoop/worddata';
    Here we assume the data lives in HDFS under /home/hadoop/worddata; the directory holds a few plain-text word files whose content looks roughly like:
    
    
    hello man
    what are you doing now
    my running
    hello
    kevin
    hi man
    hadoop hive es
    storm hive es
    

    Running the HQL above creates the table word_data, whose rows are the lines of those files; each line is stored in the column line.

    	select * from word_data;
    	# shows the following rows:
    +-------------------------+--+
    |     word_data.line      |
    +-------------------------+--+
    |                         |
    | hello man               |
    | what are you doing now  |
    | my running              |
    | hello                   |
    | kevin                   |
    | hi man                  |
    | hadoop hive es          |
    | storm hive es           |
    |                         |
    |                         |
    +-------------------------+--+
    

    c. Split the data, as the map step in MapReduce would

    Each line must be split into individual words. This uses one of Hive's built-in table-generating functions (UDTF): explode(array), which takes an array and turns one row into multiple rows:

    create table words(word string);
    insert into table words select explode(split(line, " ")) as word from word_data;
    
    0: jdbc:hive2://bd004:10000> select * from words;
    +-------------+--+
    | words.word  |
    +-------------+--+
    |             |
    | hello       |
    | man         |
    | what        |
    | are         |
    | you         |
    | doing       |
    | now         |
    | my          |
    | running     |
    | hello       |
    | kevin       |
    | hi          |
    | man         |
    | hadoop      |
    | hive        |
    | es          |
    | storm       |
    | hive        |
    | es          |
    |             |
    |             |
    +-------------+--+
    

    split is the splitting function, with the same behavior as Java's String.split; here we split on spaces, so once the HQL runs, the words table holds one word per row.
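    The split-and-explode step can be sketched in plain Python (an illustrative equivalent, not Hive itself): split(line, " ") corresponds to str.split(" "), and explode corresponds to flattening the resulting lists into one word per row. The sample lines are taken from the article's data.

```python
# Python equivalent of:
#   select explode(split(line, " ")) as word from word_data
lines = [
    "hello man",
    "what are you doing now",
    "my running",
    "hello",
    "kevin",
    "hi man",
    "hadoop hive es",
    "storm hive es",
]

# split(line, " ") -> array of words; explode -> one output row per element
words = [word for line in lines for word in line.split(" ")]
print(words[:5])  # ['hello', 'man', 'what', 'are', 'you']
```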

    d. Basic implementation

    Since HQL supports group by, the final counting statement is:

    select word, count(*) from word.words group by word;
    # word.words is database.table; the word in `group by word` is the word string column defined by create table words(word string)
    
    
    +----------+------+--+
    |   word   | _c1  |
    +----------+------+--+
    |          | 3    |
    | are      | 1    |
    | doing    | 1    |
    | es       | 2    |
    | hadoop   | 1    |
    | hello    | 2    |
    | hi       | 1    |
    | hive     | 2    |
    | kevin    | 1    |
    | man      | 2    |
    | my       | 1    |
    | now      | 1    |
    | running  | 1    |
    | storm    | 1    |
    | what     | 1    |
    | you      | 1    |
    +----------+------+--+
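    For comparison, the same group-by count can be reproduced in Python with collections.Counter (an illustrative equivalent of the HQL above, not part of the article's pipeline; the word list comes from the article's sample data, without the empty strings that the blank source lines produce in the Hive output).

```python
from collections import Counter

# Words produced by the explode/split step on the article's sample data
words = [
    "hello", "man", "what", "are", "you", "doing", "now",
    "my", "running", "hello", "kevin", "hi", "man",
    "hadoop", "hive", "es", "storm", "hive", "es",
]

# Equivalent of: select word, count(*) from word.words group by word
counts = Counter(words)
print(counts["hello"])  # 2
print(counts["es"])     # 2
```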
    

    Summary: compared with hand-writing MapReduce jobs, Hive is clearly simpler; for more complex statistics you can create intermediate tables or views.

  • Original article: https://www.cnblogs.com/ernst/p/12819169.html