  • Implementing word count in Hive

    a. Create a database

    create database word;
    

    b. Create the table

    create external table word_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/home/hadoop/worddata';
    Here we assume the data sits in Hadoop under the path /home/hadoop/worddata; the directory holds some word files whose content looks roughly like:
    
    
    +-------------------------+--+
    |     word_data.line      |
    +-------------------------+--+
    |                         |
    | hello man               |
    | what are you doing now  |
    | my running              |
    | hello                   |
    | kevin                   |
    | hi man                  |
    | hadoop hive es          |
    | storm hive es           |
    |                         |
    |                         |
    +-------------------------+--+
    

    Running the HQL above creates the table word_data; each line of the files becomes one row, stored in the field line. Then

    	select * from word_data;
    	# shows the data:
    +-------------------------+--+
    |     word_data.line      |
    +-------------------------+--+
    |                         |
    | hello man               |
    | what are you doing now  |
    | my running              |
    | hello                   |
    | kevin                   |
    | hi man                  |
    | hadoop hive es          |
    | storm hive es           |
    |                         |
    |                         |
    +-------------------------+--+
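    For intuition, the external table's behavior so far — one input line per row, held in a single line field — can be mimicked in plain Python. This is only a sketch; the sample text is the data shown above, not read from the assumed /home/hadoop/worddata path:

    ```python
    # Mimic Hive's external table: each input line becomes one row
    # with a single "line" field (sample data matches the table above).
    raw_text = """\
    hello man
    what are you doing now
    my running
    hello
    kevin
    hi man
    hadoop hive es
    storm hive es
    """

    # Equivalent of: select * from word_data;
    word_data = [{"line": line} for line in raw_text.splitlines()]

    for row in word_data:
        print(row["line"])
    ```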
    

    c. Split the lines, as the MapReduce model requires

    Split each row into words. This uses one of Hive's built-in table-generating functions (UDTF): explode(array), which takes an array and turns one row into multiple rows:

    create table words(word string);
    insert into table words select explode(split(line, " ")) as word from word_data;
    
    0: jdbc:hive2://bd004:10000> select * from words;
    +-------------+--+
    | words.word  |
    +-------------+--+
    |             |
    | hello       |
    | man         |
    | what        |
    | are         |
    | you         |
    | doing       |
    | now         |
    | my          |
    | running     |
    | hello       |
    | kevin       |
    | hi          |
    | man         |
    | hadoop      |
    | hive        |
    | es          |
    | storm       |
    | hive        |
    | es          |
    |             |
    |             |
    +-------------+--+
    

    split is the string-splitting function, with the same behavior as Java's split; here we split on spaces, so once the HQL statement runs, the words table holds one word per row.
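    The explode(split(line, " ")) step amounts to splitting each row and flattening the results; a minimal Python sketch of the same transformation, using the sample rows from above:

    ```python
    # Equivalent of: select explode(split(line, " ")) as word from word_data;
    # split(line, " ") -> array of words; explode -> one output row per element.
    lines = [
        "hello man",
        "what are you doing now",
        "my running",
        "hello",
        "kevin",
        "hi man",
        "hadoop hive es",
        "storm hive es",
    ]

    words = [word for line in lines for word in line.split(" ")]
    print(words[:5])  # → ['hello', 'man', 'what', 'are', 'you']
    ```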

    d. The counting query

    Since HQL supports group by, the final counting statement is:

    select word, count(*) from word.words group by word;
    # word.words is database.table; the word in group by word is the word string column defined by create table words(word string)
    
    
    +----------+------+--+
    |   word   | _c1  |
    +----------+------+--+
    |          | 3    |
    | are      | 1    |
    | doing    | 1    |
    | es       | 2    |
    | hadoop   | 1    |
    | hello    | 2    |
    | hi       | 1    |
    | hive     | 2    |
    | kevin    | 1    |
    | man      | 2    |
    | my       | 1    |
    | now      | 1    |
    | running  | 1    |
    | storm    | 1    |
    | what     | 1    |
    | you      | 1    |
    +----------+------+--+
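    The group by word aggregation is just a frequency count; in Python the same result is a Counter over the flattened word list:

    ```python
    from collections import Counter

    # Equivalent of: select word, count(*) from word.words group by word;
    words = [
        "hello", "man", "what", "are", "you", "doing", "now",
        "my", "running", "hello", "kevin", "hi", "man",
        "hadoop", "hive", "es", "storm", "hive", "es",
    ]

    counts = Counter(words)
    for word in sorted(counts):
        print(word, counts[word])
    ```

    Note that the Hive output above also has an empty-string bucket with count 3; it comes from the blank lines in the source files, which this sketch's word list omits.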
    

    Summary: compared with writing MapReduce by hand, Hive is clearly simpler; for more complex statistics you can build intermediate tables or views along the way.

  • Original article: https://www.cnblogs.com/ernst/p/12819169.html