  • Custom Map/Reduce in Hive: A Java Example

    Hive supports custom map and reduce scripts. A simple word count example shows how they work.

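    If you write these scripts in plain Java, you have to handle System.in, System.out, and all the key/value logic yourself, which is tedious. Someone wrote a small framework that lets you write code much like Hadoop's map and reduce, so you only need to focus on the map and reduce logic. This framework is now bundled with Hive as $HIVE_HOME/lib/hive-contrib-2.3.0.jar; the name of the contrib jar may differ between Hive versions.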
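To make the tedious part concrete, here is a minimal sketch of a word count mapper written without the framework. The class name RawWordCountMap and the helper mapLine are hypothetical, and the sketch assumes Hive feeds the script one row per line on standard input and reads tab-separated columns back from standard output.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written mapper, shown only to illustrate the plumbing
// that the framework otherwise takes care of.
public class RawWordCountMap {

    // Turns one input line into "word<TAB>1" output rows.
    static List<String> mapLine(String line) {
        List<String> rows = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {   // split can yield an empty token, e.g. for an empty line
                rows.add(word + "\t1");
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        String line;
        // Read rows until the input stream is closed, emitting one row per word.
        while ((line = in.readLine()) != null) {
            for (String row : mapLine(line)) {
                out.println(row);
            }
        }
    }
}
```

Even this small mapper has to deal with stream setup, encoding, and output formatting by hand; a reducer would additionally have to detect where one key ends and the next begins.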

    IDE: IntelliJ IDEA
    JDK: 1.7
    Hive: 2.3.0
    Hadoop: 2.8.1

    1. Write the map and reduce classes

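    Map class:
    import org.apache.hadoop.hive.contrib.mr.GenericMR;
    import org.apache.hadoop.hive.contrib.mr.Mapper;
    import org.apache.hadoop.hive.contrib.mr.Output;

    public class WordCountMap {
        public static void main(String[] args) throws Exception {
            new GenericMR().map(System.in, System.out, new Mapper() {
                @Override
                public void map(String[] record, Output output) throws Exception {
                    // record holds the tab-separated columns of one input row.
                    // If the source text is already tab-delimited into words, the inner split is unnecessary.
                    for (String column : record) {
                        for (String word : column.split("\\W+")) {
                            if (!word.isEmpty()) { // skip empty tokens produced by split
                                output.collect(new String[]{word, "1"});
                            }
                        }
                    }
                }
            });
        }
    }

    Reduce class:
    import java.util.Iterator;

    import org.apache.hadoop.hive.contrib.mr.GenericMR;
    import org.apache.hadoop.hive.contrib.mr.Output;
    import org.apache.hadoop.hive.contrib.mr.Reducer;

    public class WordCountReducer {
        public static void main(String[] args) throws Exception {
            new GenericMR().reduce(System.in, System.out, new Reducer() {
                @Override
                public void reduce(String key, Iterator<String[]> records, Output output) throws Exception {
                    int sum = 0;
                    // Each record is {word, count}; add up the counts for this key.
                    while (records.hasNext()) {
                        sum += Integer.parseInt(records.next()[1]);
                    }
                    output.collect(new String[]{key, String.valueOf(sum)});
                }
            });
        }
    }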

    2. Export the jar

    Next, export a jar that bundles hive-contrib-2.3.0. Assume the exported jar is named wordcount.jar.

     
    In IntelliJ, the steps are:
    File -> Project Structure
    Add a new entry under Artifacts
    Leave Main Class empty and click OK
    Configure the contents of the jar
    Build the jar
     

    3. Write the Hive SQL

    drop table if exists raw_lines;
    
    -- Create table raw_lines over all the lines under '/user/inputs' (a path on your HDFS)
    create external table if not exists raw_lines(line string)
    ROW FORMAT DELIMITED
    stored as textfile
    location '/user/inputs';
    
    drop table if exists word_count;
    
    -- Create the output table word_count; its data is stored as text under '/user/outputs/' (a path on your HDFS)
    
    create external table if not exists word_count(word string, count int)
     ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '\t'
     LINES TERMINATED BY '\n'
     STORED AS TEXTFILE LOCATION '/user/outputs/';
    
    
    -- Add the jar containing the mapper and reducer as a resource; replace your/local/path with the real path.
    -- Use "add file", not "add jar"; otherwise Hive will not find the map and reduce main classes.
    add file your/local/path/wordcount.jar;
    
    from (
            from raw_lines
            map raw_lines.line
            --call the mapper here
            using 'java -cp wordcount.jar WordCountMap'
            as word, count
            cluster by word) map_output
    insert overwrite table word_count
    reduce map_output.word, map_output.count
    --call the reducer here
    using 'java -cp wordcount.jar WordCountReducer'
    as word,count;

    Save this Hive SQL as wordcount.hql.

    4. Run the Hive SQL

    beeline -u [hiveserver] -n username -f wordcount.hql

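    A brief note on how custom map and reduce scripts work inside Hive:
    Hive reads the text file and writes it, line by line, to the map script's standard input. The user-defined map reads that stream, processes each line, and writes the result to standard output in an agreed format (for example, a key and a value separated by a tab). Hive then sorts the emitted rows and feeds them to the reduce script's standard input. The reduce reads the rows, processes them, and writes its results to standard output, which Hive finally parses and stores in the target table.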
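The flow described above can be modelled in memory with a small sketch. StreamingFlowSketch and its methods are hypothetical names, and the sort call only stands in for the ordering Hive applies through "cluster by word"; none of this is Hive's actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory model of the map -> sort -> reduce flow.
public class StreamingFlowSketch {

    // Map stage: every word in every line becomes a "word<TAB>1" row.
    static List<String> mapStage(List<String> lines) {
        List<String> rows = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\W+")) {
                if (!word.isEmpty()) {
                    rows.add(word + "\t1");
                }
            }
        }
        return rows;
    }

    // Reduce stage: relies on sorted input so that equal keys are adjacent.
    static List<String> reduceStage(List<String> sortedRows) {
        List<String> result = new ArrayList<>();
        String currentKey = null;
        int sum = 0;
        for (String row : sortedRows) {
            String[] cols = row.split("\t");
            // A new key means the previous group is complete: emit it.
            if (currentKey != null && !currentKey.equals(cols[0])) {
                result.add(currentKey + "\t" + sum);
                sum = 0;
            }
            currentKey = cols[0];
            sum += Integer.parseInt(cols[1]);
        }
        if (currentKey != null) {
            result.add(currentKey + "\t" + sum);   // emit the last group
        }
        return result;
    }

    static List<String> run(List<String> lines) {
        List<String> rows = mapStage(lines);
        Collections.sort(rows);   // assumption: stands in for Hive's sort between the stages
        return reduceStage(rows);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("hello world");
        lines.add("hello hive");
        for (String row : run(lines)) {
            System.out.println(row);
        }
    }
}
```

The sort between the two stages is what lets the reduce step work with a single running total: without it, rows for the same word could arrive interleaved and the counts would be split across several output rows.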

  • Original article: https://www.cnblogs.com/mycodingworld/p/hive_mapred_java.html