  • A Brief Introduction to Using Hive

    Specifying delimiters

    Delimiters for Hive output written to files:
    columns are separated by '\001' (ASCII code 1, displayed as ^A in vim); inside a column, as the nesting level increases, the delimiters are '\002', '\003', '\004', and so on.
    Example: if the output columns are of types map<string,string>, array, and int, a row is delimited as:
    key1\003value1\002key2\003value2...\001value1\002value2\002value3...\001<int value of the third column>
    How to specify the delimiters:
    INSERT OVERWRITE DIRECTORY 'xx' ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t' [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char]
    SELECT ...;
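    A concrete instance of the template above might look like the sketch below; the output path, the table my_table, and the columns attrs, tags, score are hypothetical. The three clauses replace the default '\001'/'\002'/'\003' encoding with tabs between columns, ',' between collection items, and ':' inside map entries.

    -- write a map column, an array column and an int column with readable delimiters
    INSERT OVERWRITE DIRECTORY '/tmp/hive_out'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
    SELECT attrs, tags, score FROM my_table;

    With these settings a row is written as, for example, k1:v1,k2:v2<TAB>a,b,c<TAB>42.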

    File upload

    Users can upload auxiliary files with the ADD command, for example data to be loaded into a temporary table or the jar of a user-written UDF. The client currently limits locally uploaded files to 200 MB; for anything larger, first upload the file to HDFS yourself and then run the ADD command with the file given as a full URI, of the form hdfs://host:port/your_file_path.
    Uploading files with ADD (a usage sketch for UDF jars follows the list):

    1. ADD FILE local_filename
    2. ADD CONF local_conf
    3. ADD JAR local_jar
    4. ADD ARCHIVE local.tar.gz/local.tgz
    5. ADD CACHEFILE/CACHEARCHIVE hdfs://host:port/your_hdfs_file
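    As a sketch of how an uploaded jar is typically used afterwards; the jar name my_udf.jar, the class com.example.udf.Lower, the function name my_lower, and the table my_table are hypothetical:

    ADD JAR my_udf.jar;
    CREATE TEMPORARY FUNCTION my_lower AS 'com.example.udf.Lower';
    -- the temporary function is available for the rest of the session
    SELECT my_lower(title) FROM my_table LIMIT 10;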

    Sorting in Hive

    Sort clause          Description
    order by key         Global sort by key. Uses a single reducer, so it is not suitable for very large data volumes; in the production environment a LIMIT must be added, otherwise the query fails with an error.
    sort by key          Sorts by key within each individual reducer only.
    distribute by key    Distributes rows to different reducers by the given field, so each output file contains a distinct set of keys.
    cluster by key       Distributes by key and then sorts within each reducer; equivalent to distribute by + sort by.
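    To make the differences concrete, minimal queries for each clause are sketched below (the table t and the columns key, value are hypothetical):

    -- global total order; single reducer, so keep the result small and add LIMIT in production
    SELECT key, value FROM t ORDER BY key LIMIT 100;

    -- each reducer's output is sorted by key, but there is no global order
    SELECT key, value FROM t SORT BY key;

    -- rows with the same key land in the same reducer; no sorting within a reducer
    SELECT key, value FROM t DISTRIBUTE BY key;

    -- shorthand for DISTRIBUTE BY key SORT BY key
    SELECT key, value FROM t CLUSTER BY key;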

    TRANSFORM syntax

    Hive's TRANSFORM feature starts a user-supplied program as a child process and pipes the input data to that program's standard input (columns separated by '\t', in the same format as the SELECTed result). After processing, the program writes its results to standard output (again '\t'-delimited), which Hive reads back. TRANSFORM can therefore be used to embed custom mapper and reducer programs.
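    As a minimal illustration of this data flow, the identity command /bin/cat below simply copies the tab-delimited rows from standard input back to standard output unchanged; the table t and its columns a and b are hypothetical.

    SELECT TRANSFORM(a, b)
    USING '/bin/cat'
    AS (a2, b2)
    FROM t;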

    Note: the program named in the USING clause must have execute permission, and a program added with ADD FILE does not have the execute bit by default. There are two workarounds:
    a. Invoke it through an interpreter: USING 'bash xx.sh' or USING '/home/sharelib/python/bin/python xx.py'
    b. Make the script executable, pack it into an archive, and add it with ADD ARCHIVE:
    ADD ARCHIVE xx.tar.gz;
    SELECT TRANSFORM(a, b) USING 'xx.tar.gz/xx.sh' AS(a, b) FROM t;  (the script path must be prefixed with the archive name xx.tar.gz, otherwise the file will not be found)

    Example jobs using the TRANSFORM syntax

    Example 1: a mapper-only job that collects click logs for URLs in a given set

    SQL script:

    set mapred.max.split.size=128000000;
    set mapred.job.map.capacity=1000;
    set mapred.map.memory.limit=1000;
    set stream.memory.limit=900;
    set mapred.reduce.tasks=100;
    set mapred.job.name=word_count_demo;
    
    ADD CACHEARCHIVE hdfs://host:port/your_hdfs_path/python2.7.2.tgz;
    ADD FILE pc_vip_1w_norm;
    ADD FILE mapper_clicklog_finder.py;
    ADD FILE url_normalizer_tool_without_protocal;
    ADD FILE mapper_clicklog_finder.sh;
    
    INSERT OVERWRITE DIRECTORY "hdfs://host:port/your_hdfs_path/click_data_1mon_norm" 
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
    SELECT TRANSFORM(event_click_target_url, event_useragent)
    USING 'sh mapper_clicklog_finder.sh'
    AS(url, ua)
    FROM udw.udw_event
    WHERE event_action = 'query_click' AND event_day >= '20180515' AND event_day <= '20180615'
    CLUSTER BY url;
    

    Contents of mapper_clicklog_finder.sh:

    #!/bin/bash
    
    chmod a+x url_normalizer_tool_without_protocal
    
    ./url_normalizer_tool_without_protocal 2>/dev/null \
            | python2.7.2.tgz/python2.7.2/bin/python mapper_clicklog_finder.py pc_vip_1w_norm
    

    Contents of mapper_clicklog_finder.py:

    import sys

    def load_url_set(filename):
        # Load the target URL set (one normalized URL per line) into memory.
        urlset = set()
        fp = open(filename, 'rt')

        for line in fp:
            url = line.strip()
            urlset.add(url)

        fp.close()
        return urlset

    if __name__ == '__main__':
        filename = sys.argv[1]
        urlset = load_url_set(filename)

        # Each input row is "<click url>\t<user agent>"; keep only URLs in the set.
        for line in sys.stdin:
            fields = line.strip().split('\t')
            if len(fields) != 2:
                continue
            clickurl = fields[0]
            ua_str = fields[1]
            if clickurl in urlset:
                print('%s\t%s' % (clickurl, ua_str))
    
    Example 2: a mapper + reducer job that counts occurrences of each URL

    SQL script:

    set mapred.reduce.tasks=10;
    
    ADD CACHEARCHIVE hdfs://host:port/your_hdfs_path/python2.7.2.tgz;
    ADD FILE mapper_task.py;
    ADD FILE reducer_task.py;
    
    FROM (
        FROM default.hpb_dws_bhv_news_1d
        SELECT TRANSFORM(rid, r_url) 
        USING 'python2.7.2.tgz/python2.7.2/bin/python mapper_task.py' 
        AS(url, cnt) 
        WHERE event_day >= '20180611' AND event_day <= '20180617' AND r_url != '' 
        CLUSTER BY url) mapper_output
    INSERT OVERWRITE DIRECTORY "hdfs://host:port/your_hdfs_path/click_data_1mon" 
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
    SELECT TRANSFORM(mapper_output.url, mapper_output.cnt) USING 'python2.7.2.tgz/python2.7.2/bin/python reducer_task.py' AS (url, cnt);
    

    mapper_task.py:

    import sys

    # Emit "<url>\t1" for every input record "<rid>\t<url>".
    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        rid, r_url = fields
        sys.stdout.write('%s\t1\n' % r_url)
    

    reducer_task.py:

    import sys

    # Input rows are grouped by url (CLUSTER BY url), so a streaming
    # aggregation over consecutive identical urls is sufficient.
    url_pre = None
    cnt_pre = 0

    for line in sys.stdin:
        fields = line.strip().split('\t')
        url = fields[0]
        cnt = int(fields[1])
        if url != url_pre:
            # A new url begins: emit the previous group's count and reset.
            if url_pre:
                print('%s\t%d' % (url_pre, cnt_pre))
            cnt_pre = 0
        url_pre = url
        cnt_pre += cnt

    # Flush the last group.
    if url_pre:
        print('%s\t%d' % (url_pre, cnt_pre))
    

    For detailed documentation on TRANSFORM, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
