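  • Embedding Python scripts in Hive statements (map and reduce, with a left outer join)

    Hive statements can run scripts (such as Python or shell) as the map and reduce steps: use the TRANSFORM clause (or the MAP and REDUCE keywords) together with ADD FILE, which ships the script file with the job.

    See: http://www.coder4.com/archives/4052


    The AS keyword before an alias is optional; a space is enough, e.g.: table app_stats t1, app_data t2;

    A small example first: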

    add file ${python_script_path}/lanch_interval_count.py;
    drop table temp_lanch_interval2;
    create table temp_lanch_interval2 as
    select reportdate, appid, channelname, app_version, deviceid, ts, sameday
    from
    (
      from
      (
        from
        (
          select fl.reportdate, fl.appid, 1 as app_version, fn.channelname, fl.deviceid, fl.linux_time
          from (select reportdate, appid, app_version, deviceid, linux_time from factloglanch where dt >= ? and dt <= ?) fl
          left outer join factnewuser_nodimid fn on (fl.deviceid = fn.deviceid and fl.appid = fn.appid)
        ) a
        map reportdate, appid, channelname, app_version, deviceid, linux_time using '/bin/cat'
        as reportdate, appid, channelname, app_version, deviceid, linux_time
        cluster by appid, channelname, deviceid
      ) b
      reduce reportdate, appid, channelname, app_version, deviceid, linux_time using 'lanch_interval_count.py'
      as reportdate, appid, app_version, channelname, deviceid, ts, sameday
    ) c
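The reducer script `lanch_interval_count.py` is not shown in the post. The following is only a rough sketch of what a reducer of this shape might look like. It assumes that `ts` is the number of seconds since the same device's previous launch and that `sameday` flags whether both launches share a `reportdate`; both meanings are guesses from the column names, and every function name here is illustrative. Since `cluster by` only guarantees that rows with equal keys arrive next to each other, the sketch sorts each group by `linux_time` itself.

```python
import sys
from itertools import groupby


def parse(line):
    """Split one tab-delimited input row into its six columns."""
    return line.rstrip("\n").split("\t")


def group_key(row):
    # Same columns as the CLUSTER BY clause: appid, channelname, deviceid.
    return (row[1], row[2], row[4])


def launch_intervals(lines):
    """Yield output rows: reportdate, appid, app_version, channelname,
    deviceid, ts, sameday (the meaning of ts and sameday is assumed)."""
    rows = (parse(line) for line in lines if line.strip())
    for _, group in groupby(rows, key=group_key):
        # CLUSTER BY only makes equal keys adjacent; order by time ourselves.
        launches = sorted(group, key=lambda r: int(r[5]))
        for prev, cur in zip(launches, launches[1:]):
            reportdate, appid, channelname, app_version, deviceid, t = cur
            ts = int(t) - int(prev[5])
            sameday = 1 if prev[0] == reportdate else 0
            yield "\t".join(
                [reportdate, appid, app_version, channelname, deviceid,
                 str(ts), str(sameday)]
            )


def main():
    # Hive pipes the clustered rows to the script on stdin.
    for out in launch_intervals(sys.stdin):
        print(out)

# In the deployed script, call main() here.
```

A device with a single launch produces no output row in this sketch, because there is no previous launch to measure against.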

    For a detailed explanation, here is an excellent blog post: http://www.coder4.com/archives/4052

    TRANSFORM in Hive: using scripts for Map/Reduce

    hive> select * from test;
    OK
    1       3
    2       2
    3       1


    Suppose we want the MD5 of every column. Hive (at least in the versions current when this was written) has no UDF for that, so we use a Python script:

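    #!/home/tops/bin/python
    # Reads tab-separated rows from stdin and writes the MD5 hex digest of each column.
    import sys
    import hashlib

    for line in sys.stdin:
        # Hive sends one row per line, with columns separated by tabs.
        arr = line.rstrip("\n").split("\t")
        md5_arr = []
        for a in arr:
            # md5() needs bytes on Python 3, so encode the column text first.
            md5_arr.append(hashlib.md5(a.encode("utf-8")).hexdigest())
        # Join with tabs so that Hive can split the output back into columns.
        print("\t".join(md5_arr))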
     
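    To use a script (Python, shell, etc.) in Hive, first register it:
    add file /xxxx/test.py

    Then call it in the query with the TRANSFORM syntax:
    SELECT
        TRANSFORM (col1, col2) USING './test.py' AS (new1, new2)
    FROM test;
    Here AS names the output columns. If it is omitted, Hive treats everything before the first tab as the key and the rest as the value.
    Note: TRANSFORM uses the tab character (\t) as its field delimiter. By default, rows must be tab-separated both when passed to the script and when read back from it; no other delimiter is used unless a ROW FORMAT clause is given.
    This can cause trouble in combination with INSERT [OVERWRITE] TABLE: if the target table's delimiter is not a tab but something else, such as ';',
    the result will be wrong.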
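To make the delimiter convention concrete, here is a minimal sketch of how a TRANSFORM script might read and write rows. It relies on two documented Hive behaviours: columns are separated by tabs, and NULL is exchanged as the literal string `\N`. The helper names and the column-swapping transformation are purely illustrative.

```python
import sys

NULL = "\\N"  # Hive passes NULL columns to the script as the literal string \N


def decode_row(line):
    """Turn one tab-delimited input line into a list of values (None for NULL)."""
    fields = line.rstrip("\n").split("\t")
    return [None if f == NULL else f for f in fields]


def encode_row(values):
    """Join output values with tabs, writing None back as the NULL marker."""
    return "\t".join(NULL if v is None else str(v) for v in values)


def swap_columns(line):
    """Example transformation: emit the two input columns in reverse order."""
    col1, col2 = decode_row(line)
    return encode_row([col2, col1])


def main():
    # Hive pipes one row per line on stdin and reads rows back from stdout.
    for line in sys.stdin:
        print(swap_columns(line))

# In the deployed script, call main() here.
```

Splitting on `"\t"` rather than on arbitrary whitespace keeps column values that contain spaces intact.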

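    Using the MAP and REDUCE keywords directly:

    The syntax is SELECT MAP (…) USING 'xx.py'.
    MAP and REDUCE are merely aliases for TRANSFORM; Hive does not guarantee that the script is actually invoked in the map or reduce phase. Here is what the official documentation says: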
    Formally, MAP ... and REDUCE ... are syntactic transformations of SELECT TRANSFORM ( ... ). In
    other words, they serve as comments or notes to the reader of the query.
    BEWARE: Use of these keywords may be dangerous as (e.g.) typing "REDUCE" does not force a reduce phase
    to occur and typing "MAP" does not force a new map phase!
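    Mixing the MAP and REDUCE keywords can therefore be confusing, so it is advisable to use TRANSFORM throughout.
    If the command is not a script file but a tool already installed on the system, such as awk or sed, it can be used directly (no ADD FILE needed), for example: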
    map reportdate, appid, channelname,app_version, deviceid,linux_time  using '/bin/cat'
         as reportdate, appid, channelname,app_version, deviceid,linux_time
         cluster by appid, channelname,deviceid
     
    If the table contains complex types such as MAP or ARRAY, for example:
    CREATE TABLE features
    (
        id BIGINT,
        norm_features MAP<STRING, FLOAT>
    );
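    To handle such columns with TRANSFORM, make the script's output match the corresponding format. In Python that means printing the value in that format, with complex types written using their own delimiters,
    e.g. the key-value separator of a MAP type.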
    SELECT TRANSFORM(stuff)
    USING 'script'
    AS (thing1 INT, thing2 MAP<STRING, FLOAT>)
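As an illustration, a script feeding the `AS (thing1 INT, thing2 MAP<STRING, FLOAT>)` clause above might build its output line as follows. This is a sketch under an assumption: that Hive's default nested delimiters apply, `\002` between map entries and `\003` between a key and its value. Those defaults should be checked against the Hive version in use. The function names are hypothetical.

```python
ITEM_SEP = "\x02"  # assumed default separator between map entries
KV_SEP = "\x03"    # assumed default separator between a key and its value


def format_map(d):
    """Serialize a dict as key<KV_SEP>value pairs joined by ITEM_SEP."""
    return ITEM_SEP.join(f"{k}{KV_SEP}{v}" for k, v in d.items())


def format_row(thing1, thing2):
    """Build one output line: an INT column, a tab, then the MAP column."""
    return f"{thing1}\t{format_map(thing2)}"
```

The script would then `print(format_row(...))` once per output row, with the columns themselves still separated by tabs.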
     
     
     

  • Original article: https://www.cnblogs.com/cl1024cl/p/6205462.html