zoukankan      html  css  js  c++  java
  • [Hive_add_11] Hive 使用 UDTF 实现日志降维


    0. 说明

      对日志进行降维处理,将日志分为几个小表

      通过编写 UDTF ,对日志降维,将日志聚合体相关字段抽取出来,形成新表。


    1. 操作流程

      1.0 日志部分内容

    1532868065.518#192.168.23.1#1532868067548#200#{"appChannel":"appstore","appErrorLogs":

    [{"createdAtMs":1530455040000,"errorBrief":"at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)",
    "errorDetail":"at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67) at sun.reflect.DelegatingMethodAccessorImpl.
    invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606)"},{"createdAtMs":1530393180000,
    "errorBrief":"at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)","errorDetail":
    "at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)"}
    ],"appEventLogs":[{"createdAtMs":1530412800000,"eventId":
    "share","logType":"event","mark":"4","musicID":"傲红尘"}
    ],"appPageLogs":[{"createdAtMs":1530449520000,
    "logType":"page","nextPage":"list.html","pageId":"list.html","pageViewCntInSession":0,"visitIndex":"2"}
    ],
    "appPlatform":"ios","appStartupLogs":[{"brand":"联想","carrier":"中国联通","country":"china","createdAtMs":
    1530385560000,"logType":"startup","network":"cell","province":"hebei","screenSize":"960 * 640"},
    {"brand":"魅族","carrier":"中国铁通","country":"america","createdAtMs":1530412860000,"logType":"startup",
    "network":"3g","province":"guangxi","screenSize":"480 * 320"}
    ],"appUsageLogs":[{"createdAtMs":1530379200000,
    "logType":"usage","singleDownloadTraffic":"12800","singleUploadTraffic":"128","singleUseDurationSecs":"123"}
    ],
    "appVersion":"1.0.0","deviceId":"Device000099","deviceStyle":"oppo 1","osType":"1.4.0"}

      1.1 创建 logAgg表

      创建 logAgg表,分区表 => year, month, day

        create table logAgg(serverTime string,remoteIp string,clientTime string,status string, json string)
        partitioned by(year string, month string, day string)
        row format delimited
        fields terminated by '#' ;

      1.2 load 数据到 logAgg表

      load data local inpath '/home/centos/files/2018-07-01.log' into table logagg partition(year='2018',month='07',day='01');

      1.3 降维处理

      1. 代码编写

      ParseJsonUtil.java

      ParseEvent.java

      2. 上传并同步

      先打包再放入 /soft/hive/lib 中

        cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/
    xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar

      3. 注册临时函数

        create temporary function parseEvent as 'com.share.udtf.ParseEvent';

      4.测试

        select parseEvent(json) from logAgg;

      1.4 创建 logEvent表

        create table logevent(deviceId string, createdAtMs string, eventId string, logType string , mark string, musicID string)
        stored as parquet tblproperties('parquet.compression'='GZIP');

      1.5 转储

        insert into logevent 
        select parseEvent(json) from logagg
        where year='2018' and month='07' and day='01';

      1.6 对 logEvent表进行操作

      1. 计算每个用户对每首歌的评分

    select deviceid, musicid, sum(cast(mark as int)) as sum
    from logevent
    where musicId is not null
    group by deviceid, musicid;

      2. 计算每个用户对每首歌的评分与最高评分

    select deviceid , musicid, sum, max(sum)over(partition by deviceid) as sum2
    from (
    select deviceid, musicid, sum(cast(mark as int)) as sum
    from logevent where musicId is not null group by deviceid, musicid
    )a;

      3. 使用 sql 计算出,每个用户最喜欢(评分最高)的歌曲,及其评分

    select deviceid , musicid, sum from (
    select deviceid , musicid, sum, max(sum)over(partition by deviceid) as sum2
    from (
    select deviceid, musicid, sum(cast(mark as int)) as sum
    from logevent where musicId is not null group by deviceid, musicid
    )a
    )b 
    where sum=sum2 ;

    2. 附加内容 

      2.1 Hive 加载 jar 方式

      1. 将 jar 放在 Hive 的 lib 下,并重启 Hive
        /soft/hive/lib

      2. 通过配置文件指定 jar,并重启 Hive
        hive.aux.jars.path=/x/x/x.jar

      3. 临时加载 jar
        hive> add jar /x/x/x.jar

      2.2 注册函数方法

      1. create function as '';    // 永久

      2. create temporary function as '';    // 临时

      3. create function xxx as '' using jar 'hdfs://mycluster/xxx.jar';    // 将 jar 包放入 HDFS 中,避免重启


  • 相关阅读:
    微信小程序实战练习(仿五洲到家微信版)
    vue2.0项目 calendar.js(日历组件封装)
    基于thinkphp的后台管理系统模板快速搭建
    你不知道的javascript(上卷)读后感(二)
    你不知道的javascript(上卷)读后感(一)
    教你10分钟搭建酷炫的个人博客
    Webpack学习-工作原理(下)
    Webpack学习-工作原理(上)
    Css Secret 案例Demo全套
    基于excel导入数据到ms sql server
  • 原文地址:https://www.cnblogs.com/share23/p/10361382.html
Copyright © 2011-2022 走看看