  • Hive interview question (reposted)

    Reposted from: http://blog.csdn.net/ningguixin/article/details/12852051

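    There is a very large table, TRLOG, of roughly 2 TB.
    TRLOG:
    -- Tab-delimited: CLICK_TIME itself contains a space, so a space delimiter would split it.
    CREATE TABLE TRLOG
    (PLATFORM string,
    USER_ID int,
    CLICK_TIME string,
    CLICK_URL string)
    row format delimited
    fields terminated by '\t';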

    Sample data:
    PLATFORM USER_ID CLICK_TIME CLICK_URL
    WEB 12332321 2013-03-21 13:48:31.324 /home/
    WEB 12332321 2013-03-21 13:48:32.954 /selectcat/er/
    WEB 12332321 2013-03-21 13:48:46.365 /er/viewad/12.html
    WEB 12332321 2013-03-21 13:48:53.651 /er/viewad/13.html
    WEB 12332321 2013-03-21 13:49:13.435 /er/viewad/24.html
    WEB 12332321 2013-03-21 13:49:35.876 /selectcat/che/
    WEB 12332321 2013-03-21 13:49:56.398 /che/viewad/93.html
    WEB 12332321 2013-03-21 13:50:03.143 /che/viewad/10.html
    WEB 12332321 2013-03-21 13:50:34.265 /home/
    WAP 32483923 2013-03-21 23:58:41.123 /m/home/
    WAP 32483923 2013-03-21 23:59:16.123 /m/selectcat/fang/
    WAP 32483923 2013-03-21 23:59:45.123 /m/fang/33.html
    WAP 32483923 2013-03-22 00:00:23.984 /m/fang/54.html
    WAP 32483923 2013-03-22 00:00:54.043 /m/selectcat/er/
    WAP 32483923 2013-03-22 00:01:16.576 /m/er/49.html
    …… …… …… ……

    The data above must be transformed into a table ALLOG with the following structure:
    CREATE TABLE ALLOG
    (PLATFORM string,
    USER_ID int,
    SEQ int,
    FROM_URL string,
    TO_URL string)
    row format delimited
    fields terminated by ' ';

    Data after the transformation:
    PLATFORM USER_ID SEQ FROM_URL TO_URL
    WEB 12332321 1 NULL /home/
    WEB 12332321 2 /home/ /selectcat/er/
    WEB 12332321 3 /selectcat/er/ /er/viewad/12.html
    WEB 12332321 4 /er/viewad/12.html /er/viewad/13.html
    WEB 12332321 5 /er/viewad/13.html /er/viewad/24.html
    WEB 12332321 6 /er/viewad/24.html /selectcat/che/
    WEB 12332321 7 /selectcat/che/ /che/viewad/93.html
    WEB 12332321 8 /che/viewad/93.html /che/viewad/10.html
    WEB 12332321 9 /che/viewad/10.html /home/
    WAP 32483923 1 NULL /m/home/
    WAP 32483923 2 /m/home/ /m/selectcat/fang/
    WAP 32483923 3 /m/selectcat/fang/ /m/fang/33.html
    WAP 32483923 4 /m/fang/33.html /m/fang/54.html
    WAP 32483923 5 /m/fang/54.html /m/selectcat/er/
    WAP 32483923 6 /m/selectcat/er/ /m/er/49.html
    …… …… …… ……
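    PLATFORM and USER_ID still denote the platform and the user ID. SEQ is the position of a visit among that user's visits ordered by time, and FROM_URL and TO_URL record which page the user came from and which page they went to. For the first visit of a given user on a given platform, FROM_URL is NULL.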
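
    As a minimal in-memory sketch of the transformation just described (not one of the original answers), the Java below groups rows by platform and user, sorts each group by click time, and emits SEQ, FROM_URL and TO_URL. The class name ClickPathSketch and the method toTransitions are hypothetical names introduced here, and rows are plain String arrays for brevity. It only illustrates the target semantics and says nothing about processing 2 TB.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClickPathSketch {

    /**
     * Input rows:  {platform, userId, clickTime, clickUrl}.
     * Output rows: {platform, userId, seq, fromUrl, toUrl}; fromUrl is null
     * for the first click of each (platform, userId) pair.
     */
    public static List<String[]> toTransitions(List<String[]> clicks) {
        // Group by (platform, userId), keeping groups in first-seen order.
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] c : clicks) {
            groups.computeIfAbsent(c[0] + "\t" + c[1], k -> new ArrayList<>()).add(c);
        }
        List<String[]> out = new ArrayList<>();
        for (List<String[]> group : groups.values()) {
            // Timestamps such as "2013-03-21 13:48:31.324" sort correctly as text.
            group.sort(Comparator.comparing((String[] c) -> c[2]));
            String previousUrl = null;
            int seq = 0;
            for (String[] c : group) {
                seq++;
                out.add(new String[] {c[0], c[1], String.valueOf(seq), previousUrl, c[3]});
                previousUrl = c[3];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> clicks = new ArrayList<>();
        clicks.add(new String[] {"WEB", "12332321", "2013-03-21 13:48:32.954", "/selectcat/er/"});
        clicks.add(new String[] {"WAP", "32483923", "2013-03-21 23:58:41.123", "/m/home/"});
        clicks.add(new String[] {"WEB", "12332321", "2013-03-21 13:48:31.324", "/home/"});
        for (String[] row : toTransitions(clicks)) {
            String from = row[3] == null ? "NULL" : row[3];
            System.out.println(String.join(" ", row[0], row[1], row[2], from, row[4]));
        }
    }
}
```

    With the three sample rows in main, the two WEB rows come out as SEQ 1 (FROM_URL NULL, TO_URL /home/) and SEQ 2 (/home/ to /selectcat/er/), followed by the single WAP row with SEQ 1.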


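    The interviewer asked for two solutions:
    1. Implement a Hive Generic UDF that speeds up the processing above, and give the Hive SQL that uses this UDF to carry out the ETL.

    2. Implement the ETL in pure Hive SQL, generating ALLOG from TRLOG (the result is a set of SQL statements).

    Answers: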

    1.

    UDF

    package org.apache.hadoop.hive.udf;

    import java.util.Arrays;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.hive.ql.udf.UDFType;

    /**
     * Returns a counter that restarts at 1 whenever the key columns passed in
     * differ from those of the previous row. The result is only meaningful if
     * all rows of one key reach the same task, in the desired order.
     * Note: the question asks for a GenericUDF; this answer uses the simpler UDF base class.
     */
    @UDFType(deterministic = false, stateful = true)
    public class RowNumber extends UDF {

        // Key columns of the previous row; null until the first row is seen.
        // Instance fields (not static) so that two call sites do not share a counter.
        private String[] previousKey;
        private int rowNum;

        public int evaluate(Object... args) {
            String[] key = new String[args.length];
            for (int i = 0; i < args.length; i++) {
                // String.valueOf tolerates NULL columns, unlike args[i].toString().
                key[i] = String.valueOf(args[i]);
            }
            if (!Arrays.equals(key, previousKey)) {
                // The key changed: remember it and restart the numbering.
                previousKey = key;
                rowNum = 0;
            }
            return ++rowNum;
        }

        // Quick local check: the same key five times prints 1 through 5.
        public static void main(String[] args) {
            RowNumber rowNumber = new RowNumber();
            for (int i = 0; i < 5; i++) {
                System.out.println(rowNumber.evaluate("12332321"));
            }
        }
    }
    

      

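    -- After ADD JAR, register the UDF: CREATE TEMPORARY FUNCTION RowNumber AS 'org.apache.hadoop.hive.udf.RowNumber';
    -- DISTRIBUTE BY / SORT BY is meant to bring each user's rows to one reducer in click_time order before numbering.
    INSERT OVERWRITE TABLE ALLOG
    SELECT t1.platform, t1.user_id, t1.seq, t2.click_url AS from_url, t1.click_url AS to_url
    FROM
      (SELECT s.platform, s.user_id, s.click_url, RowNumber(s.platform, s.user_id) AS seq
       FROM (SELECT * FROM trlog DISTRIBUTE BY platform, user_id SORT BY platform, user_id, click_time) s) t1
    LEFT OUTER JOIN
      (SELECT s.platform, s.user_id, s.click_url, RowNumber(s.platform, s.user_id) AS seq
       FROM (SELECT * FROM trlog DISTRIBUTE BY platform, user_id SORT BY platform, user_id, click_time) s) t2
    ON t1.platform = t2.platform AND t1.user_id = t2.user_id AND t1.seq = t2.seq + 1;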

    2.

    -- seq = number of the same user's clicks on the same platform that are not later than this one.
    INSERT OVERWRITE TABLE ALLOG
    SELECT t1.platform, t1.user_id, t1.seq, t2.click_url AS from_url, t1.click_url AS to_url
    FROM
      (SELECT a.platform, a.user_id, a.click_time, a.click_url, count(1) AS seq
       FROM trlog a LEFT OUTER JOIN trlog b ON a.platform = b.platform AND a.user_id = b.user_id
       WHERE a.click_time >= b.click_time
       GROUP BY a.platform, a.user_id, a.click_time, a.click_url) t1
    LEFT OUTER JOIN
      (SELECT a.platform, a.user_id, a.click_time, a.click_url, count(1) AS seq
       FROM trlog a LEFT OUTER JOIN trlog b ON a.platform = b.platform AND a.user_id = b.user_id
       WHERE a.click_time >= b.click_time
       GROUP BY a.platform, a.user_id, a.click_time, a.click_url) t2
    ON t1.platform = t2.platform AND t1.user_id = t2.user_id AND t1.seq = t2.seq + 1;

    Key point used:

    left outer join: every row of the left table is kept; rows of the right table appear only where the join condition matches.
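
    The pure-SQL answer obtains SEQ without any ranking function: each row is paired with the rows of the same user, and the pairs whose other click is not later are counted. A minimal Java sketch of that counting idea follows. The names RankByCountingSketch and rankByCounting are hypothetical, and it assumes the timestamps of one user are unique.

```java
import java.util.Arrays;
import java.util.List;

public class RankByCountingSketch {

    /**
     * Returns the 1-based rank of each timestamp among the timestamps of one
     * user: the number of timestamps less than or equal to it. This mirrors
     * count(1) over a self-join filtered by click_time >= click_time1.
     */
    public static int[] rankByCounting(List<String> clickTimes) {
        int[] seq = new int[clickTimes.size()];
        for (int i = 0; i < clickTimes.size(); i++) {
            for (String other : clickTimes) {
                if (clickTimes.get(i).compareTo(other) >= 0) {
                    seq[i]++;
                }
            }
        }
        return seq;
    }

    public static void main(String[] args) {
        List<String> times = Arrays.asList(
                "2013-03-21 13:48:46.365",
                "2013-03-21 13:48:31.324",
                "2013-03-21 13:48:32.954");
        System.out.println(Arrays.toString(rankByCounting(times))); // prints [3, 1, 2]
    }
}
```

    The nested loop makes the cost quadratic in the number of clicks per user, which is the work a sorted, single-pass counter avoids. Two clicks with an identical timestamp would receive the same rank here, just as they would in the SQL.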

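    3. Text processing like this also brings awk in the shell to mind.

    Using awk's associative arrays:

    awk -F'\t' 'BEGIN{OFS="\t"} {k=$1 FS $2; n[k]++; print n[k],$1,$2,$3,prev[k],$4; prev[k]=$4}' url.txt
    

    OFS is the output field separator. Two arrays are used, both keyed by platform and user ID: n counts the rows seen so far, and prev holds the previous URL. The input must be tab-delimited and already sorted by click time within each user.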

    Output:

    1	WEB	12332321	2013-03-21 13:48:31.324		/home/
    2	WEB	12332321	2013-03-21 13:48:32.954	/home/	/selectcat/er/
    3	WEB	12332321	2013-03-21 13:48:46.365	/selectcat/er/	/er/viewad/12.html
    4	WEB	12332321	2013-03-21 13:48:53.651	/er/viewad/12.html	/er/viewad/13.html
    5	WEB	12332321	2013-03-21 13:49:13.435	/er/viewad/13.html	/er/viewad/24.html
    6	WEB	12332321	2013-03-21 13:49:35.876	/er/viewad/24.html	/selectcat/che/
    7	WEB	12332321	2013-03-21 13:49:56.398	/selectcat/che/	/che/viewad/93.html
    8	WEB	12332321	2013-03-21 13:50:03.143	/che/viewad/93.html	/che/viewad/10.html
    9	WEB	12332321	2013-03-21 13:50:34.265	/che/viewad/10.html	/home/
    1	WAP	32483923	2013-03-21 23:58:41.123		/m/home/
    2	WAP	32483923	2013-03-21 23:59:16.123	/m/home/	/m/selectcat/fang/
    3	WAP	32483923	2013-03-21 23:59:45.123	/m/selectcat/fang/	/m/fang/33.html
    4	WAP	32483923	2013-03-22 00:00:23.984	/m/fang/33.html	/m/fang/54.html
    5	WAP	32483923	2013-03-22 00:00:54.043	/m/fang/54.html	/m/selectcat/er/
    6	WAP	32483923	2013-03-22 00:01:16.576	/m/selectcat/er/	/m/er/49.html
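
    The single-pass idea behind the awk one-liner can also be sketched in Java: read the lines in order and keep, per key, a running count and the last URL seen. This is a sketch under the same assumptions (tab-separated input, already sorted by click time within each user). StreamingSeqSketch and annotate are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamingSeqSketch {

    /**
     * Takes tab-separated lines (platform, user_id, click_time, click_url)
     * and returns tab-separated lines of
     * seq, platform, user_id, click_time, from_url, to_url.
     * from_url is empty for the first line of each (platform, user_id) key.
     */
    public static List<String> annotate(List<String> lines) {
        Map<String, Integer> count = new HashMap<>();      // rows seen so far per key
        Map<String, String> previousUrl = new HashMap<>(); // last URL seen per key
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split("\t", -1);
            String key = f[0] + "\t" + f[1];
            int seq = count.merge(key, 1, Integer::sum);
            String from = previousUrl.getOrDefault(key, "");
            out.add(String.join("\t", String.valueOf(seq), f[0], f[1], f[2], from, f[3]));
            previousUrl.put(key, f[3]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("WEB\t12332321\t2013-03-21 13:48:31.324\t/home/");
        lines.add("WEB\t12332321\t2013-03-21 13:48:32.954\t/selectcat/er/");
        lines.add("WAP\t32483923\t2013-03-21 23:58:41.123\t/m/home/");
        annotate(lines).forEach(System.out::println);
    }
}
```

    Because the state is keyed by platform and user together, interleaved users do not disturb each other's counters, but the order of clicks within one user still has to be correct on input.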
    

      

  • Original article: https://www.cnblogs.com/ggbond1988/p/4807816.html