  • Repost: Hive interview question

    There is a very large table, TRLOG.
    The table is roughly 2 TB in size.
    TRLOG:
    CREATE TABLE TRLOG
    (PLATFORM string,
    USER_ID int,
    CLICK_TIME string,
    CLICK_URL string)
    row format delimited
    fields terminated by ' ';

    Data:
    PLATFORM USER_ID CLICK_TIME CLICK_URL
    WEB 12332321 2013-03-21 13:48:31.324 /home/
    WEB 12332321 2013-03-21 13:48:32.954 /selectcat/er/
    WEB 12332321 2013-03-21 13:48:46.365 /er/viewad/12.html
    WEB 12332321 2013-03-21 13:48:53.651 /er/viewad/13.html
    WEB 12332321 2013-03-21 13:49:13.435 /er/viewad/24.html
    WEB 12332321 2013-03-21 13:49:35.876 /selectcat/che/
    WEB 12332321 2013-03-21 13:49:56.398 /che/viewad/93.html
    WEB 12332321 2013-03-21 13:50:03.143 /che/viewad/10.html
    WEB 12332321 2013-03-21 13:50:34.265 /home/
    WAP 32483923 2013-03-21 23:58:41.123 /m/home/
    WAP 32483923 2013-03-21 23:59:16.123 /m/selectcat/fang/
    WAP 32483923 2013-03-21 23:59:45.123 /m/fang/33.html
    WAP 32483923 2013-03-22 00:00:23.984 /m/fang/54.html
    WAP 32483923 2013-03-22 00:00:54.043 /m/selectcat/er/
    WAP 32483923 2013-03-22 00:01:16.576 /m/er/49.html
    …… …… …… ……

    The data above needs to be transformed into a table ALLOG with the following structure:
    CREATE TABLE ALLOG
    (PLATFORM string,
    USER_ID int,
    SEQ int,
    FROM_URL string,
    TO_URL string)
    row format delimited
    fields terminated by ' ';

    Data after the transformation:
    PLATFORM USER_ID SEQ FROM_URL TO_URL
    WEB 12332321 1 NULL /home/
    WEB 12332321 2 /home/ /selectcat/er/
    WEB 12332321 3 /selectcat/er/ /er/viewad/12.html
    WEB 12332321 4 /er/viewad/12.html /er/viewad/13.html
    WEB 12332321 5 /er/viewad/13.html /er/viewad/24.html
    WEB 12332321 6 /er/viewad/24.html /selectcat/che/
    WEB 12332321 7 /selectcat/che/ /che/viewad/93.html
    WEB 12332321 8 /che/viewad/93.html /che/viewad/10.html
    WEB 12332321 9 /che/viewad/10.html /home/
    WAP 32483923 1 NULL /m/home/
    WAP 32483923 2 /m/home/ /m/selectcat/fang/
    WAP 32483923 3 /m/selectcat/fang/ /m/fang/33.html
    WAP 32483923 4 /m/fang/33.html /m/fang/54.html
    WAP 32483923 5 /m/fang/54.html /m/selectcat/er/
    WAP 32483923 6 /m/selectcat/er/ /m/er/49.html
    …… …… …… ……
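    PLATFORM and USER_ID still denote the platform and the user ID. SEQ is the position of the visit in the user's time-ordered click sequence, and FROM_URL and TO_URL are the pages the user navigated from and to. For the first visit of a given user on a given platform, FROM_URL is NULL.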
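The target semantics can be checked on a small in-memory sample before writing any Hive code. The following is a minimal Java sketch, not part of the original post: the class `ClickPathSketch`, the method `toAllog`, and the `String[]` row layout are hypothetical names chosen for illustration. It sorts rows by (PLATFORM, USER_ID, CLICK_TIME), numbers them per platform and user, and carries the previous URL forward as FROM_URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the TRLOG -> ALLOG transformation in memory.
final class ClickPathSketch {

    // Input rows:  {platform, userId, clickTime, clickUrl}
    // Output rows: {platform, userId, seq, fromUrl (null for the first click), toUrl}
    static List<String[]> toAllog(List<String[]> trlog) {
        List<String[]> sorted = new ArrayList<>(trlog);
        // Timestamps of the form yyyy-MM-dd HH:mm:ss.SSS order correctly as plain strings.
        sorted.sort((a, b) -> {
            int c = a[0].compareTo(b[0]);
            if (c == 0) c = a[1].compareTo(b[1]);
            if (c == 0) c = a[2].compareTo(b[2]);
            return c;
        });

        List<String[]> out = new ArrayList<>();
        String[] prev = null;
        int seq = 0;
        for (String[] row : sorted) {
            // A new (platform, userId) pair restarts the sequence at 1.
            boolean sameUser = prev != null
                    && prev[0].equals(row[0]) && prev[1].equals(row[1]);
            seq = sameUser ? seq + 1 : 1;
            String fromUrl = sameUser ? prev[3] : null;
            out.add(new String[] {row[0], row[1], String.valueOf(seq), fromUrl, row[3]});
            prev = row;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> trlog = Arrays.asList(
            new String[] {"WEB", "12332321", "2013-03-21 13:48:32.954", "/selectcat/er/"},
            new String[] {"WEB", "12332321", "2013-03-21 13:48:31.324", "/home/"},
            new String[] {"WAP", "32483923", "2013-03-21 23:58:41.123", "/m/home/"});
        for (String[] row : toAllog(trlog)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

A single sorted pass like this is what both Hive answers have to reproduce at scale; the difficulty in Hive is guaranteeing the grouping and ordering across distributed tasks.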


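    The interviewer asked for two solutions:
    1. Implement a Hive Generic UDF that speeds up the processing above, and give the Hive SQL that uses this UDF to implement the ETL process.

    2. Implement the ETL process in pure Hive SQL, generating the ALLOG table from the TRLOG table. (The result is a set of SQL statements.)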

    Answer:

    1.

    UDF

    package org.apache.hadoop.hive.udf;

    import java.util.Arrays;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.hive.ql.udf.UDFType;

    // Numbers consecutive rows that share the same key columns: 1, 2, 3, ...
    // The counter restarts at 1 whenever any key column changes, so the rows must
    // reach the UDF grouped by key and already sorted in the desired order.
    // Note: this extends the simple UDF base class rather than GenericUDF.
    @UDFType(deterministic = false, stateful = true)
    public class RowNumber extends UDF {

        // Per-instance state, so that two uses of the function in the same
        // task do not overwrite each other's counters.
        private String[] comparedColumn; // key columns of the previous row
        private int rowNum = 1;

        public int evaluate(Object... args) {
            String[] columnValue = new String[args.length];
            for (int i = 0; i < args.length; i++) {
                columnValue[i] = String.valueOf(args[i]); // null-safe conversion
            }
            // First row, or key changed: remember the new key and restart at 1.
            if (!Arrays.equals(columnValue, comparedColumn)) {
                comparedColumn = columnValue;
                rowNum = 1;
            }
            return rowNum++;
        }

        public static void main(String[] args) {
            RowNumber aRowNumber = new RowNumber();
            System.out.println(aRowNumber.evaluate("12332321"));
            System.out.println(aRowNumber.evaluate("12332321"));
            System.out.println(aRowNumber.evaluate("12332321"));
            System.out.println(aRowNumber.evaluate("12332321"));
            System.out.println(aRowNumber.evaluate("12332321"));
        }

    }



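    -- Assumes the jar containing RowNumber has already been added to the session.
    CREATE TEMPORARY FUNCTION RowNumber AS 'org.apache.hadoop.hive.udf.RowNumber';

    INSERT OVERWRITE TABLE ALLOG
    SELECT t1.platform, t1.user_id, t1.seq, t2.click_url AS from_url, t1.click_url AS to_url FROM
    (SELECT s.*, RowNumber(s.platform, s.user_id) AS seq FROM
      (SELECT * FROM trlog DISTRIBUTE BY platform, user_id SORT BY platform, user_id, click_time) s) t1
    LEFT OUTER JOIN
    (SELECT s.*, RowNumber(s.platform, s.user_id) AS seq FROM
      (SELECT * FROM trlog DISTRIBUTE BY platform, user_id SORT BY platform, user_id, click_time) s) t2
    ON t1.platform = t2.platform AND t1.user_id = t2.user_id AND t1.seq = t2.seq + 1;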
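The join condition `t1.seq = t2.seq + 1` pairs every numbered row with the row that precedes it for the same user, and the LEFT OUTER JOIN leaves FROM_URL as NULL when no such row exists. A minimal Java sketch of that pairing step follows; the class `SeqJoinSketch`, the method `joinWithPrevious`, and the row layout are hypothetical, and the lookup key includes the platform on the assumption that sequences are per platform and user, as in the ALLOG description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the "seq = seq + 1" self-join.
final class SeqJoinSketch {

    // Input rows:  {platform, userId, seq, clickUrl}, with seq already assigned.
    // Output rows: {platform, userId, seq, fromUrl, toUrl}, in input order.
    static List<String[]> joinWithPrevious(List<String[]> numbered) {
        // Build side of the join: URL indexed by (platform, userId, seq).
        Map<String, String> urlByKey = new HashMap<>();
        for (String[] r : numbered) {
            urlByKey.put(r[0] + "|" + r[1] + "|" + r[2], r[3]);
        }

        List<String[]> out = new ArrayList<>();
        for (String[] r : numbered) {
            int prevSeq = Integer.parseInt(r[2]) - 1;
            // A missing key yields null, mirroring the unmatched side of a LEFT OUTER JOIN.
            String fromUrl = urlByKey.get(r[0] + "|" + r[1] + "|" + prevSeq);
            out.add(new String[] {r[0], r[1], r[2], fromUrl, r[3]});
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> numbered = Arrays.asList(
            new String[] {"WEB", "12332321", "1", "/home/"},
            new String[] {"WEB", "12332321", "2", "/selectcat/er/"},
            new String[] {"WAP", "32483923", "1", "/m/home/"});
        for (String[] row : joinWithPrevious(numbered)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```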

    2.

    INSERT OVERWRITE TABLE ALLOG
    SELECT t1.platform, t1.user_id, t1.seq, t2.click_url AS from_url, t1.click_url AS to_url FROM
    (SELECT platform, user_id, click_time, click_url, count(1) AS seq
     FROM (SELECT a.*, b.click_time AS click_time1 FROM trlog a LEFT OUTER JOIN trlog b ON a.platform = b.platform AND a.user_id = b.user_id) t
     WHERE click_time >= click_time1
     GROUP BY platform, user_id, click_time, click_url) t1
    LEFT OUTER JOIN
    (SELECT platform, user_id, click_time, click_url, count(1) AS seq
     FROM (SELECT a.*, b.click_time AS click_time1 FROM trlog a LEFT OUTER JOIN trlog b ON a.platform = b.platform AND a.user_id = b.user_id) t
     WHERE click_time >= click_time1
     GROUP BY platform, user_id, click_time, click_url) t2
    ON t1.platform = t2.platform AND t1.user_id = t2.user_id AND t1.seq = t2.seq + 1;
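The pure SQL answer derives SEQ without any row-numbering function: for each click it counts how many clicks of the same user are not later than it (`click_time >= click_time1`, then `count(1)`). A minimal Java sketch of that counting idea for a single user is given below; `RankByCountSketch` and `seqByCounting` are hypothetical names. It also makes two properties visible: the work is quadratic in the number of clicks per user, and identical timestamps do not receive distinct sequence numbers.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating rank-by-counting for one user's clicks.
final class RankByCountSketch {

    // seq[i] = number of click times that are <= clickTimes.get(i),
    // which is what the self-join plus count(1) computes per row.
    static int[] seqByCounting(List<String> clickTimes) {
        int[] seq = new int[clickTimes.size()];
        for (int i = 0; i < clickTimes.size(); i++) {
            for (String other : clickTimes) {
                if (clickTimes.get(i).compareTo(other) >= 0) {
                    seq[i]++;
                }
            }
        }
        return seq;
    }

    public static void main(String[] args) {
        List<String> times = Arrays.asList(
            "2013-03-21 13:48:32.954",
            "2013-03-21 13:48:31.324",
            "2013-03-21 13:48:46.365");
        System.out.println(Arrays.toString(seqByCounting(times))); // prints [2, 1, 3]
    }
}
```

On a table of about 2 TB the quadratic self-join is the main cost of this approach, which is why the question also asks for a UDF-based variant.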

  • Original source: https://www.cnblogs.com/duanxingxing/p/4936751.html