zoukankan      html  css  js  c++  java
  • 1.hive开窗函数,分析函数

    http://yugouai.iteye.com/blog/1908121

    分析函数用于计算基于组的某种聚合值,它和聚合函数的不同之处是:对于每个组返回多行,而
    聚合函数对于每个组只返回一行。
    开窗函数指定了分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化


    drop table if exists student;

    create table student
    (
       name           string,
       class          tinyint,
       cooperator_name   string,
       score          tinyint
    )
    row format delimited fields terminated by '|';


    vi /root/test2
    adf|3|测试公司1|45
    asdf|3|测试公司2|55
    cfe|2|测试公司2|74
    3dd|3|测试公司5|78
    fda|1|测试公司7|80
    gds|2|测试公司9|92
    ffd|1|测试公司10|95
    dss|1|测试公司4|95
    ddd|3|测试公司3|99
    gf|3|测试公司9|99

    load data local inpath '/root/test2' into table student;

    hive> select * from student;
    OK
    adf     3       测试公司1       45
    asdf    3       测试公司2       55
    cfe     2       测试公司2       74
    dd      3       测试公司5       NULL
    fda     1       测试公司7       80
    gds     2       测试公司9       92
    ffd     1       测试公司10      95
    dss     1       测试公司4       95
    ddd     3       测试公司3       99
    gf      3       测试公司9       99
    Time taken: 0.642 seconds, Fetched: 10 row(s)

    1:over后的写法:    
       over(order by salary) 按照salary排序进行累计,order by是个默认的开窗函数
       over(partition by deptno)按照部门分区
       over(partition by deptno order by salary)按照部门分区,在一个分区内按照salary排序

    2:开窗的窗口范围:
    2.1
    over(order by salary range between 2 preceding and 2 following):窗口范围为当前行数据幅度减2加2后的范围内的。

    --sum(s)over(order by s range between 2 preceding and 2 following) 表示加2或2的范围内的求和

    select name,class,score,sum(score)over(order by score range between 2 preceding and 2 following) mm from student;
    adf        3        45        45  --45加2减2即43到47,但是s在这个范围内只有45
    asdf       3        55        55
    cfe        2        74        74
    3dd        3        78        158 --78在76到80范围内有78,80,求和得158
    fda        1        80        158
    gds        2        92        92
    ffd        1        95        190
    dss        1        95        190
    ddd        3        99        198
    gf         3        99        198

    2.2
    over(order by salary rows between 5 preceding and 5 following):窗口范围为当前行前后各移动5行。
    举例:

    --sum(s)over(order by s rows between 2 preceding and 2 following)表示在上下两行之间的范围内
    select name,class,s, sum(s)over(order by s rows between 2 preceding and 2 following) mm from t2
    adf        3        45        174  (45+55+74=174)
    asdf       3        55        252   (45+55+74+78=252)
    cfe        2        74        332    (74+55+45+78+80=332)
    3dd        3        78        379    (78+74+55+80+92=379)
    fda        1        80        419
    gds        2        92        440
    ffd        1        95        461
    dss        1        95        480
    ddd        3        99        388
    gf         3        99        293
     
    2.3
    over(order by salary range between unbounded preceding and unbounded following)或者
    over(order by salary rows between unbounded preceding and unbounded following):窗口不做限制

    3.与over函数结合的几个函数介绍
    row_number()over()、rank()over()和dense_rank()over()函数的使用

    --create table
    drop table if exists student;
    create table student
    (
       name           string,
       class          tinyint,
       score          tinyint
    )
    row format delimited fields terminated by '|';

    vi /root/test3
    adf|3|45
    asdf|3|55
    cfe|2|74
    3dd|3|78
    fda|1|80
    gds|2|92
    ffd|1|95
    dss|1|95
    ddd|3|99
    gf|3|99

    3.1rank()over()类型的
    select * from                                                                      
        (                                                                           
        select name,class,score,rank()over(partition by class order by score desc) mm from t2
        )                                                                           
        where mm=1;
    得到的结果是:
    dss        1        95        1
    ffd        1        95        1
    gds        2        92        1
    gf         3        99        1
    ddd        3        99        1

    注意:
    1.在求第一名成绩的时候,不能用row_number(),因为如果同班有两个并列第一,row_number()只返回一个结果;

    3.2 row_number()over 类型
    select * from                                                                      
        (                                                                           
        select name,class,score,row_number()over(partition by class order by score desc) mm from t2
        )                                                                           
        where mm=1;
    1        95        1  --95有两名但是只显示一个
    2        92        1
    3        99        1 --99有两名但也只显示一个

    3.3
    rank()和dense_rank()可以将所有的都查找出来:
    如上可以看到采用rank可以将并列第一名的都查找出来;
    rank()和dense_rank()区别:
    --rank()是跳跃排序,有两个第二名时接下来就是第四名;
    select name,class,score,rank()over(partition by class order by score desc) mm from t2
    dss        1        95        1
    ffd        1        95        1
    fda        1        80        3 --直接就跳到了第三
    gds        2        92        1
    cfe        2        74        2
    gf         3        99        1
    ddd        3        99        1
    3dd        3        78        3
    asdf       3        55        4
    adf        3        45        5
    --dense_rank()l是连续排序,有两个第二名时仍然跟着第三名
    select name,class,score,dense_rank()over(partition by class order by score desc) mm from t2
    dss        1        95        1
    ffd        1        95        1
    fda        1        80        2 --连续排序(仍为2)
    gds        2        92        1
    cfe        2        74        2
    gf         3        99        1
    ddd        3        99        1
    3dd        3        78        2
    asdf       3        55        3
    adf        3        45        4

    3.4
    --sum()over()的使用
    select name,class,score, sum(s)over(partition by class order by score desc) mm from t2 --根据班级进行分数求和
    dss        1        95        190 --由于两个95都是第一名,所以累加时是两个第一名的相加
    ffd        1        95        190
    fda        1        80        270 --第一名加上第二名的
    gds        2        92        92  --92
    cfe        2        74        166 --92+74=166
    gf         3        99        198 --99+99
    ddd        3        99        198 --99+99
    3dd        3        78        276 --198+78
    asdf       3        55        331 --276+55
    adf        3        45        376 --331+45


    4.first_value()over()和last_value()over()的使用
    --create table
    drop table if exists rm_circuit_route;
    create table rm_circuit_route
    (
       opr_id           string,
       serial_no         tinyint,
       res_type         string,
       res_id          tinyint,
       route_name          string
    )
    row format delimited fields terminated by '|';


    vi /root/test3
    000100190000000000021311|1|2|000100190000000001289311|光大会展1
    000100190000000000021311|1|6|000100190000000001289311|光大会展2
    000100190000000000021311|7|2|000100190000000001289318|光大会展3
    000100190000000000021339|1|4|000100190000000001289311|光大会展4
    000100190000000000021311|3|7|000100190000000001289313|光大会展5
    000100190000000000021355|1|2|000100190000000001289314|光大会展6
    000100190000000000021355|2|2|000100190000000001289314|光大会展7
    000100190000000000021311|1|9|000100190000000001289315|光大会展8
    000100190000000000021339|8|2|000100190000000001289316|光大会展9
    000100190000000000021311|1|2|000100190000000001289311|光大会展10

    --找出这三条电路每条电路的第一条记录类型和最后一条记录类型
    ---使用rows BETWEEN unbounded preceding AND unbounded following
    SELECT opr_id,res_type,
           first_value(res_type) over(PARTITION BY opr_id ORDER BY res_type) low,
           last_value(res_type) over(PARTITION BY opr_id ORDER BY res_type rows BETWEEN unbounded preceding AND unbounded following) high
      FROM rm_circuit_route
    WHERE opr_id IN ('000100190000000000021311','000100190000000000021355','000100190000000000021339')
     ORDER BY opr_id;


    ---取last_value时不使用rows BETWEEN unbounded preceding AND unbounded following的结果
    SELECT opr_id,res_type,
           first_value(res_type) over(PARTITION BY opr_id ORDER BY res_type) low,
           last_value(res_type) over(PARTITION BY opr_id ORDER BY res_type) high
      FROM rm_circuit_route
     WHERE opr_id IN ('000100190000000000021311','000100190000000000021355','000100190000000000021339')
     ORDER BY opr_id;
    如下图可以看到,如果不使用
    rows BETWEEN unbounded preceding AND unbounded following,取出的last_value由于与res_type进行进行排列,因此取出的电路的最后一行记录的类型就不是按照电路的范围提取了,而是以res_type为范围进行提取了


    4.1
    在first_value和last_value中ignore nulls的使用
    取出该电路的第一条记录,加上ignore nulls后,如果第一条是判断的那个字段是空的,则默认取下一条,结果如下所示:
    SELECT opr_id,res_type,
           first_value(route_name ignore nulls) over(ORDER BY opr_id)
      FROM rm_circuit_route
     WHERE opr_id=('000100190000000000021311');

     或者
     SELECT opr_id,res_type,
           first_value(route_name ignore nulls) over(ORDER BY opr_id rows BETWEEN unbounded preceding AND unbounded following)
      FROM rm_circuit_route
     WHERE opr_id=('000100190000000000021311');


     5.
    --lag() over()函数用法(取出前n行数据)
    lag(expresstion,<offset>,<default>)
    with a as
    (select 1 id,'a' name from dual
     union
     select 2 id,'b' name from dual
     union
     select 3 id,'c' name from dual
     union
     select 4 id,'d' name from dual
     union
     select 5 id,'e' name from dual
    )
    select id,name,lag(id,1,'')over(order by name) from a;

    --lead() over()函数用法(取出后N行数据)
    lead(expresstion,<offset>,<default>)
    with a as
    (select 1 id,'a' name from dual
     union
     select 2 id,'b' name from dual
     union
     select 3 id,'c' name from dual
     union
     select 4 id,'d' name from dual
     union
     select 5 id,'e' name from dual
    )
    select id,name,lead(id,1,'')over(order by name) from a;

    --ratio_to_report(a)函数用法 Ratio_to_report() 括号中就是分子,over() 括号中就是分母
    with a as (select 1 a from dual
               union all
    select 1 a from dual
               union  all
    select 1 a from dual
               union all
    select 2 a from dual
               union all
    select 3 a from dual
               union all
    select 4 a from dual
               union all
    select 4 a from dual
               union all
    select 5 a from dual
               )
    select a, ratio_to_report(a)over(partition by a) b from a
    order by a;

    with a as (select 1 a from dual
               union all
    select 1 a from dual
               union  all
    select 1 a from dual
               union all
    select 2 a from dual
               union all
    select 3 a from dual
               union all
    select 4 a from dual
               union all
    select 4 a from dual
               union all
    select 5 a from dual
               )
    select a, ratio_to_report(a)over() b from a --分母缺省就是整个占比
    order by a;

    with a as (select 1 a from dual
               union all
    select 1 a from dual
               union  all
    select 1 a from dual
               union all
    select 2 a from dual
               union all
    select 3 a from dual
               union all
    select 4 a from dual
               union all
    select 4 a from dual
               union all
    select 5 a from dual
               )
    select a, ratio_to_report(a)over() b from a
    group by a order by a;--分组后的占比

    ---percent_rank用法
    计算方法:所在组排名序号-1除以该组所有的行数-1,如下所示自己计算的pr1与通过percent_rank函数得到的值是一样的:
    SELECT a.deptno,
           a.ename,
           a.sal,
           a.r,
           b.n,
           (a.r-1)/(n-1) pr1,
           percent_rank() over(PARTITION BY a.deptno ORDER BY a.sal) pr2
      FROM (SELECT deptno,
                   ename,
                   sal,
                   rank() over(PARTITION BY deptno ORDER BY sal) r --计算出在组中的排名序号
              FROM emp
             ORDER BY deptno, sal) a,
           (SELECT deptno, COUNT(1) n FROM emp GROUP BY deptno) b --按部门计算每个部门的所有成员数
     WHERE a.deptno = b.deptno;

    --cume_dist函数
    计算方法:所在组排名序号除以该组所有的行数,但是如果存在并列情况,则需加上并列的个数-1,
              如下所示自己计算的pr1与通过percent_rank函数得到的值是一样的:
    SELECT a.deptno,
           a.ename,
           a.sal,
           a.r,
           b.n,
           c.rn,
           (a.r + c.rn - 1) / n pr1,
           cume_dist() over(PARTITION BY a.deptno ORDER BY a.sal) pr2
      FROM (SELECT deptno,
                   ename,
                   sal,
                   rank() over(PARTITION BY deptno ORDER BY sal) r
              FROM emp
             ORDER BY deptno, sal) a,
           (SELECT deptno, COUNT(1) n FROM emp GROUP BY deptno) b,
           (SELECT deptno, r, COUNT(1) rn,sal
              FROM (SELECT deptno,sal,
                           rank() over(PARTITION BY deptno ORDER BY sal) r
                      FROM emp)
             GROUP BY deptno, r,sal
             ORDER BY deptno) c --c表就是为了得到每个部门员工工资的一样的个数
     WHERE a.deptno = b.deptno
       AND a.deptno = c.deptno(+)
       AND a.sal = c.sal;


    --percentile_cont函数

    含义:输入一个百分比(该百分比就是按照percent_rank函数计算的值),返回该百分比位置的平均值
    如下,输入百分比为0.7,因为0.7介于0.6和0.8之间,因此返回的结果就是0.6对应的sal的1500加上0.8对应的sal的1600平均
    SELECT ename,
           sal,
           deptno,
           percentile_cont(0.7) within GROUP(ORDER BY sal) over(PARTITION BY deptno) "Percentile_Cont",
           percent_rank() over(PARTITION BY deptno ORDER BY sal) "Percent_Rank"
      FROM emp
     WHERE deptno IN (30, 60);


    PERCENTILE_DISC函数

    功能描述:返回一个与输入的分布百分比值相对应的数据值,分布百分比的计算方法见函数CUME_DIST,如果没有正好对应的数据值,就取大于该分布值的下一个值。
    注意:本函数与PERCENTILE_CONT的区别在找不到对应的分布值时返回的替代值的计算方法不同

    SAMPLE:下例中0.7的分布值在部门30中没有对应的Cume_Dist值,所以就取下一个分布值0.83333333所对应的SALARY来替代

    SELECT ename,
           sal,
           deptno,
           percentile_disc(0.7) within GROUP(ORDER BY sal) over(PARTITION BY deptno) "Percentile_Disc",
           cume_dist() over(PARTITION BY deptno ORDER BY sal) "Cume_Dist"
      FROM emp
     WHERE deptno IN (30, 60);

     
     
     
     开窗函数二:实验部分
     
    Hive分析窗口函数(1) GROUPING SETS,GROUPING__ID,CUBE,ROLLUP
     数据准备:
    vi test
    2015-03,2015-03-10,cookie1
    2015-03,2015-03-10,cookie5
    2015-03,2015-03-12,cookie7
    2015-04,2015-04-12,cookie3
    2015-04,2015-04-13,cookie2
    2015-04,2015-04-13,cookie4
    2015-04,2015-04-16,cookie4
    2015-03,2015-03-10,cookie2
    2015-03,2015-03-10,cookie3
    2015-04,2015-04-12,cookie5
    2015-04,2015-04-13,cookie6
    2015-04,2015-04-15,cookie3
    2015-04,2015-04-15,cookie2
    2015-04,2015-04-16,cookie1

    --create table
    CREATE EXTERNAL TABLE austin (
        month         STRING,
        day STRING,
        cookieid     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    stored as textfile;
    --导入本地数据
    load data local inpath '/hadoop/home/austin' into table austin;
     
    hive> select * from austin;
    OK
    2015-03 2015-03-10      cookie1
    2015-03 2015-03-10      cookie5
    2015-03 2015-03-12      cookie7
    2015-04 2015-04-12      cookie3
    2015-04 2015-04-13      cookie2
    2015-04 2015-04-13      cookie4
    2015-04 2015-04-16      cookie4
    2015-03 2015-03-10      cookie2
    2015-03 2015-03-10      cookie3
    2015-04 2015-04-12      cookie5
    2015-04 2015-04-13      cookie6
    2015-04 2015-04-15      cookie3
    2015-04 2015-04-15      cookie2
    2015-04 2015-04-16      cookie1
     
    --GROUPING SETS
    在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL
    SELECT month,
        day,
        COUNT(DISTINCT cookieid) AS uv,
        GROUPING__ID
    FROM austin
    GROUP BY month,day
    GROUPING SETS (month,day)
    ORDER BY GROUPING__ID;
     
    Total MapReduce CPU Time Spent: 5 seconds 740 msec
    OK
    2015-04 NULL            6       1
    2015-03 NULL            5       1
    NULL    2015-04-16      2       2
    NULL    2015-04-15      2       2
    NULL    2015-04-13      3       2
    NULL    2015-04-12      2       2
    NULL    2015-03-12      1       2
    NULL    2015-03-10      4       2
    Time taken: 67.361 seconds, Fetched: 8 row(s)
     
    等价于
    SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM austin GROUP BY month
    UNION ALL
    SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM austin GROUP BY day

    SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM austin GROUP BY month;
    Total MapReduce CPU Time Spent: 4 seconds 800 msec
    OK
    2015-03 NULL    5       1
    2015-04 NULL    6       1
    Time taken: 33.225 seconds, Fetched: 2 row(s)

    SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM austin GROUP BY day;
    Total MapReduce CPU Time Spent: 3 seconds 140 msec
    OK
    NULL    2015-03-10      4       2
    NULL    2015-03-12      1       2
    NULL    2015-04-12      2       2
    NULL    2015-04-13      3       2
    NULL    2015-04-15      2       2
    NULL    2015-04-16      2       2
    Time taken: 30.019 seconds, Fetched: 6 row(s)

    最终:
    SELECT t.month,
        t.day,
        t.uv,
        t.GROUPING__ID
    FROM(
    SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM austin GROUP BY month
        UNION ALL
        SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM austin GROUP BY day
    ) t
    ORDER BY GROUPING__ID;
    Total MapReduce CPU Time Spent: 8 seconds 330 msec
    OK
    2015-04 NULL            6       1
    2015-03 NULL            5       1
    NULL    2015-04-16      2       2
    NULL    2015-04-15      2       2
    NULL    2015-04-13      3       2
    NULL    2015-04-12      2       2
    NULL    2015-03-12      1       2
    NULL    2015-03-10      4       2
    Time taken: 65.355 seconds, Fetched: 8 row(s)

    举一反三:加入分组(month,day)--每一个月的每一天统计uv
    SELECT
        month,
        day,
        COUNT(DISTINCT cookieid) AS uv,
        GROUPING__ID
    FROM austin
    GROUP BY month,day
    GROUPING SETS (month,day,(month,day))
    ORDER BY GROUPING__ID;


    等价于
    SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM austin GROUP BY month
    UNION ALL
    SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM austin GROUP BY day
    UNION ALL
    SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM austin GROUP BY month,day

    其中的 GROUPING__ID,表示结果属于哪一个分组集合。

    SELECT t.month,
        t.day,
        t.uv,
        t.GROUPING__ID
    FROM(
    SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM austin GROUP BY month
        UNION ALL
        SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM austin GROUP BY day
        UNION ALL
        SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM austin GROUP BY month,day)
    ORDER BY GROUPING__ID;

    Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.06 sec   HDFS Read: 5427 HDFS Write: 276 SUCCESS
    Total MapReduce CPU Time Spent: 7 seconds 170 msec
    OK
    2015-04 NULL    6       1
    2015-03 NULL    5       1
    NULL    2015-03-10      4       2
    NULL    2015-04-16      2       2
    NULL    2015-04-15      2       2
    NULL    2015-04-13      3       2
    NULL    2015-04-12      2       2
    NULL    2015-03-12      1       2
    2015-04 2015-04-16      2       3
    2015-04 2015-04-12      2       3
    2015-04 2015-04-13      3       3
    2015-03 2015-03-12      1       3
    2015-03 2015-03-10      4       3
    2015-04 2015-04-15      2       3
    Time taken: 91.51 seconds, Fetched: 14 row(s)

    --CUBE
    根据GROUP BY的维度的所有组合进行聚合。
    month day 分组的有4中组合
    SELECT month,
        day,
        COUNT(DISTINCT cookieid) AS uv,
        GROUPING__ID
    FROM austin
    GROUP BY month,day
    WITH CUBE
    ORDER BY GROUPING__ID;

    Total MapReduce CPU Time Spent: 4 seconds 480 msec
    OK
    NULL    NULL            7       0
    2015-03 NULL            5       1
    2015-04 NULL            6       1
    NULL    2015-04-16      2       2
    NULL    2015-04-15      2       2
    NULL    2015-04-13      3       2
    NULL    2015-04-12      2       2
    NULL    2015-03-12      1       2
    NULL    2015-03-10      4       2
    2015-04 2015-04-12      2       3
    2015-04 2015-04-16      2       3
    2015-03 2015-03-12      1       3
    2015-03 2015-03-10      4       3
    2015-04 2015-04-15      2       3
    2015-04 2015-04-13      3       3
    Time taken: 47.217 seconds, Fetched: 15 row(s)


    ---ROLLUP
    是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。
    比如,以month维度进行层级聚合:
    SELECT month,
        day,
        COUNT(DISTINCT cookieid) AS uv,
        GROUPING__ID  
    FROM austin
    GROUP BY month,day
    WITH ROLLUP
    ORDER BY GROUPING__ID;

    Total MapReduce CPU Time Spent: 6 seconds 520 msec
    OK
    NULL    NULL    7       0
    2015-04 NULL    6       1
    2015-03 NULL    5       1
    2015-04 2015-04-16      2       3
    2015-04 2015-04-15      2       3
    2015-04 2015-04-13      3       3
    2015-04 2015-04-12      2       3
    2015-03 2015-03-12      1       3
    2015-03 2015-03-10      4       3
    Time taken: 239.641 seconds, Fetched: 9 row(s)
    可以实现这样的上钻过程:
    月天的UV->月的UV->总UV

    --把month和day调换顺序,则以day维度进行层级聚合:
     
    SELECT
    day,
    month,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID  
    FROM lxw1234
    GROUP BY day,month
    WITH ROLLUP
    ORDER BY GROUPING__ID;

    day                    month              uv     GROUPING__ID
    -------------------------------------------------------
    NULL            NULL               7       0
    2015-04-13      NULL               3       1
    2015-03-12      NULL               1       1
    2015-04-15      NULL               2       1
    2015-03-10      NULL               4       1
    2015-04-16      NULL               2       1
    2015-04-12      NULL               2       1
    2015-04-12      2015-04            2       3
    2015-03-10      2015-03            4       3
    2015-03-12      2015-03            1       3
    2015-04-13      2015-04            3       3
    2015-04-15      2015-04            2       3
    2015-04-16      2015-04            2       3
     
    可以实现这样的上钻过程:
    天月的UV->天的UV->总UV
    (这里,根据天和月进行聚合,和根据天聚合结果一样,因为有父子关系,如果是其他维度组合的话,就会不一样)

    Hive分析窗口函数(2) NTILE,ROW_NUMBER,RANK,DENSE_RANK
    数据准备:表austin 公司的164 qa_test库中
    vim austin2
    cookie1,2015-04-10,1
    cookie1,2015-04-11,5
    cookie1,2015-04-12,7
    cookie1,2015-04-13,3
    cookie1,2015-04-14,2
    cookie1,2015-04-15,4
    cookie1,2015-04-16,4
    cookie2,2015-04-10,2
    cookie2,2015-04-11,3
    cookie2,2015-04-12,5
    cookie2,2015-04-13,6
    cookie2,2015-04-14,3
    cookie2,2015-04-15,9
    cookie2,2015-04-16,7

    --create table
    CREATE TABLE austin2 (
    cookieid string,
    createtime string,   --day
    pv INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    stored as textfile;
    --导入本地数据
    load data local inpath '/home/hadoop/austin2' into table austin2;

    hive> select * from austin2;
    OK
    cookie1 2015-04-10      1
    cookie1 2015-04-11      5
    cookie1 2015-04-12      7
    cookie1 2015-04-13      3
    cookie1 2015-04-14      2
    cookie1 2015-04-15      4
    cookie1 2015-04-16      4
    cookie2 2015-04-10      2
    cookie2 2015-04-11      3
    cookie2 2015-04-12      5
    cookie2 2015-04-13      6
    cookie2 2015-04-14      3
    cookie2 2015-04-15      9
    cookie2 2015-04-16      7
    Time taken: 0.055 seconds, Fetched: 14 row(s)


    --NTILE
    NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
    NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
    如果切片不均匀,默认增加第一个切片的分布

    SELECT
    cookieid,
    createtime,
    pv,
    NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,    --分组内将数据分成2片
    NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,  --分组内将数据分成3片
    NTILE(4) OVER(ORDER BY createtime) AS rn3        --将所有数据分成4片
    FROM austin2
    ORDER BY cookieid,createtime;

    SELECT cookieid,
        createtime,
        pv,
        NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
        NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
        NTILE(4) OVER(ORDER BY createtime) AS rn3
    FROM austin2
    ORDER BY cookieid,createtime;

    Total MapReduce CPU Time Spent: 8 seconds 100 msec
    OK
    cookie1 2015-04-10      1       1       1       1
    cookie1 2015-04-11      5       1       1       1
    cookie1 2015-04-12      7       1       1       2
    cookie1 2015-04-13      3       1       2       2
    cookie1 2015-04-14      2       2       2       3
    cookie1 2015-04-15      4       2       3       4
    cookie1 2015-04-16      4       2       3       4
    cookie2 2015-04-10      2       1       1       1
    cookie2 2015-04-11      3       1       1       1
    cookie2 2015-04-12      5       1       1       2
    cookie2 2015-04-13      6       1       2       2
    cookie2 2015-04-14      3       2       2       3
    cookie2 2015-04-15      9       2       3       3
    cookie2 2015-04-16      7       2       3       4
    Time taken: 66.363 seconds, Fetched: 14 row(s)

    –比如,统计一个cookie,pv数最多的前1/3的天
    SELECT
    cookieid,
    createtime,
    pv,
    NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
    FROM austin2;

    --rn = 1 的记录,就是我们想要的结果
    Total MapReduce CPU Time Spent: 2 seconds 620 msec
    OK
    cookie1 2015-04-12      7       1
    cookie1 2015-04-11      5       1
    cookie1 2015-04-16      4       1
    cookie1 2015-04-15      4       2
    cookie1 2015-04-13      3       2
    cookie1 2015-04-14      2       3
    cookie1 2015-04-10      1       3
    cookie2 2015-04-15      9       1
    cookie2 2015-04-16      7       1
    cookie2 2015-04-13      6       1
    cookie2 2015-04-12      5       2
    cookie2 2015-04-11      3       2
    cookie2 2015-04-14      3       3
    cookie2 2015-04-10      2       3
    Time taken: 21.98 seconds, Fetched: 14 row(s)

    ---ROW_NUMBER

    ROW_NUMBER() 从1开始,按照顺序,生成分组内记录的序列
    –比如,按照pv降序排列,生成分组内每天的pv名次
    ROW_NUMBER() 的应用场景非常多,再比如,获取分组内排序第一的记录;获取一个session中的第一条refer等。
    SELECT
    cookieid,
    createtime,
    pv,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
    FROM austin2;
    Total MapReduce CPU Time Spent: 2 seconds 830 msec
    OK
    cookie1 2015-04-12      7       1
    cookie1 2015-04-11      5       2
    cookie1 2015-04-16      4       3
    cookie1 2015-04-15      4       4
    cookie1 2015-04-13      3       5
    cookie1 2015-04-14      2       6
    cookie1 2015-04-10      1       7
    cookie2 2015-04-15      9       1
    cookie2 2015-04-16      7       2
    cookie2 2015-04-13      6       3
    cookie2 2015-04-12      5       4
    cookie2 2015-04-11      3       5
    cookie2 2015-04-14      3       6
    cookie2 2015-04-10      2       7
    Time taken: 21.9 seconds, Fetched: 14 row(s)


    ---RANK 和 DENSE_RANK

    —RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位
    —DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位

    SELECT
    cookieid,
    createtime,
    pv,
    RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
    DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
    FROM austin2
    WHERE cookieid = 'cookie1';

    Total MapReduce CPU Time Spent: 3 seconds 360 msec
    OK
    cookie1 2015-04-12      7       1       1       1
    cookie1 2015-04-11      5       2       2       2
    cookie1 2015-04-16      4       3       3       3
    cookie1 2015-04-15      4       3       3       4
    cookie1 2015-04-13      3       5       4       5
    cookie1 2015-04-14      2       6       5       6
    cookie1 2015-04-10      1       7       6       7
    Time taken: 23.128 seconds, Fetched: 7 row(s)


    rn1: 15号和16号并列第3, 13号排第5
    rn2: 15号和16号并列第3, 13号排第4
    rn3: 如果相等,则按记录值排序,生成唯一的次序,如果所有记录值都相等,或许会随机排吧。


    Hive分析窗口函数(3) LAG,LEAD,FIRST_VALUE,LAST_VALUE
    数据准备:
    vi austin3
    cookie1,2015-04-10 10:00:02,url2
    cookie1,2015-04-10 10:00:00,url1
    cookie1,2015-04-10 10:03:04,1url3
    cookie1,2015-04-10 10:50:05,url6
    cookie1,2015-04-10 11:00:00,url7
    cookie1,2015-04-10 10:10:00,url4
    cookie1,2015-04-10 10:50:01,url5
    cookie2,2015-04-10 10:00:02,url22
    cookie2,2015-04-10 10:00:00,url11
    cookie2,2015-04-10 10:03:04,1url33
    cookie2,2015-04-10 10:50:05,url66
    cookie2,2015-04-10 11:00:00,url77
    cookie2,2015-04-10 10:10:00,url44
    cookie2,2015-04-10 10:50:01,url55
     
     --create table
    CREATE TABLE austin3 (
    cookieid string,
    createtime string,  --页面访问时间
    url STRING       --被访问页面
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    stored as textfile;

    --导入本地数据
    load data local inpath '/home/hadoop/austin3' into table austin3;

    hive> select * from austin3;
    OK
    cookie1 2015-04-10 10:00:02     url2
    cookie1 2015-04-10 10:00:00     url1
    cookie1 2015-04-10 10:03:04     1url3
    cookie1 2015-04-10 10:50:05     url6
    cookie1 2015-04-10 11:00:00     url7
    cookie1 2015-04-10 10:10:00     url4
    cookie1 2015-04-10 10:50:01     url5
    cookie2 2015-04-10 10:00:02     url22
    cookie2 2015-04-10 10:00:00     url11
    cookie2 2015-04-10 10:03:04     1url33
    cookie2 2015-04-10 10:50:05     url66
    cookie2 2015-04-10 11:00:00     url77
    cookie2 2015-04-10 10:10:00     url44
    cookie2 2015-04-10 10:50:01     url55
    Time taken: 0.05 seconds, Fetched: 14 row(s)

    LAG

    LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
    第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
    LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
    FROM austin3;

    Total MapReduce CPU Time Spent: 3 seconds 220 msec
    OK
    cookie1 2015-04-10 10:00:00     url1    1       1970-01-01 00:00:00     NULL
    cookie1 2015-04-10 10:00:02     url2    2       2015-04-10 10:00:00     NULL
    cookie1 2015-04-10 10:03:04     1url3   3       2015-04-10 10:00:02     2015-04-10 10:00:00
    cookie1 2015-04-10 10:10:00     url4    4       2015-04-10 10:03:04     2015-04-10 10:00:02
    cookie1 2015-04-10 10:50:01     url5    5       2015-04-10 10:10:00     2015-04-10 10:03:04
    cookie1 2015-04-10 10:50:05     url6    6       2015-04-10 10:50:01     2015-04-10 10:10:00
    cookie1 2015-04-10 11:00:00     url7    7       2015-04-10 10:50:05     2015-04-10 10:50:01
    cookie2 2015-04-10 10:00:00     url11   1       1970-01-01 00:00:00     NULL
    cookie2 2015-04-10 10:00:02     url22   2       2015-04-10 10:00:00     NULL
    cookie2 2015-04-10 10:03:04     1url33  3       2015-04-10 10:00:02     2015-04-10 10:00:00
    cookie2 2015-04-10 10:10:00     url44   4       2015-04-10 10:03:04     2015-04-10 10:00:02
    cookie2 2015-04-10 10:50:01     url55   5       2015-04-10 10:10:00     2015-04-10 10:03:04
    cookie2 2015-04-10 10:50:05     url66   6       2015-04-10 10:50:01     2015-04-10 10:10:00
    cookie2 2015-04-10 11:00:00     url77   7       2015-04-10 10:50:05     2015-04-10 10:50:01
    Time taken: 22.855 seconds, Fetched: 14 row(s)

    last_1_time: 指定了往上第1行的值,default为'1970-01-01 00:00:00'  
                 cookie1第一行,往上1行为NULL,因此取默认值 1970-01-01 00:00:00
                 cookie1第三行,往上1行值为第二行值,2015-04-10 10:00:02
                 cookie1第六行,往上1行值为第五行值,2015-04-10 10:50:01
    last_2_time: 指定了往上第2行的值,为指定默认值
                             cookie1第一行,往上2行为NULL
                             cookie1第二行,往上2行为NULL
                             cookie1第四行,往上2行为第二行值,2015-04-10 10:00:02
                             cookie1第七行,往上2行为第五行值,2015-04-10 10:50:01


    ---LEAD

    与LAG相反
    LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
    第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
    LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
    FROM austin3;

    Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.86 sec   HDFS Read: 9978 HDFS Write: 991 SUCCESS
    Total MapReduce CPU Time Spent: 2 seconds 860 msec
    OK
    cookie1 2015-04-10 10:00:00     url1    1       2015-04-10 10:00:02     2015-04-10 10:03:04
    cookie1 2015-04-10 10:00:02     url2    2       2015-04-10 10:03:04     2015-04-10 10:10:00
    cookie1 2015-04-10 10:03:04     1url3   3       2015-04-10 10:10:00     2015-04-10 10:50:01
    cookie1 2015-04-10 10:10:00     url4    4       2015-04-10 10:50:01     2015-04-10 10:50:05
    cookie1 2015-04-10 10:50:01     url5    5       2015-04-10 10:50:05     2015-04-10 11:00:00
    cookie1 2015-04-10 10:50:05     url6    6       2015-04-10 11:00:00     NULL
    cookie1 2015-04-10 11:00:00     url7    7       1970-01-01 00:00:00     NULL
    cookie2 2015-04-10 10:00:00     url11   1       2015-04-10 10:00:02     2015-04-10 10:03:04
    cookie2 2015-04-10 10:00:02     url22   2       2015-04-10 10:03:04     2015-04-10 10:10:00
    cookie2 2015-04-10 10:03:04     1url33  3       2015-04-10 10:10:00     2015-04-10 10:50:01
    cookie2 2015-04-10 10:10:00     url44   4       2015-04-10 10:50:01     2015-04-10 10:50:05
    cookie2 2015-04-10 10:50:01     url55   5       2015-04-10 10:50:05     2015-04-10 11:00:00
    cookie2 2015-04-10 10:50:05     url66   6       2015-04-10 11:00:00     NULL
    cookie2 2015-04-10 11:00:00     url77   7       1970-01-01 00:00:00     NULL
    Time taken: 22.859 seconds, Fetched: 14 row(s)

    --逻辑与LAG一样,只不过LAG是往上,LEAD是往下。

    ---FIRST_VALUE

    取分组内排序后,截止到当前行,第一个值
    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
    FROM austin3;
    Total MapReduce CPU Time Spent: 2 seconds 920 msec
    OK
    cookie1 2015-04-10 10:00:00     url1    1       url1
    cookie1 2015-04-10 10:00:02     url2    2       url1
    cookie1 2015-04-10 10:03:04     1url3   3       url1
    cookie1 2015-04-10 10:10:00     url4    4       url1
    cookie1 2015-04-10 10:50:01     url5    5       url1
    cookie1 2015-04-10 10:50:05     url6    6       url1
    cookie1 2015-04-10 11:00:00     url7    7       url1
    cookie2 2015-04-10 10:00:00     url11   1       url11
    cookie2 2015-04-10 10:00:02     url22   2       url11
    cookie2 2015-04-10 10:03:04     1url33  3       url11
    cookie2 2015-04-10 10:10:00     url44   4       url11
    cookie2 2015-04-10 10:50:01     url55   5       url11
    cookie2 2015-04-10 10:50:05     url66   6       url11
    cookie2 2015-04-10 11:00:00     url77   7       url11
    Time taken: 21.91 seconds, Fetched: 14 row(s)


    --LAST_VALUE

    取分组内排序后,截止到当前行,最后一个值
    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
    FROM austin3;

    Total MapReduce CPU Time Spent: 2 seconds 850 msec
    OK
    cookie1 2015-04-10 10:00:00     url1    1       url1
    cookie1 2015-04-10 10:00:02     url2    2       url2
    cookie1 2015-04-10 10:03:04     1url3   3       1url3
    cookie1 2015-04-10 10:10:00     url4    4       url4
    cookie1 2015-04-10 10:50:01     url5    5       url5
    cookie1 2015-04-10 10:50:05     url6    6       url6
    cookie1 2015-04-10 11:00:00     url7    7       url7
    cookie2 2015-04-10 10:00:00     url11   1       url11
    cookie2 2015-04-10 10:00:02     url22   2       url22
    cookie2 2015-04-10 10:03:04     1url33  3       1url33
    cookie2 2015-04-10 10:10:00     url44   4       url44
    cookie2 2015-04-10 10:50:01     url55   5       url55
    cookie2 2015-04-10 10:50:05     url66   6       url66
    cookie2 2015-04-10 11:00:00     url77   7       url77
    Time taken: 22.864 seconds, Fetched: 14 row(s)
    如果不指定ORDER BY,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果

    SELECT cookieid,
    createtime,
    url,
    FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2  
    FROM austin3;

    Total MapReduce CPU Time Spent: 2 seconds 750 msec
    OK
    cookie1 2015-04-10 10:00:02     url2    url2
    cookie1 2015-04-10 10:50:01     url5    url2
    cookie1 2015-04-10 10:10:00     url4    url2
    cookie1 2015-04-10 11:00:00     url7    url2
    cookie1 2015-04-10 10:50:05     url6    url2
    cookie1 2015-04-10 10:03:04     1url3   url2
    cookie1 2015-04-10 10:00:00     url1    url2
    cookie2 2015-04-10 10:50:01     url55   url55
    cookie2 2015-04-10 10:10:00     url44   url55
    cookie2 2015-04-10 11:00:00     url77   url55
    cookie2 2015-04-10 10:50:05     url66   url55
    cookie2 2015-04-10 10:03:04     1url33  url55
    cookie2 2015-04-10 10:00:00     url11   url55
    cookie2 2015-04-10 10:00:02     url22   url55
    Time taken: 21.846 seconds, Fetched: 14 row(s)

    SELECT cookieid,
    createtime,
    url,
    LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2  
    FROM austin3;
     Total MapReduce CPU Time Spent: 2 seconds 950 msec
    OK
    cookie1 2015-04-10 10:00:02     url2    url1
    cookie1 2015-04-10 10:50:01     url5    url1
    cookie1 2015-04-10 10:10:00     url4    url1
    cookie1 2015-04-10 11:00:00     url7    url1
    cookie1 2015-04-10 10:50:05     url6    url1
    cookie1 2015-04-10 10:03:04     1url3   url1
    cookie1 2015-04-10 10:00:00     url1    url1
    cookie2 2015-04-10 10:50:01     url55   url22
    cookie2 2015-04-10 10:10:00     url44   url22
    cookie2 2015-04-10 11:00:00     url77   url22
    cookie2 2015-04-10 10:50:05     url66   url22
    cookie2 2015-04-10 10:03:04     1url33  url22
    cookie2 2015-04-10 10:00:00     url11   url22
    cookie2 2015-04-10 10:00:02     url22   url22
    Time taken: 24.82 seconds, Fetched: 14 row(s)

    如果想要取分组内排序后最后一个值,则需要变通一下:
    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
    FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
    FROM austin3
    ORDER BY cookieid,createtime;

    Total MapReduce CPU Time Spent: 8 seconds 0 msec
    OK
    cookie1 2015-04-10 10:00:00     url1    1       url1    url7
    cookie1 2015-04-10 10:00:02     url2    2       url2    url7
    cookie1 2015-04-10 10:03:04     1url3   3       1url3   url7
    cookie1 2015-04-10 10:10:00     url4    4       url4    url7
    cookie1 2015-04-10 10:50:01     url5    5       url5    url7
    cookie1 2015-04-10 10:50:05     url6    6       url6    url7
    cookie1 2015-04-10 11:00:00     url7    7       url7    url7
    cookie2 2015-04-10 10:00:00     url11   1       url11   url77
    cookie2 2015-04-10 10:00:02     url22   2       url22   url77
    cookie2 2015-04-10 10:03:04     1url33  3       1url33  url77
    cookie2 2015-04-10 10:10:00     url44   4       url44   url77
    cookie2 2015-04-10 10:50:01     url55   5       url55   url77
    cookie2 2015-04-10 10:50:05     url66   6       url66   url77
    cookie2 2015-04-10 11:00:00     url77   7       url77   url77
    Time taken: 66.336 seconds, Fetched: 14 row(s)
    提示:在使用分析函数的过程中,要特别注意ORDER BY子句,用的不恰当,统计出的结果就不是你所期望的

    Hive分析窗口函数(4) CUME_DIST,PERCENT_RANK
    数据准备;vi austin4
    d1,user1,1000
    d1,user2,2000
    d1,user3,3000
    d2,user4,4000
    d2,user5,5000
     
    CREATE TABLE austin4 (
    dept STRING,
    userid string,
    sal INT
    ) ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    stored as textfile;

    load data local inpath '/home/hadoop/austin4' into table austin4;

    hive> select * from austin4;
    OK
    d1      user1   1000
    d1      user2   2000
    d1      user3   3000
    d2      user4   4000
    d2      user5   5000
    Time taken: 0.055 seconds, Fetched: 5 row(s)

    –-CUME_DIST 小于等于当前值的行数/分组内总行数
    –比如,统计小于等于当前薪水的人数,所占总人数的比例
    SELECT
    dept,
    userid,
    sal,
    CUME_DIST() OVER(ORDER BY sal) AS rn1,
    CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
    FROM austin4;

    Total MapReduce CPU Time Spent: 7 seconds 30 msec
    OK
    d1      user1   1000    0.2     0.3333333333333333
    d1      user2   2000    0.4     0.6666666666666666
    d1      user3   3000    0.6     1.0
    d2      user4   4000    0.8     0.5
    d2      user5   5000    1.0     1.0
    Time taken: 65.446 seconds, Fetched: 5 row(s)

    rn1: 没有partition,所有数据均为1组,总行数为5,
         第一行:小于等于1000的行数为1,因此,1/5=0.2
         第二行:小于等于2000的行数为2,因此,2/5=0.4
         第三行:小于等于3000的行数为3,因此,3/5=0.6
    rn2: 按照部门分组,dpet=d1的行数为3,
         第二行:小于等于2000的行数为2,因此,2/3=0.6666666666666666

    --PERCENT_RANK
    –PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1
    应用场景不了解,可能在一些特殊算法的实现中可以用到吧。

    SELECT
    dept,
    userid,
    sal,
    PERCENT_RANK() OVER(ORDER BY sal) AS rn1,   --分组内
    RANK() OVER(ORDER BY sal) AS rn11,          --分组内RANK值
    SUM(1) OVER(PARTITION BY NULL) AS rn12,     --分组内总行数
    PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
    FROM austin4;

    Total MapReduce CPU Time Spent: 8 seconds 370 msec
    OK
    d1      user1   1000    0.0     1       5       0.0
    d1      user2   2000    0.25    2       5       0.5
    d1      user3   3000    0.5     3       5       1.0
    d2      user4   4000    0.75    4       5       0.0
    d2      user5   5000    1.0     5       5       1.0
    Time taken: 68.557 seconds, Fetched: 5 row(s)

    rn1: rn1 = (rn11-1) / (rn12-1)
           第一行,(1-1)/(5-1)=0/4=0
           第二行,(2-1)/(5-1)=1/4=0.25
           第四行,(4-1)/(5-1)=3/4=0.75
    rn2: 按照dept分组,
         dept=d1的总行数为3
         第一行,(1-1)/(3-1)=0
         第三行,(3-1)/(3-1)=1

        
    Hive分析窗口函数(5) SUM,AVG,MIN,MAX

    准备数据:vi austin5
    cookie1,2015-04-10,1
    cookie1,2015-04-11,5
    cookie1,2015-04-12,7
    cookie1,2015-04-13,3
    cookie1,2015-04-14,2
    cookie1,2015-04-15,4
    cookie1,2015-04-16,4

     --create table   
    CREATE  TABLE austin5 (
        cookieid string,
        createtime string,   --day
        pv INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    stored as textfile;
        
    --导入本地数据
    load data local inpath '/home/hadoop/austin5' into table austin5;
        
    hive> select * from austin5;
        OK
    cookie1 2015-04-10      1
    cookie1 2015-04-11      5
    cookie1 2015-04-12      7
    cookie1 2015-04-13      3
    cookie1 2015-04-14      2
    cookie1 2015-04-15      4
    cookie1 2015-04-16      4

    SUM — 注意,结果和ORDER BY相关,默认为升序

    SELECT cookieid,
        createtime,
        pv,
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
        SUM(pv) OVER(PARTITION BY cookieid) AS pv3,--分组内所有行
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,   --当前行+往前3行
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,    --当前行+往前3行+往后1行
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6   ---当前行+往后所有行  
    FROM austin5;

    SELECT cookieid,
        createtime,
        pv,
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,
        SUM(pv) OVER(PARTITION BY cookieid) AS pv3,
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,  
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,    
        SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6  
    FROM austin5;
    Total MapReduce CPU Time Spent: 6 seconds 10 msec
    OK
    cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6
    -----------------------------------------------------------------------------
    cookie1  2015-04-10      1       1       1       26      1       6       26
    cookie1  2015-04-11      5       6       6       26      6       13      25
    cookie1  2015-04-12      7       13      13      26      13      16      20
    cookie1  2015-04-13      3       16      16      26      16      18      13
    cookie1  2015-04-14      2       18      18      26      17      21      10
    cookie1  2015-04-15      4       22      22      26      16      20      8
    cookie1  2015-04-16      4       26      26      26      13      13      4
    Time taken: 51.209 seconds, Fetched: 7 row(s)

    pv1: 分组内从起点到当前行的pv累积,如,11号的pv1=10号的pv+11号的pv, 12号=10号+11号+12号
    pv2: 同pv1
    pv3: 分组内(cookie1)所有的pv累加
    pv4: 分组内当前行+往前3行,如,11号=10号+11号, 12号=10号+11号+12号, 13号=10号+11号+12号+13号, 14号=11号+12号+13号+14号
    pv5: 分组内当前行+往前3行+往后1行,如,14号=11号+12号+13号+14号+15号=5+7+3+2+4=21
    pv6: 分组内当前行+往后所有行,如,13号=13号+14号+15号+16号=3+2+4+4=13,14号=14号+15号+16号=2+4+4=10

    如果不指定ROWS BETWEEN,默认为从起点到当前行;
    如果不指定ORDER BY,则将分组内所有值累加;
    关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
    PRECEDING:往前
    FOLLOWING:往后
    CURRENT ROW:当前行
    UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

        --AVG
        SELECT cookieid,
        createtime,
        pv,
        AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
        AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
        AVG(pv) OVER(PARTITION BY cookieid) AS pv3,    --分组内所有行
        AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,   --当前行+往前3行
        AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,    --当前行+往前3行+往后1行
        AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6   ---当前行+往后所有行  
        FROM lxw1234;
        cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6
        -----------------------------------------------------------------------------
        cookie1 2015-04-10      1       1.0     1.0     3.7142857142857144      1.0     3.0     3.7142857142857144
        cookie1 2015-04-11      5       3.0     3.0     3.7142857142857144      3.0     4.333333333333333       4.166666666666667
        cookie1 2015-04-12      7       4.333333333333333       4.333333333333333       3.7142857142857144      4.333333333333333       4.0     4.0
        cookie1 2015-04-13      3       4.0     4.0     3.7142857142857144      4.0     3.6     3.25
        cookie1 2015-04-14      2       3.6     3.6     3.7142857142857144      4.25    4.2     3.3333333333333335
        cookie1 2015-04-15      4       3.6666666666666665      3.6666666666666665      3.7142857142857144      4.0     4.0     4.0
        cookie1 2015-04-16      4       3.7142857142857144      3.7142857142857144      3.7142857142857144      3.25    3.25    4.0

        --MIN
        SELECT cookieid,
        createtime,
        pv,
        MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
        MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
        MIN(pv) OVER(PARTITION BY cookieid) AS pv3,--分组内所有行
        MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,   --当前行+往前3行
        MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,    --当前行+往前3行+往后1行
        MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6   ---当前行+往后所有行  
        FROM lxw1234;
         
        cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6
        -----------------------------------------------------------------------------
        cookie1 2015-04-10      1       1       1       1       1       1       1
        cookie1 2015-04-11      5       1       1       1       1       1       2
        cookie1 2015-04-12      7       1       1       1       1       1       2
        cookie1 2015-04-13      3       1       1       1       1       1       2
        cookie1 2015-04-14      2       1       1       1       2       2       2
        cookie1 2015-04-15      4       1       1       1       2       2       4
        cookie1 2015-04-16      4       1       1       1       2       2       4

        ----MAX
        SELECT cookieid,
        createtime,
        pv,
        MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
        MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
        MAX(pv) OVER(PARTITION BY cookieid) AS pv3,                                --分组内所有行
        MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,   --当前行+往前3行
        MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,    --当前行+往前3行+往后1行
        MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6   ---当前行+往后所有行  
        FROM lxw1234;
         
        cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6
        -----------------------------------------------------------------------------
        cookie1 2015-04-10      1       1       1       7       1       5       7
        cookie1 2015-04-11      5       5       5       7       5       7       7
        cookie1 2015-04-12      7       7       7       7       7       7       7
        cookie1 2015-04-13      3       7       7       7       7       7       4
        cookie1 2015-04-14      2       7       7       7       7       7       4
        cookie1 2015-04-15      4       7       7       7       7       7       4
        cookie1 2015-04-16      4       7       7       7       4       4       4




  • 相关阅读:
    mfc中的_T
    zmq的send
    c++内存相关函数
    如何运行linux shell程序
    Dockfile中的命令如何在.sh中执行
    Linux 错误: $' ': command not found
    实战ZeroMQ的PUSH/PULL推拉模式
    Servlet笔记
    进程控制块(PCB)
    Makefile规则介绍
  • 原文地址:https://www.cnblogs.com/yejibigdata/p/6376409.html
Copyright © 2011-2022 走看看