zoukankan      html  css  js  c++  java
  • Hive开窗函数整理

    分析函数用于计算基于组的某种聚合值,它和聚合函数的不同之处是:对于每个组返回多行,而聚合函数对于每个组只返回一行。

    开窗函数指定了分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化!到底什么是数据窗口?后面举例会详细讲到!

    基础结构:

    分析函数(如:sum(),max(),row_number()...) + 窗口子句(over函数)

    over函数写法: over(partition by cookieid order by createtime) 先根据cookieid字段分区,相同的cookieid分为一区,每个分区内根据createtime字段排序(默认升序)

    注:不加 partition by 的话则把整个数据集当作一个分区,不加 order by的话会对某些函数统计结果产生影响,如sum()

    版本:1.1.0-cdh5.13.3

    测试表

    create table if not exists test (
    cookie_id string,
    create_time string,
    pv int
    )row format delimited fields terminated by ',';

    测试数据

    a,2017-12-01,3
    b,2017-12-02,3
    cookie1,2017-12-02,4
    cookie1,2017-12-03,2
    cookie1,2017-12-04,3
    cookie1,2017-12-05,1
    cookie1,2017-12-06,6
    cookie1,2017-12-07,7
    cookie2,2017-12-02,1
    cookie2,2017-12-04,2
    cookie3,2017-12-06,7
    cookie3,2017-12-03,5

    SUM、AVG、MIN、MAX

    用于实现分组内所有和连续累积的统计。

    以SUM举例

    SELECT cookie_id,create_time,pv,
    SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time) AS pv1, -- 默认为从起点到当前行
    SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 
    SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv3,   --当前行+往前3行
    SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv4,    --当前行+往前3行+往后1行
    SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5   ---当前行+往后所有行  
    FROM test;

    结果:

    +------------+--------------+-----+------+------+------+------+------+--+
    | cookie_id  | create_time  | pv  | pv1  | pv2  | pv3  | pv4  | pv5  |
    +------------+--------------+-----+------+------+------+------+------+--+
    | a          | 2017-12-01   | 3   | 3    | 3    | 3    | 3    | 3    |
    | b          | 2017-12-02   | 3   | 3    | 3    | 3    | 3    | 3    |
    | cookie1    | 2017-12-02   | 4   | 4    | 4    | 4    | 6    | 23   |
    | cookie1    | 2017-12-03   | 2   | 6    | 6    | 6    | 9    | 19   |
    | cookie1    | 2017-12-04   | 3   | 9    | 9    | 9    | 10   | 17   |
    | cookie1    | 2017-12-05   | 1   | 10   | 10   | 10   | 16   | 14   |
    | cookie1    | 2017-12-06   | 6   | 16   | 16   | 12   | 19   | 13   |
    | cookie1    | 2017-12-07   | 7   | 23   | 23   | 17   | 17   | 7    |
    | cookie2    | 2017-12-02   | 1   | 1    | 1    | 1    | 3    | 3    |
    | cookie2    | 2017-12-04   | 2   | 3    | 3    | 3    | 3    | 2    |
    | cookie3    | 2017-12-03   | 5   | 5    | 5    | 5    | 12   | 12   |
    | cookie3    | 2017-12-06   | 7   | 12   | 12   | 12   | 12   | 7    |
    +------------+--------------+-----+------+------+------+------+------+--+

    注:这些窗口的划分都是在分区内部!超过分区大小就无效了

    可以看到如果不指定ROWS BETWEEN,默认统计窗口为从起点到当前行;

    关键是理解 ROWS BETWEEN 含义,也叫做window子句:

    • PRECEDING:往前

    • FOLLOWING:往后

    • CURRENT ROW:当前行

    • UNBOUNDED:无边界,UNBOUNDED PRECEDING 表示从最前面的起点开始, UNBOUNDED FOLLOWING:表示到最后面的终点

    其他AVG,MIN,MAX,和SUM用法一样

    如果不加ORDER BY 会怎样

    select cookie_id,create_time,pv,
    sum(pv) over(PARTITION BY cookie_id) as pv1 
    FROM test;

    结果

    +------------+--------------+-----+------+--+
    | cookie_id  | create_time  | pv  | pv1  |
    +------------+--------------+-----+------+--+
    | a          | 2017-12-01   | 3   | 3    |
    | b          | 2017-12-02   | 3   | 3    |
    | cookie1    | 2017-12-07   | 7   | 23   |
    | cookie1    | 2017-12-06   | 6   | 23   |
    | cookie1    | 2017-12-05   | 1   | 23   |
    | cookie1    | 2017-12-04   | 3   | 23   |
    | cookie1    | 2017-12-03   | 2   | 23   |
    | cookie1    | 2017-12-02   | 4   | 23   |
    | cookie2    | 2017-12-04   | 2   | 3    |
    | cookie2    | 2017-12-02   | 1   | 3    |
    | cookie3    | 2017-12-03   | 5   | 12   |
    | cookie3    | 2017-12-06   | 7   | 12   |
    +------------+--------------+-----+------+--+

    可以看到,如果没有order by,不仅分区内没有排序,sum()计算的pv也是整个分区的pv

    注:max()函数无论有没有order by 都是计算整个分区的最大值

    ROW_NUMBER 函数

    ROW_NUMBER() 从1开始,按照顺序,生成分组内记录的序列

    ROW_NUMBER() 的应用场景非常多,比如获取分组内排序Top N的记录、获取一个session中的第一条refer等。

    SELECT cookie_id,create_time,pv,
    ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rn  
    FROM test;

    结果

    +------------+--------------+-----+-----+--+
    | cookie_id  | create_time  | pv  | rn  |
    +------------+--------------+-----+-----+--+
    | a          | 2017-12-01   | 3   | 1   |
    | b          | 2017-12-02   | 3   | 1   |
    | cookie1    | 2017-12-07   | 7   | 1   |
    | cookie1    | 2017-12-06   | 6   | 2   |
    | cookie1    | 2017-12-02   | 4   | 3   |
    | cookie1    | 2017-12-04   | 3   | 4   |
    | cookie1    | 2017-12-03   | 2   | 5   |
    | cookie1    | 2017-12-05   | 1   | 6   |
    | cookie2    | 2017-12-04   | 2   | 1   |
    | cookie2    | 2017-12-02   | 1   | 2   |
    | cookie3    | 2017-12-06   | 7   | 1   |
    | cookie3    | 2017-12-03   | 5   | 2   |
    +------------+--------------+-----+-----+--+

    RANK 和 DENSE_RANK 函数

    RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位 DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位

    我们把 rankdense_rankrow_number三者对比,这样比较清晰:

    SELECT cookie_id,create_time,pv,
    RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rank_res,
    DENSE_RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS dense_rank_res,
    ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS row_number_res
    FROM test;

    结果:

    1. 为了突出区别,对test表做了一点调整,将数据cookie1,2017-12-02,4改为了cookie1,2017-12-02,3

    2. 主要看cookie1,2017-12-04和2017-12-02的pv都是3

    +------------+--------------+-----+-----------+-----------------+-----------------+--+
    | cookie_id  | create_time  | pv  | rank_res  | dense_rank_res  | row_number_res  |
    +------------+--------------+-----+-----------+-----------------+-----------------+--+
    | a          | 2017-12-01   | 3   | 1         | 1               | 1               |
    | b          | 2017-12-02   | 3   | 1         | 1               | 1               |
    | cookie1    | 2017-12-07   | 7   | 1         | 1               | 1               |
    | cookie1    | 2017-12-06   | 6   | 2         | 2               | 2               |
    | cookie1    | 2017-12-04   | 3   | 3         | 3               | 3               |
    | cookie1    | 2017-12-02   | 3   | 3         | 3               | 4               |
    | cookie1    | 2017-12-03   | 2   | 5         | 4               | 5               |
    | cookie1    | 2017-12-05   | 1   | 6         | 5               | 6               |
    | cookie2    | 2017-12-04   | 2   | 1         | 1               | 1               |
    | cookie2    | 2017-12-02   | 1   | 2         | 2               | 2               |
    | cookie3    | 2017-12-06   | 7   | 1         | 1               | 1               |
    | cookie3    | 2017-12-03   | 5   | 2         | 2               | 2               |
    +------------+--------------+-----+-----------+-----------------+-----------------+--+

    LAG 和 LEAD 函数

    LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值

    第一个参数为列名,第二个参数为往上第n行(默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

    SELECT cookie_id,create_time,
    LAG(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1,
    LAG(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1_with_default
    FROM test;

    结果:

    +------------+--------------+-------------+--------------------+--+
    | cookie_id  | create_time  |    lag1     | lag1_with_default  |
    +------------+--------------+-------------+--------------------+--+
    | a          | 2017-12-01   | NULL        | 1970-01-01         |
    | b          | 2017-12-02   | NULL        | 1970-01-01         |
    | cookie1    | 2017-12-02   | NULL        | 1970-01-01         |
    | cookie1    | 2017-12-03   | 2017-12-02  | 2017-12-02         |
    | cookie1    | 2017-12-04   | 2017-12-03  | 2017-12-03         |
    | cookie1    | 2017-12-05   | 2017-12-04  | 2017-12-04         |
    | cookie1    | 2017-12-06   | 2017-12-05  | 2017-12-05         |
    | cookie1    | 2017-12-07   | 2017-12-06  | 2017-12-06         |
    | cookie2    | 2017-12-02   | NULL        | 1970-01-01         |
    | cookie2    | 2017-12-04   | 2017-12-02  | 2017-12-02         |
    | cookie3    | 2017-12-03   | NULL        | 1970-01-01         |
    | cookie3    | 2017-12-06   | 2017-12-03  | 2017-12-03         |
    +------------+--------------+-------------+--------------------+--+

    LEAD 函数则与 LAG 相反: LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值

    第一个参数为列名,第二个参数为往下第n行(默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

    SELECT cookie_id,create_time,
    LEAD(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag2,
    LEAD(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag2_with_default
    FROM test;

    结果:

    +------------+--------------+-------------+--------------------+--+
    | cookie_id  | create_time  |    lag2     | lag2_with_default  |
    +------------+--------------+-------------+--------------------+--+
    | a          | 2017-12-01   | NULL        | 1970-01-01         |
    | b          | 2017-12-02   | NULL        | 1970-01-01         |
    | cookie1    | 2017-12-02   | 2017-12-03  | 2017-12-03         |
    | cookie1    | 2017-12-03   | 2017-12-04  | 2017-12-04         |
    | cookie1    | 2017-12-04   | 2017-12-05  | 2017-12-05         |
    | cookie1    | 2017-12-05   | 2017-12-06  | 2017-12-06         |
    | cookie1    | 2017-12-06   | 2017-12-07  | 2017-12-07         |
    | cookie1    | 2017-12-07   | NULL        | 1970-01-01         |
    | cookie2    | 2017-12-02   | 2017-12-04  | 2017-12-04         |
    | cookie2    | 2017-12-04   | NULL        | 1970-01-01         |
    | cookie3    | 2017-12-03   | 2017-12-06  | 2017-12-06         |
    | cookie3    | 2017-12-06   | NULL        | 1970-01-01         |
    +------------+--------------+-------------+--------------------+--+

     

    GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

    准备数据:

    +----------+-------------+-----------+--+
    |  month   |     day     | cookieid  |
    +----------+-------------+-----------+--+
    | 2015-03  | 2015-03-10  | cookie1   |
    | 2015-03  | 2015-03-10  | cookie5   |
    | 2015-03  | 2015-03-12  | cookie7   |
    | 2015-04  | 2015-04-12  | cookie3   |
    | 2015-04  | 2015-04-13  | cookie2   |
    | 2015-04  | 2015-04-13  | cookie4   |
    | 2015-04  | 2015-04-16  | cookie4   |
    | 2015-03  | 2015-03-10  | cookie2   |
    | 2015-03  | 2015-03-10  | cookie3   |
    | 2015-04  | 2015-04-12  | cookie5   |
    | 2015-04  | 2015-04-13  | cookie6   |
    | 2015-04  | 2015-04-15  | cookie3   |
    | 2015-04  | 2015-04-15  | cookie2   |
    | 2015-04  | 2015-04-16  | cookie1   |
    +----------+-------------+-----------+--+

    表:

    create table test2(
    month STRING,
    day STRING, 
    cookieid STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    GROUPING SETS

    在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL

    SELECT month
        ,day
        ,COUNT(DISTINCT cookieid) AS uv
    FROM test2
    GROUP BY month,day 
    GROUPING SETS(month, day);

    结果:

    +----------+-------------+-----+--+
    |  month   |     day     | uv  |
    +----------+-------------+-----+--+
    | NULL     | 2015-03-10  | 4   |
    | NULL     | 2015-03-12  | 1   |
    | NULL     | 2015-04-12  | 2   |
    | NULL     | 2015-04-13  | 3   |
    | NULL     | 2015-04-15  | 2   |
    | NULL     | 2015-04-16  | 2   |
    | 2015-03  | NULL        | 5   |
    | 2015-04  | NULL        | 6   |
    +----------+-------------+-----+--+

    等价于

    SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY month 
    UNION ALL 
    SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY day;

    增强版

    SELECT 
        month,
        day,
        COUNT(DISTINCT cookieid) AS uv,
        GROUPING__ID 
    FROM test2 
        GROUP BY month,day 
        GROUPING SETS (month,day,(month,day)) 
        ORDER BY GROUPING__ID;

    其中的 GROUPING__ID,表示结果属于哪一个分组集合。

    结果:

    +----------+-------------+-----+---------------+--+
    |  month   |     day     | uv  | grouping__id  |
    +----------+-------------+-----+---------------+--+
    | 2015-04  | NULL        | 6   | 1             |
    | 2015-03  | NULL        | 5   | 1             |
    | NULL     | 2015-03-10  | 4   | 2             |
    | NULL     | 2015-04-16  | 2   | 2             |
    | NULL     | 2015-04-15  | 2   | 2             |
    | NULL     | 2015-04-13  | 3   | 2             |
    | NULL     | 2015-04-12  | 2   | 2             |
    | NULL     | 2015-03-12  | 1   | 2             |
    | 2015-04  | 2015-04-16  | 2   | 3             |
    | 2015-04  | 2015-04-12  | 2   | 3             |
    | 2015-04  | 2015-04-13  | 3   | 3             |
    | 2015-03  | 2015-03-12  | 1   | 3             |
    | 2015-03  | 2015-03-10  | 4   | 3             |
    | 2015-04  | 2015-04-15  | 2   | 3             |
    +----------+-------------+-----+---------------+--+

    等价于

    SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test2 GROUP BY month 
    UNION ALL 
    SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test2 GROUP BY day
    UNION ALL 
    SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test2 GROUP BY month,day;

    CUBE

    根据GROUP BY的维度的所有组合进行聚合。

    SELECT 
        month,
        day,
        COUNT(DISTINCT cookieid) AS uv,
        GROUPING__ID 
    FROM test2 
        GROUP BY month,day 
        WITH CUBE 
        ORDER BY GROUPING__ID;

    结果

    +----------+-------------+-----+---------------+--+
    |  month   |     day     | uv  | grouping__id  |
    +----------+-------------+-----+---------------+--+
    | NULL     | NULL        | 7   | 0             |
    | 2015-03  | NULL        | 5   | 1             |
    | 2015-04  | NULL        | 6   | 1             |
    | NULL     | 2015-04-16  | 2   | 2             |
    | NULL     | 2015-04-15  | 2   | 2             |
    | NULL     | 2015-04-13  | 3   | 2             |
    | NULL     | 2015-04-12  | 2   | 2             |
    | NULL     | 2015-03-12  | 1   | 2             |
    | NULL     | 2015-03-10  | 4   | 2             |
    | 2015-04  | 2015-04-12  | 2   | 3             |
    | 2015-04  | 2015-04-16  | 2   | 3             |
    | 2015-03  | 2015-03-12  | 1   | 3             |
    | 2015-03  | 2015-03-10  | 4   | 3             |
    | 2015-04  | 2015-04-15  | 2   | 3             |
    | 2015-04  | 2015-04-13  | 3   | 3             |
    +----------+-------------+-----+---------------+--+

    等价于

    SELECT NULL as month,NULL as day,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM test2
    UNION ALL 
    SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test2 GROUP BY month 
    UNION ALL 
    SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test2 GROUP BY day
    UNION ALL 
    SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test2 GROUP BY month,day;

    ROLLUP

    是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。

    比如,以month维度进行层级聚合:

    SELECT 
        month,
        day,
        COUNT(DISTINCT cookieid) AS uv,
        GROUPING__ID  
    FROM test2 
        GROUP BY month,day
        WITH ROLLUP 
        ORDER BY GROUPING__ID;

    结果:

    +----------+-------------+-----+---------------+--+
    |  month   |     day     | uv  | grouping__id  |
    +----------+-------------+-----+---------------+--+
    | NULL     | NULL        | 7   | 0             |
    | 2015-04  | NULL        | 6   | 1             |
    | 2015-03  | NULL        | 5   | 1             |
    | 2015-04  | 2015-04-16  | 2   | 3             |
    | 2015-04  | 2015-04-15  | 2   | 3             |
    | 2015-04  | 2015-04-13  | 3   | 3             |
    | 2015-04  | 2015-04-12  | 2   | 3             |
    | 2015-03  | 2015-03-12  | 1   | 3             |
    | 2015-03  | 2015-03-10  | 4   | 3             |
    +----------+-------------+-----+---------------+--+

    实现的上钻过程:月天的UV->月的UV->总UV

    如果把month和day调换顺序,则以day维度进行层级聚合:

    SELECT 
        day,
        month,
        COUNT(DISTINCT cookieid) AS uv,
        GROUPING__ID  
    FROM test2 
        GROUP BY day,month 
        WITH ROLLUP 
        ORDER BY GROUPING__ID;

    结果:

    +-------------+----------+-----+---------------+--+
    |     day     |  month   | uv  | grouping__id  |
    +-------------+----------+-----+---------------+--+
    | NULL        | NULL     | 7   | 0             |
    | 2015-04-12  | NULL     | 2   | 1             |
    | 2015-04-15  | NULL     | 2   | 1             |
    | 2015-03-12  | NULL     | 1   | 1             |
    | 2015-04-16  | NULL     | 2   | 1             |
    | 2015-03-10  | NULL     | 4   | 1             |
    | 2015-04-13  | NULL     | 3   | 1             |
    | 2015-04-16  | 2015-04  | 2   | 3             |
    | 2015-04-15  | 2015-04  | 2   | 3             |
    | 2015-04-13  | 2015-04  | 3   | 3             |
    | 2015-03-12  | 2015-03  | 1   | 3             |
    | 2015-03-10  | 2015-03  | 4   | 3             |
    | 2015-04-12  | 2015-04  | 2   | 3             |
    +-------------+----------+-----+---------------+--+

    实现的上钻过程:天月的UV->天的UV->总UV

     


     

    参考:http://lxw1234.com/archives/category/hive

    参考:https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup

    参考:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

     

  • 相关阅读:
    Corosync+Pacemaker+DRBD+Mysql高可用HA配置
    高可用集群corosync+pacemaker之crmsh使用
    nfs+DRBD+corosync+pacemaker 实现高可用(ha)的nfs集群
    nginx匹配
    2011年的最后一篇
    Python自然语言处理学习笔记(66):7.7 小结
    Python自然语言处理学习笔记(65):7.6 关系抽取
    Python自然语言处理学习笔记(64): 7.5 命名实体识别
    Python自然语言处理学习笔记(67):7.8 扩展阅读
    Python自然语言处理学习笔记(68):7.9 练习
  • 原文地址:https://www.cnblogs.com/zz-ksw/p/12917693.html
Copyright © 2011-2022 走看看