  • Hive functions: window functions and analytic functions

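    Hive provides many window functions and analytic functions. They are mainly used in the following scenarios:

    (1) Sorting within partitions
    (2) Dynamic GROUP BY
    (3) Top N
    (4) Cumulative calculations
    (5) Hierarchical queries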

    1. Create a Hive table and load data

    Create the table

    hive (hive_explode)> create table order_detail(
                        user_id string,device_id string,user_type string,price double,sales int
                        )row format delimited fields terminated by ',';

    Prepare the data file

    cd /export/servers/hivedatas
    vim order_detail
    zhangsan,1,new,67.1,2
    lisi,2,old,43.32,1
    wagner,3,new,88.88,3
    liliu,4,new,66.0,1
    qiuba,5,new,54.32,1
    wangshi,6,old,77.77,2
    liwei,7,old,88.44,3
    wutong,8,new,56.55,6
    lilisi,9,new,88.88,5
    qishili,10,new,66.66,5

    Load the data

    hive (hive_explode)> load data local inpath '/export/servers/hivedatas/order_detail' into table order_detail;

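    2. Window functions

    FIRST_VALUE: the first value in the partition after sorting, up to the current row.

    LAST_VALUE: the last value in the partition after sorting, up to the current row.

    LEAD(col, n, DEFAULT): returns the value from the n-th row below the current row within the window. The first argument is the column name, the second is the number of rows to look ahead (optional, default 1), and the third is the default value returned when the n-th row below falls outside the window (NULL if not specified).

    LAG(col, n, DEFAULT): the opposite of LEAD; returns the value from the n-th row above the current row within the window. The first argument is the column name, the second is the number of rows to look back (optional, default 1), and the third is the default value returned when the n-th row above falls outside the window (NULL if not specified).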
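
    The LEAD and LAG definitions above can be sketched in plain Python. This is only an illustration of the semantics over one already-sorted partition; the helper names lag and lead are made up for this sketch and are not part of any Hive API.

```python
def lag(values, n=1, default=None):
    """LAG(col, n, default) over one ordered partition (None stands for NULL)."""
    return [values[i - n] if i - n >= 0 else default
            for i in range(len(values))]


def lead(values, n=1, default=None):
    """LEAD(col, n, default) over one ordered partition (None stands for NULL)."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


# sales of the "old" users from order_detail, ordered by sales ascending
old_sales = [1, 2, 3]
print(lag(old_sales))         # [None, 1, 2]
print(lead(old_sales, 1, 0))  # [2, 3, 0]
print(lead(old_sales, 2))     # [3, None, None]
```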

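    3. The OVER clause

    1. It can be used with the standard aggregate functions COUNT, SUM, MIN, MAX and AVG.

    2. It can take a PARTITION BY clause with one or more partitioning columns of any primitive data type.

    3. It can take PARTITION BY together with ORDER BY, with one or more partitioning and/or ordering columns of any data type.

    4. It can take a window specification. The following formats are supported: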

    (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

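    When ORDER BY is specified without a window clause, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

    When both ORDER BY and the window clause are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.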
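
    The practical difference between the default RANGE frame and an explicit ROWS frame shows up when the ordering column has ties. The following Python sketch (helper names are illustrative only, and the input is assumed to be sorted ascending) computes a running sum both ways for the sales of the "new" users:

```python
def running_sum_range(sorted_vals):
    """SUM(v) OVER (ORDER BY v) with the default RANGE frame:
    rows with equal v are peers, so they all see the same total."""
    return [sum(x for x in sorted_vals if x <= v) for v in sorted_vals]


def running_sum_rows(sorted_vals):
    """SUM(v) OVER (ORDER BY v ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):
    the frame ends at the physical current row, so ties get different totals."""
    out, total = [], 0
    for v in sorted_vals:
        total += v
        out.append(total)
    return out


new_sales = [1, 1, 2, 3, 5, 5, 6]    # "new" users, ascending
print(running_sum_range(new_sales))  # [2, 2, 4, 7, 17, 17, 23]
print(running_sum_rows(new_sales))   # [1, 2, 4, 7, 12, 17, 23]
```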

    The OVER clause also supports the following functions, but they cannot be combined with a window specification:

    Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.

    Lead and Lag functions.

    Computing sales totals with window functions

    Use sum ... over to total the sales:

    hive (hive_explode)> select
    user_id,
    user_type,
    sales,
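    -- all rows in the partition
    sum(sales) over(partition by user_type) AS sales_1 ,
    -- no partition: running total over the whole table ordered by user_type; rows with the same user_type are peers and share one total
    sum(sales) over(order by user_type) AS sales_2 ,
    -- default frame, from the start to the current row; rows with equal sales get the same running total
    sum(sales) over(partition by user_type order by sales asc) AS sales_3,
    -- from the start to the current row, counted by rows; differs from sales_3, and the totals of tied rows depend on the order they happen to be sorted in
    sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) AS sales_4,
    -- current row + the 3 preceding rows
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) AS sales_5,
    -- current row + the 3 preceding rows + the 1 following row
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) AS sales_6,
    -- current row + all following rows
    sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) AS sales_7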
    from
    order_detail
    order by
        user_type,
        sales,
        user_id;

    The result is as follows:

    +-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+
    |  user_id  | user_type  | sales  | sales_1  | sales_2  | sales_3  | sales_4  | sales_5  | sales_6  | sales_7  |
    +-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+
    | liliu     | new        | 1      | 23       | 23       | 2        | 2        | 2        | 4        | 22       |
    | qiuba     | new        | 1      | 23       | 23       | 2        | 1        | 1        | 2        | 23       |
    | zhangsan  | new        | 2      | 23       | 23       | 4        | 4        | 4        | 7        | 21       |
    | wagner    | new        | 3      | 23       | 23       | 7        | 7        | 7        | 12       | 19       |
    | lilisi    | new        | 5      | 23       | 23       | 17       | 17       | 15       | 21       | 11       |
    | qishili   | new        | 5      | 23       | 23       | 17       | 12       | 11       | 16       | 16       |
    | wutong    | new        | 6      | 23       | 23       | 23       | 23       | 19       | 19       | 6        |
    | lisi      | old        | 1      | 6        | 29       | 1        | 1        | 1        | 3        | 6        |
    | wangshi   | old        | 2      | 6        | 29       | 3        | 3        | 3        | 6        | 5        |
    | liwei     | old        | 3      | 6        | 29       | 6        | 6        | 6        | 6        | 3        |
    +-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+

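    Notes:

    The result depends on ORDER BY, which sorts in ascending order by default.

    If ROWS BETWEEN is not specified but ORDER BY is, the frame defaults to the span from the start of the partition to the current row.

    If ORDER BY is not specified, all values in the partition are summed.

    The key is to understand ROWS BETWEEN, also called the WINDOW clause:

    PRECEDING: rows before the current row

    FOLLOWING: rows after the current row

    CURRENT ROW: the current row

    UNBOUNDED: no bound (the start or the end of the partition)

    UNBOUNDED PRECEDING: from the first row of the partition

    UNBOUNDED FOLLOWING: to the last row of the partition

    The other aggregates COUNT, AVG, MIN and MAX are used the same way as SUM.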
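
    To make the ROWS BETWEEN keywords concrete, here is a small Python sketch of a sum over a ROWS frame. It assumes one partition that is already sorted, covers only frames of the form "k PRECEDING ... k FOLLOWING" around the current row, and uses None for UNBOUNDED; the function name sum_rows_between is hypothetical. The inputs reproduce the sales_5, sales_6 and sales_7 columns for the "new" users.

```python
def sum_rows_between(vals, preceding=None, following=0):
    """Sum over a ROWS frame around each row.
    preceding / following are row counts; None means UNBOUNDED on that side."""
    n = len(vals)
    out = []
    for i in range(n):
        lo = 0 if preceding is None else max(0, i - preceding)
        hi = n - 1 if following is None else min(n - 1, i + following)
        out.append(sum(vals[lo:hi + 1]))
    return out


new_sales = [1, 1, 2, 3, 5, 5, 6]            # "new" users, ascending
print(sum_rows_between(new_sales, 3, 0))     # [1, 2, 4, 7, 11, 15, 19]
print(sum_rows_between(new_sales, 3, 1))     # [2, 4, 7, 12, 16, 21, 19]
print(sum_rows_between(new_sales, 0, None))  # [23, 22, 21, 19, 16, 11, 6]
```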

    Getting the first and last value in a partition: first_value and last_value

    Use first_value and last_value to get the first and last values within each partition:

    select
        user_id,
        user_type,
        ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num, 
        first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
        first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
        last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
        last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
    from
        order_detail;
    +-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
    |  user_id  | user_type  | row_num  | max_sales_user  | min_sales_user  | curr_last_min_user  | curr_last_max_user  |
    +-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
    | wutong    | new        | 7        | wutong          | qiuba           | wutong              | wutong              |
    | lilisi    | new        | 6        | wutong          | qiuba           | qishili             | lilisi              |
    | qishili   | new        | 5        | wutong          | qiuba           | qishili             | lilisi              |
    | wagner    | new        | 4        | wutong          | qiuba           | wagner              | wagner              |
    | zhangsan  | new        | 3        | wutong          | qiuba           | zhangsan            | zhangsan            |
    | liliu     | new        | 2        | wutong          | qiuba           | qiuba               | liliu               |
    | qiuba     | new        | 1        | wutong          | qiuba           | qiuba               | liliu               |
    | liwei     | old        | 3        | liwei           | lisi            | liwei               | liwei               |
    | wangshi   | old        | 2        | liwei           | lisi            | wangshi             | wangshi             |
    | lisi      | old        | 1        | liwei           | lisi            | lisi                | lisi                |
    +-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
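
    In the result above, last_value mostly returns the current row (or its last peer) rather than the last row of the partition. That follows from the default frame, which stops at the current row's peers when ORDER BY is given. A rough Python sketch of both behaviours, with hypothetical helper names and (user_id, sales) pairs assumed to be sorted by sales:

```python
def last_value_default(rows):
    """last_value(user_id) OVER (ORDER BY sales) with the default RANGE frame:
    the frame ends at the last row whose sales equals the current row's sales."""
    out = []
    for _, sales in rows:
        peers = [uid for uid, s in rows if s == sales]
        out.append(peers[-1])
    return out


def last_value_full(rows):
    """Same function with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:
    the frame is the whole partition, so every row sees the partition's last row."""
    return [rows[-1][0]] * len(rows)


old_asc = [("lisi", 1), ("wangshi", 2), ("liwei", 3)]
print(last_value_default(old_asc))  # ['lisi', 'wangshi', 'liwei']
print(last_value_full(old_asc))     # ['liwei', 'liwei', 'liwei']
```

    If the real last row of the partition is wanted, spelling out the full frame in the OVER clause is one way to get it.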

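    4. Analytic functions

    1. ROW_NUMBER():

    Generates a sequence number for the rows in a partition, starting from 1 and following the sort order. For example, ordering by pv descending yields each day's pv rank within the partition. ROW_NUMBER() has many uses, such as fetching the top-ranked record in each partition or the first refer in a session.

    2. RANK():

    Ranks each row within its partition; tied rows share a rank and leave gaps in the numbering.

    3. DENSE_RANK():

    Ranks each row within its partition; tied rows share a rank and leave no gaps in the numbering.

    4. CUME_DIST:

    (number of rows with a value less than or equal to the current value) / (total number of rows in the partition). For example, the proportion of people whose salary is less than or equal to the current salary.

    5. PERCENT_RANK:

    (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)

    6. NTILE(n):

    Splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows cannot be divided evenly, the extra rows go to the leading buckets. NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.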
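
    The CUME_DIST and PERCENT_RANK formulas in the list above can be checked with a short Python sketch. It works on the values of a single partition sorted ascending; the function names are illustrative, and returning 0.0 for a one-row partition is an assumption made here to avoid dividing by zero.

```python
def cume_dist(sorted_vals):
    """(rows with value <= current value) / (rows in the partition)."""
    n = len(sorted_vals)
    return [sum(1 for x in sorted_vals if x <= v) / n for v in sorted_vals]


def percent_rank(sorted_vals):
    """(RANK - 1) / (rows in the partition - 1).
    RANK - 1 equals the number of rows strictly before the current value."""
    n = len(sorted_vals)
    if n == 1:
        return [0.0]  # assumption: a single-row partition gets 0.0
    return [sum(1 for x in sorted_vals if x < v) / (n - 1) for v in sorted_vals]


print(cume_dist([1, 1, 2, 3]))   # [0.5, 0.5, 0.75, 1.0]
print(percent_rank([1, 2, 3]))   # [0.0, 0.5, 1.0]
```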

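    Using RANK, ROW_NUMBER and DENSE_RANK with OVER

    These functions can be used to compute the top N per group.

    Requirement: for each user type, find the N rows with the highest sales. The query below computes the three rankings; filtering on one of them in an outer query (for example rn <= N) yields the top N.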

    select
        user_id,user_type,sales,
        RANK() over (partition by user_type order by sales desc) as r,
        ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
        DENSE_RANK() over (partition by user_type order by sales desc) as dr
    from
        order_detail;  
    +-----------+------------+--------+----+-----+-----+--+
    |  user_id  | user_type  | sales  | r  | rn  | dr  |
    +-----------+------------+--------+----+-----+-----+--+
    | wutong    | new        | 6      | 1  | 1   | 1   |
    | qishili   | new        | 5      | 2  | 2   | 2   |
    | lilisi    | new        | 5      | 2  | 3   | 2   |
    | wagner    | new        | 3      | 4  | 4   | 3   |
    | zhangsan  | new        | 2      | 5  | 5   | 4   |
    | qiuba     | new        | 1      | 6  | 6   | 5   |
    | liliu     | new        | 1      | 6  | 7   | 5   |
    | liwei     | old        | 3      | 1  | 1   | 1   |
    | wangshi   | old        | 2      | 2  | 2   | 2   |
    | lisi      | old        | 1      | 3  | 3   | 3   |
    +-----------+------------+--------+----+-----+-----+--+
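
    The difference between the three rankings comes down to how ties are numbered. A minimal Python sketch, assuming one partition already sorted in the desired order (the rankings helper is hypothetical), reproduces the r, rn and dr columns for the "new" users and then keeps the top 3 by row number:

```python
def rankings(sorted_vals):
    """Return (rank, row_number, dense_rank) for each value of one ordered partition."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for i, v in enumerate(sorted_vals, start=1):
        if v != prev:
            rank = i    # RANK jumps to the row position, leaving gaps after ties
            dense += 1  # DENSE_RANK advances by exactly one
            prev = v
        out.append((rank, i, dense))
    return out


new_desc = [6, 5, 5, 3, 2, 1, 1]  # sales of the "new" users, descending
result = rankings(new_desc)
print(result)
# [(1, 1, 1), (2, 2, 2), (2, 3, 2), (4, 4, 3), (5, 5, 4), (6, 6, 5), (6, 7, 5)]

# top 3 by ROW_NUMBER, the equivalent of filtering on rn <= 3
top3 = [v for v, (_, rn, _) in zip(new_desc, result) if rn <= 3]
print(top3)  # [6, 5, 5]
```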

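    Using NTILE to compute percentiles

    NTILE splits the data into a given number of buckets, which can then be used to select a percentage of the rows.

    Splitting the data into buckets with NTILE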

    select
        user_type,sales,
        -- split each partition into 2 buckets
        NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
        -- split each partition into 3 buckets
        NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
        -- split each partition into 4 buckets
        NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
        -- split all rows into 4 buckets
        NTILE(4) OVER(ORDER BY sales) AS all_nt4
    from
        order_detail
    order by
        user_type,
        sales;

    The result is as follows:

    +------------+--------+------+------+------+----------+--+
    | user_type  | sales  | nt2  | nt3  | nt4  | all_nt4  |
    +------------+--------+------+------+------+----------+--+
    | new        | 1      | 1    | 1    | 1    | 1        |
    | new        | 1      | 1    | 1    | 1    | 1        |
    | new        | 2      | 1    | 1    | 2    | 2        |
    | new        | 3      | 1    | 2    | 2    | 3        |
    | new        | 5      | 2    | 2    | 3    | 4        |
    | new        | 5      | 2    | 3    | 3    | 3        |
    | new        | 6      | 2    | 3    | 4    | 4        |
    | old        | 1      | 1    | 1    | 1    | 1        |
    | old        | 2      | 1    | 2    | 2    | 2        |
    | old        | 3      | 2    | 3    | 3    | 2        |
    +------------+--------+------+------+------+----------+--+
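
    The bucket sizes in the table follow from a simple rule: every bucket gets the same base number of rows, and the leftover rows are handed out one each to the leading buckets. A Python sketch of that rule (the ntile function below is illustrative and only takes the row count, since the bucket number depends on position, not on the value):

```python
def ntile(num_rows, n):
    """Bucket number (1..n) for each of num_rows ordered rows.
    When the rows do not divide evenly, the leading buckets get one extra row."""
    base, extra = divmod(num_rows, n)
    out = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        out.extend([bucket] * size)
    return out


print(ntile(7, 2))  # [1, 1, 1, 1, 2, 2, 2]   nt2 for the 7 "new" rows
print(ntile(7, 3))  # [1, 1, 1, 2, 2, 3, 3]   nt3
print(ntile(7, 4))  # [1, 1, 2, 2, 3, 3, 4]   nt4
print(ntile(3, 4))  # [1, 2, 3]               nt4 for the 3 "old" rows
```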

    Using NTILE to get the user ids in the top 20% by sales

    select
        user_id
    from
    (select  user_id, NTILE(5) OVER(ORDER BY sales desc) AS nt
        from  order_detail
    )A
    where nt=1;
    +----------+--+
    | user_id  |
    +----------+--+
    | wutong   |
    | qishili  |
    +----------+--+

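    5. Enhanced aggregation: Cube, Grouping and Rollup

    These analytic functions are typically used in OLAP, for metrics that cannot simply be summed up and must be computed while drilling up and down across dimensions, for example UV counts by hour, day and month.

    GROUPING SETS

    Aggregates by several dimension combinations in a single GROUP BY query. It is equivalent to a UNION ALL of the GROUP BY results for each combination.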

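    GROUPING__ID indicates which grouping set a result row belongs to. The GROUPING__ID values in the results below come from the Hive version the author used; Hive 2.3.0 changed how GROUPING__ID is computed, so other versions may number the grouping sets differently.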

    Requirement: compute the counts grouped by user_type and by sales separately.

    0: jdbc:hive2://node03:10000>select
        user_type,
        sales,
        count(user_id) as pv,
        GROUPING__ID
    from
        order_detail
    group by
        user_type,sales
    GROUPING SETS(user_type,sales)
    ORDER BY
        GROUPING__ID;

    The result is as follows:

    +------------+--------+-----+---------------+--+
    | user_type  | sales  | pv  | grouping__id  |
    +------------+--------+-----+---------------+--+
    | old        | NULL   | 3   | 1             |
    | new        | NULL   | 7   | 1             |
    | NULL       | 6      | 1   | 2             |
    | NULL       | 5      | 2   | 2             |
    | NULL       | 3      | 2   | 2             |
    | NULL       | 2      | 2   | 2             |
    | NULL       | 1      | 3   | 2             |
    +------------+--------+-----+---------------+--+
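
    The "UNION ALL of several GROUP BY results" reading of GROUPING SETS can be sketched in Python. The snippet below uses the ten order_detail rows from the start of the article; count_by and grouping_sets are hypothetical helpers, None stands for NULL, and GROUPING__ID is left out on purpose because its numbering depends on the Hive version.

```python
from collections import Counter

# (user_id, user_type, sales) rows of order_detail
ORDERS = [
    ("zhangsan", "new", 2), ("lisi", "old", 1), ("wagner", "new", 3),
    ("liliu", "new", 1), ("qiuba", "new", 1), ("wangshi", "old", 2),
    ("liwei", "old", 3), ("wutong", "new", 6), ("lilisi", "new", 5),
    ("qishili", "new", 5),
]


def count_by(rows, keys):
    """COUNT(user_id) grouped by the columns in keys; the other column is None."""
    counts = Counter()
    for _, user_type, sales in rows:
        key = (user_type if "user_type" in keys else None,
               sales if "sales" in keys else None)
        counts[key] += 1
    return [(ut, s, pv) for (ut, s), pv in counts.items()]


def grouping_sets(rows, sets):
    """Concatenate one GROUP BY result per grouping set (the UNION ALL view)."""
    out = []
    for keys in sets:
        out.extend(count_by(rows, keys))
    return out


result = grouping_sets(ORDERS, [("user_type",), ("sales",)])
for row in result:
    print(row)  # (user_type, sales, pv) rows, same content as the table above
```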

    Requirement: compute the counts grouped by user_type, by sales, and by user_type + sales.

    0: jdbc:hive2://node03:10000>select
        user_type,
        sales,
        count(user_id) as pv,
        GROUPING__ID
    from
        order_detail
    group by
        user_type,sales
    GROUPING SETS(user_type,sales,(user_type,sales))
    ORDER BY
        GROUPING__ID;

    The result is as follows:

    +------------+--------+-----+---------------+--+
    | user_type  | sales  | pv  | grouping__id  |
    +------------+--------+-----+---------------+--+
    | old        | NULL   | 3   | 1             |
    | new        | NULL   | 7   | 1             |
    | NULL       | 1      | 3   | 2             |
    | NULL       | 6      | 1   | 2             |
    | NULL       | 5      | 2   | 2             |
    | NULL       | 3      | 2   | 2             |
    | NULL       | 2      | 2   | 2             |
    | old        | 3      | 1   | 3             |
    | old        | 2      | 1   | 3             |
    | old        | 1      | 1   | 3             |
    | new        | 6      | 1   | 3             |
    | new        | 5      | 2   | 3             |
    | new        | 3      | 1   | 3             |
    | new        | 1      | 2   | 3             |
    | new        | 2      | 1   | 3             |
    +------------+--------+-----+---------------+--+

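    6. CUBE and ROLLUP: CUBE aggregates over every combination of the GROUP BY dimensions, ROLLUP over a hierarchical subset of them.

    Aggregating with CUBE

    Requirement: compute the counts with no grouping at all, grouped by user_type, grouped by sales, and grouped by user_type + sales.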

    0: jdbc:hive2://node03:10000>select
        user_type,
        sales,
        count(user_id) as pv,
        GROUPING__ID
    from
        order_detail
    group by
        user_type,sales
    WITH CUBE
    ORDER BY
        GROUPING__ID;
    +------------+--------+-----+---------------+--+
    | user_type  | sales  | pv  | grouping__id  |
    +------------+--------+-----+---------------+--+
    | NULL       | NULL   | 10  | 0             |
    | new        | NULL   | 7   | 1             |
    | old        | NULL   | 3   | 1             |
    | NULL       | 6      | 1   | 2             |
    | NULL       | 5      | 2   | 2             |
    | NULL       | 3      | 2   | 2             |
    | NULL       | 2      | 2   | 2             |
    | NULL       | 1      | 3   | 2             |
    | old        | 3      | 1   | 3             |
    | old        | 2      | 1   | 3             |
    | old        | 1      | 1   | 3             |
    | new        | 6      | 1   | 3             |
    | new        | 5      | 2   | 3             |
    | new        | 3      | 1   | 3             |
    | new        | 2      | 1   | 3             |
    | new        | 1      | 2   | 3             |
    +------------+--------+-----+---------------+--+
     

    Aggregating with ROLLUP

    ROLLUP is a subset of CUBE: it aggregates hierarchically, starting from the leftmost dimension.

    select
        user_type,
        sales,
        count(user_id) as pv,
        GROUPING__ID
    from
        order_detail
    group by
        user_type,sales
    WITH ROLLUP
    ORDER BY
        GROUPING__ID;
    +------------+--------+-----+---------------+--+
    | user_type  | sales  | pv  | grouping__id  |
    +------------+--------+-----+---------------+--+
    | NULL       | NULL   | 10  | 0             |
    | old        | NULL   | 3   | 1             |
    | new        | NULL   | 7   | 1             |
    | old        | 3      | 1   | 3             |
    | old        | 2      | 1   | 3             |
    | old        | 1      | 1   | 3             |
    | new        | 6      | 1   | 3             |
    | new        | 5      | 2   | 3             |
    | new        | 3      | 1   | 3             |
    | new        | 2      | 1   | 3             |
    | new        | 1      | 2   | 3             |
    +------------+--------+-----+---------------+--+
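
    One way to see why ROLLUP is a subset of CUBE is to list the grouping sets each one expands to. The Python sketch below does only that enumeration (it does not run any aggregation), and the function names are made up for illustration: CUBE yields every subset of the GROUP BY columns, ROLLUP only the prefixes of the column list.

```python
from itertools import combinations


def cube_sets(cols):
    """All subsets of the GROUP BY columns: 2**n grouping sets."""
    return [combo
            for r in range(len(cols), -1, -1)
            for combo in combinations(cols, r)]


def rollup_sets(cols):
    """Only the prefixes of the column list: n + 1 grouping sets."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


cols = ["user_type", "sales"]
print(cube_sets(cols))    # [('user_type', 'sales'), ('user_type',), ('sales',), ()]
print(rollup_sets(cols))  # [('user_type', 'sales'), ('user_type',), ()]
```

    The ('sales',) set is the one missing from ROLLUP, which matches the ROLLUP result above having no rows with a NULL user_type and a non-NULL sales.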
  • Original article: https://www.cnblogs.com/lojun/p/13251478.html