hive 窗口函数按照我个人理解,依据hive在map_reduce运算中,通过数据聚合,做好数据清洗和运算,一般在olap(在线数据分析)中事半功倍。
假设以下工作场景:窗口函数的威力你就会见识到:
1、比赛场景中,每场比赛的成绩排名第二位的选手,或者每场比赛成绩排名前五的选手;
2、如果有选手淘汰,每一个选手上次参赛的时间,或者最后第二次参赛的时间;
3、每场比赛,前20%的选手
场景陆续添加中
窗口函数分类
1、SUM,AVG,MIN,MAX
2、NTILE,ROW_NUMBER,RANK,DENSE_RANK
3、CUME_DIST,PERCENT_RANK
4、LAG,LEAD,FIRST_VALUE,LAST_VALUE
5、GROUPING SETS,GROUPING__ID,CUBE,ROLLUP
- RANK() :在分组中排名,相同排名时会留下空位;
- DENSE_RANK() :在分组中排名,相同排名时不会留下空位;
- FIRST_VALUE() :分组内排序取第一个值;
- LAST_VALUE() :分组内排序取最后一个值;
- NTILE(n) :将分组数据按顺序切分成n份,返回当前所在切片;
- ROW_NUMGBER() :在分组中从1开始按序记录序列;
- CUME_DIST() :小于等于当前值的行数 / 分组总行数(百分比);
- PERCENT_RANK() :(分组内的RANK值-1) / (分组内总数-1);
- LAG(col, n, DEFAULT) :在统计窗口内从下往上取第n行的值;
- LEAD(col, n, DEFAULT):在统计窗口内从上往下取第n行的值;
第一步:先理解 partition by字句。
hive 属于分区查询,当然还是在map_reduce运算上做文章。Partition By子句也可以称为查询分区子句,非常类似于Group By,都是将数据按照边界值分组,而Over之前的函数在每一个分组之内进行,如果超出了分组,则函数会重新计算。
大牛案例:按照月份聚合,并非按照name,sum按照每月的总cost求和
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from t_window
name orderdate cost sum_window_0 jack 2015-01-01 10 205 jack 2015-01-08 55 205 tony 2015-01-07 50 205 jack 2015-01-05 46 205 tony 2015-01-04 29 205 tony 2015-01-02 15 205 jack 2015-02-03 23 23 mart 2015-04-13 94 341 jack 2015-04-06 42 341 mart 2015-04-11 75 341 mart 2015-04-09 68 341 mart 2015-04-08 62 341 neil 2015-05-10 12 12
第二步:理解查询语句内部顺序:
- 如果只使用partition by子句,未指定order by的话,我们的聚合是分组内的聚合.
- 使用了order by子句,未使用window子句的情况下,默认从起点到当前行
window子句:
n PRECEDING:往前n行数据
n FOLLOWING:往后n行数据
- CURRENT ROW:当前行
- UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
大牛案例:
select name,orderdate,cost, sum(cost) over() as sample1,--所有行相加 sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加 sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加 sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合 sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合 sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行 sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行 from t_window
查询结果
name orderdate cost sample1 sample2 sample3 sample4 sample5 sample6 sample7
jack 2015-01-01 10 661 176 10 10 10 56 176
jack 2015-01-05 46 661 176 56 56 56 111 166
jack 2015-01-08 55 661 176 111 111 101 124 120
jack 2015-02-03 23 661 176 134 134 78 120 65
jack 2015-04-06 42 661 176 176 176 65 65 42
mart 2015-04-08 62 661 299 62 62 62 130 299
mart 2015-04-09 68 661 299 130 130 130 205 237
mart 2015-04-11 75 661 299 205 205 143 237 169
mart 2015-04-13 94 661 299 299 299 169 169 94
neil 2015-05-10 12 661 92 12 12 12 92 92
neil 2015-06-12 80 661 92 92 92 92 92 80
tony 2015-01-02 15 661 94 15 15 15 44 94
tony 2015-01-04 29 661 94 44 44 44 94 79
第三步:八仙过海,各显神通。
下面是各个窗口函数的骚操作,车速较高,扶好站稳。
1、排名前几位的选手——类似问题,前20%的选手
NTILE,ROW_NUMBER,RANK,DENSE_RANK
- row_number()从1开始,按照顺序,生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列
--rank() 跳跃排序,如果有两个第一级别时,接下来是第三级别
--dense_rank() 连续排序,如果有两个第一级别时,接下来是第二级别
**注意:
rank和dense_rank的区别在于排名相等时会不会留下空位
SELECT cookieid, createtime, pv, RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1, DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 FROM lxw1234 WHERE cookieid = 'cookie1'; 查询结果: cookieid day pv rn1 rn2 rn3 cookie1 2015-04-12 7 1 1 1 cookie1 2015-04-11 5 2 2 2 cookie1 2015-04-15 4 3 3 3 cookie1 2015-04-16 4 3 3 4 cookie1 2015-04-13 3 5 4 5 cookie1 2015-04-14 2 6 5 6
查询前20%的记录
NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。
注意:n必须为int类型。
select * from ( select name,orderdate,cost, ntile(5) over(order by orderdate) sorted from business) t where sorted = 1;
2、上一次记录
LAG,LEAD,FIRST_VALUE,LAST_VALUE
LAG(col,n):往前第n行数据
LEAD(col,n):往后第n行数据
select name,orderdate,cost, lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1, lag(orderdate,2) over (partition by name order by orderdate) as time2 from t_window 查询结果: name orderdate cost time1 time2 jack 2015-01-01 10 1900-01-01 NULL jack 2015-01-05 46 2015-01-01 NULL jack 2015-01-08 55 2015-01-05 2015-01-01 jack 2015-02-03 23 2015-01-08 2015-01-05 jack 2015-04-06 42 2015-02-03 2015-01-08 mart 2015-04-08 62 1900-01-01 NULL mart 2015-04-09 68 2015-04-08 NULL mart 2015-04-11 75 2015-04-09 2015-04-08 mart 2015-04-13 94 2015-04-11 2015-04-09
first_value取分组内排序后,截止到当前行,第一个值
last_value取分组内排序后,截止到当前行,最后一个值
select name,orderdate,cost, first_value(orderdate) over(partition by name order by orderdate) as time1, last_value(orderdate) over(partition by name order by orderdate) as time2 from t_window 查询结果 name orderdate cost time1 time2 jack 2015-01-01 10 2015-01-01 2015-01-01 jack 2015-01-05 46 2015-01-01 2015-01-05 jack 2015-01-08 55 2015-01-01 2015-01-08 jack 2015-02-03 23 2015-01-01 2015-02-03 jack 2015-04-06 42 2015-01-01 2015-04-06 mart 2015-04-08 62 2015-04-08 2015-04-08 mart 2015-04-09 68 2015-04-08 2015-04-09 mart 2015-04-11 75 2015-04-08 2015-04-11 mart 2015-04-13 94 2015-04-08 2015-04-13 neil 2015-05-10 12 2015-05-10 2015-05-10 neil 2015-06-12 80 2015-05-10 2015-06-12
本文参考案例: