Hive中提供了越来越多的分析函数,用于完成负责的统计分析。抽时间将所有的分析窗口函数理一遍,将陆续发布。
今天先看几个基础的,SUM、AVG、MIN、MAX。
用于实现分组内所有和连续累积的统计。
1.数据准备
本地文件上的文件
(my_python_env)[root@hadoop26 data]# cat cookie cookie1,2015-04-10,1 cookie1,2015-04-11,5 cookie1,2015-04-12,7 cookie1,2015-04-13,3 cookie1,2015-04-14,2 cookie1,2015-04-15,4 cookie1,2015-04-16,4
吧本地文件系统上的文件上传到hdfs上,并为此数据建立外部表
(my_python_env)[root@hadoop26 data]# hadoop fs -put cookie /cookie hive-> create external table cookie ( cookieid string, createtime string, pv int ) row format delimited fields terminated by ',' stored as textfile location '/cookie';
验证表中是否有数据:
hive (default)> select * from cookie; OK cookie.cookieid cookie.createtime cookie.pv cookie1 2015-04-10 1 cookie1 2015-04-11 5 cookie1 2015-04-12 7 cookie1 2015-04-13 3 cookie1 2015-04-14 2 cookie1 2015-04-15 4 cookie1 2015-04-16 4 Time taken: 0.098 seconds, Fetched: 7 row(s)
窗口统计:
select cookieid,createtime,pv, sum(pv) over(partition by cookieid order by createtime) as pv1,--默认为起点到当前行 sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2,--current row代表当前行 sum(pv) over(partition by cookieid) as pv3,--如果不指定rows between,统计所有行 sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,--当前行+往前3行 sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5,--当前行+往前3行+往后1行 sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6 from cookie;
统计结果:
OK cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6 cookie1 2015-04-16 4 26 26 26 13 13 4 cookie1 2015-04-15 4 22 22 26 16 20 8 cookie1 2015-04-14 2 18 18 26 17 21 10 cookie1 2015-04-13 3 16 16 26 16 18 13 cookie1 2015-04-12 7 13 13 26 13 16 20 cookie1 2015-04-11 5 6 6 26 6 13 25 cookie1 2015-04-10 1 1 1 26 1 6 26 Time taken: 49.42 seconds, Fetched: 7 row(s)
注意,结果和ORDER BY相关,默认为升序
如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;
关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
–其他AVG,MIN,MAX,和SUM用法一样。
转自:http://lxw1234.com/archives/2015/04/176.htm