zoukankan html css js c++ java

Hive新功能 Cube, Rollup介绍

说明：Hive之cube、rollup，还有窗口函数，在传统关系型数据（Oracle、sqlserver）中都是有的，用法都很相似。

GROUPING SETS

GROUPING SETS作为GROUP BY的子句，允许开发人员在GROUP BY语句后面指定多个统计选项，可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来，下面是几个实例可以帮助我们了解，

以acorn_3g.test_xinyan_reg为例：

[dp@YZSJHL19-87 xjob]$ hive -e "use acorn_3g;desc test_xinyan_reg;"
user_id                 bigint                  None                 
device_id               int                     None   手机，平板             
os_id                   int                     None   操作系统类型             
app_id                  int                     None   手机app_id             
client_version          string                  None   客户端版本             
from_id                 int                     None   四级渠道

几个demo帮助大家了解：

grouping sets语句	等价hive语句
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id))	SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id))	SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id))	SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id UNION ALL SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),())	SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id UNION ALL SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id UNION ALL SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id UNION ALL SELECT null,null,null,count(user_id) FROM test_xinyan_reg

CUBE函数

cube简称数据魔方，可以实现hive多个任意维度的查询，cube(a,b,c)则首先会对(a,b,c)进行group by，然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by，他会统计所选列中值的所有组合的聚合

select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id with cube;

手工实现需要写的hql语句（写个程序自己生成的，手写累死）：

SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version
UNION ALL
SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version
UNION ALL
SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version
UNION ALL
SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version
UNION ALL
SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version
UNION ALL
SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version
UNION ALL
SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version
UNION ALL
SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version
UNION ALL
SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id
UNION ALL
SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id
UNION ALL
SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id
UNION ALL
SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id
UNION ALL
SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id
UNION ALL
SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id
UNION ALL
SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id
UNION ALL
SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id
UNION ALL
SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id
UNION ALL
SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id
UNION ALL
SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id
UNION ALL
SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id
UNION ALL
SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id
UNION ALL
SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id
UNION ALL
SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg

看着很蛋疼是不是，体会到cube的强大了吗！(低版本hive可以通过union all方式解决，算是没有办法的办法)

ROLL UP函数

rollup可以实现从右到做递减多级的统计，显示统计某一层次结构的聚合。

 select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id with rollup;

等价以下sql语句：

 select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id 
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());

Grouping_ID函数

当我们没有统计某一列时，它的值显示为null，这可能与列本身就有null值冲突，这就需要一种方法区分是没有统计还是值本来就是null。（写一个排列组合的算法，就马上理解了，grouping_id其实就是所统计各列二进制和）

直接拿官方文档一个例子，O(∩_∩)O哈哈~

Column1 (key)	Column2 (value)
1	NULL
1	1
2	2
3	3
3	NULL
4	5

hql统计：

  SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP

统计结果如下：


NULL	NULL	0 00	6
1	NULL	1 10	2
1	NULL	3 11	1
1	1	3 11	1
2	NULL	1 10	1
2	2	3 11	1
3	NULL	1 10	2
3	NULL	3 11	1
3	3	3 11	1
4	NULL	1 10	1
4	5	3 11	1

GROUPING__ID转变为二进制，如果对应位上有值为null，说明这列本身值就是null。（通过类DataFilterNull.py 扫描，可以筛选过滤掉列中null、“”统计结果),

窗口函数

hive窗口函数，感觉大部分都是在模仿oracle，有对oracle熟悉的，应该看下就知道怎么用。

具体参见：http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html

主要围绕..over( partitoin by ..) ..

3g业务求新增激活时候，有的一部手机，可能注册多个渠道，这时候就要按时间顺序求第一个：

select f.udid,f.from_id,f.ins_date 
from (select /* +MAPJOIN(u) */ u.device_id as udid ,g.device_id as gdid,u.from_id,u.ins_date,row_number() over (partition by u.device_id order by u.ins_date asc) as row_number 
        from user_device_info u  
        left outer join  (select device_id from 3g_device_id where log_date<'2013-07-25') g  on ( u.device_id = g.device_id ) 
        where u.log_date='2013-07-25' and u.device_id is not null and u.device_id <> '') f  
where f.gdid is null and row_number=1

参考资料

apache hive窗口函数官方介绍：http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html

apache hive官方：cube、rollup函数介绍：https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup

oracle窗口函数介绍：http://www.blogjava.net/pengpenglin/archive/2012/04/12/211334.html

参考地址：http://blog.csdn.net/moon_yang_bj/article/details/17200367

查看全文

相关阅读:
组合和封装
 面向对象之继承与派生
 实例化产生对象和绑定关系
 面向对象程序设计和类
 常用模块（hashlib、suprocess、configparser）
常用模块re模块(正则表达式)
包及常用模块（time、datetime、random、sys）
python内置函数大全
 软件开发规范及常用模块
 [NOI Online 提高组]序列 (并查集+二分图判定)

原文地址：https://www.cnblogs.com/xiohao/p/6404000.html