zoukankan      html  css  js  c++  java
  • hive grouping sets和GROUPING__ID的用法

    GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

    这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时、天、月的UV数。

    grouping sets根据不同的维度组合进行聚合,等价于将不同维度的group by的结果进行 union all,简单来说就是将多个不同维度的group by逻辑写在了 一个sql中。

    数据准备:

    vim /root/test.txt
    2015-03,2015-03-10,cookie1
    2015-03,2015-03-10,cookie5
    2015-03,2015-03-12,cookie7
    2015-04,2015-04-12,cookie3
    2015-04,2015-04-13,cookie2
    2015-04,2015-04-13,cookie4
    2015-04,2015-04-16,cookie4
    2015-03,2015-03-10,cookie2
    2015-03,2015-03-10,cookie3
    2015-04,2015-04-12,cookie5
    2015-04,2015-04-13,cookie6
    2015-04,2015-04-15,cookie3
    2015-04,2015-04-15,cookie2
    2015-04,2015-04-16,cookie1

    将数据导入到hdfs目录上:

    hdfs dfs -put /root/test.txt /tmp

    创建表

    use test;

    create table cookie5(month string, day string, cookieid string) row format delimited fields terminated by ',';

    load data inpath
    "/tmp/test.txt" into table cookie5;

    select
    * from cookie5;

    开始使用grouping sets

    来条sql语句:

    select 
        month,day,count(cookieid) 
    from cookie5 
        group by month,day 
    grouping sets (month,day);

    查询结果如下:

    上面这个sql等同于多个group by + union all

    select month,NULL as day,count(cookieid) as nums from cookie5 group by month
    union all
    select NULL as month,day,count(cookieid) as nums from cookie5 group by day;

    注意点:使用union和union all必须保证各个select 集合的结果有相同个数的列,并且每个列的类型是一样的。union all的表字段必须匹配,也就是上文的month 需要用 NULL as month 来进行填充。

    结果一致!但是grouping sets速度要比group by + union 快!!!!!

    但是有引出一个问题?为什么grouping sets 要比 group by + union 速度要快?

    首先:用explain来解释下hive执行计划:

    explain select 
        month,day,count(cookieid) 
    from cookie5 
        group by month,day 
    grouping sets (month,day);

    执行计划如下:

    +----------------------------------------------------+--+
    |                      Explain                       |
    +----------------------------------------------------+--+
    | STAGE DEPENDENCIES:                                |
    |   Stage-1 is a root stage                          |
    |   Stage-0 depends on stages: Stage-1               |
    |                                                    |
    | STAGE PLANS:                                       |
    |   Stage: Stage-1                                   |
    |     Map Reduce                                     |
    |       Map Operator Tree:                           |
    |           TableScan                                |
    |             alias: cookie5                         |
    |             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |             Select Operator                        |
    |               expressions: month (type: string), day (type: string), cookieid (type: string) |
    |               outputColumnNames: month, day, cookieid |
    |               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |               Group By Operator                    |
    |                 aggregations: count(cookieid)      |
    |                 keys: month (type: string), day (type: string), '0' (type: string) |
    |                 mode: hash                         |
    |                 outputColumnNames: _col0, _col1, _col2, _col3 |
    |                 Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
    |                 Reduce Output Operator             |
    |                   key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    |                   sort order: +++                  |
    |                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    |                   Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
    |                   value expressions: _col3 (type: bigint) |
    |       Reduce Operator Tree:                        |
    |         Group By Operator                          |
    |           aggregations: count(VALUE._col0)         |
    |           keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
    |           mode: mergepartial                       |
    |           outputColumnNames: _col0, _col1, _col3   |
    |           Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |           pruneGroupingSetId: true                 |
    |           Select Operator                          |
    |             expressions: _col0 (type: string), _col1 (type: string), _col3 (type: bigint) |
    |             outputColumnNames: _col0, _col1, _col2 |
    |             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |             File Output Operator                   |
    |               compressed: false                    |
    |               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |               table:                               |
    |                   input format: org.apache.hadoop.mapred.TextInputFormat |
    |                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    |                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    |                                                    |
    |   Stage: Stage-0                                   |
    |     Fetch Operator                                 |
    |       limit: -1                                    |
    |       Processor Tree:                              |
    |         ListSink    

     再看另外一个:

    explain select month,NULL as day,count(cookieid) as nums from cookie5 group by month 
    union all 
    select NULL as month,day,count(cookieid) as nums from cookie5 group by day;
    +----------------------------------------------------+--+
    |                      Explain                       |
    +----------------------------------------------------+--+
    | STAGE DEPENDENCIES:                                |
    |   Stage-1 is a root stage                          |
    |   Stage-2 depends on stages: Stage-1, Stage-3      |
    |   Stage-3 is a root stage                          |
    |   Stage-0 depends on stages: Stage-2               |
    |                                                    |
    | STAGE PLANS:                                       |
    |   Stage: Stage-1                                   |
    |     Map Reduce                                     |
    |       Map Operator Tree:                           |
    |           TableScan                                |
    |             alias: cookie5                         |
    |             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |             Select Operator                        |
    |               expressions: month (type: string), cookieid (type: string) |
    |               outputColumnNames: month, cookieid   |
    |               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |               Group By Operator                    |
    |                 aggregations: count(cookieid)      |
    |                 keys: month (type: string)         |
    |                 mode: hash                         |
    |                 outputColumnNames: _col0, _col1    |
    |                 Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |                 Reduce Output Operator             |
    |                   key expressions: _col0 (type: string) |
    |                   sort order: +                    |
    |                   Map-reduce partition columns: _col0 (type: string) |
    |                   Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |                   value expressions: _col1 (type: bigint) |
    |       Reduce Operator Tree:                        |
    |         Group By Operator                          |
    |           aggregations: count(VALUE._col0)         |
    |           keys: KEY._col0 (type: string)           |
    |           mode: mergepartial                       |
    |           outputColumnNames: _col0, _col1          |
    |           Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |           Select Operator                          |
    |             expressions: _col0 (type: string), UDFToString(null) (type: string), _col1 (type: bigint) |
    |             outputColumnNames: _col0, _col1, _col2 |
    |             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |             File Output Operator                   |
    |               compressed: false                    |
    |               table:                               |
    |                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    |                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    |                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
    |                                                    |
    |   Stage: Stage-2                                   |
    |     Map Reduce                                     |
    |       Map Operator Tree:                           |
    |           TableScan                                |
    |             Union                                  |
    |               Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
    |               File Output Operator                 |
    |                 compressed: false                  |
    |                 Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
    |                 table:                             |
    |                     input format: org.apache.hadoop.mapred.TextInputFormat |
    |                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    |                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    |           TableScan                                |
    |             Union                                  |
    |               Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
    |               File Output Operator                 |
    |                 compressed: false                  |
    |                 Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
    |                 table:                             |
    |                     input format: org.apache.hadoop.mapred.TextInputFormat |
    |                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    |                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    |                                                    |
    |   Stage: Stage-3                                   |
    |     Map Reduce                                     |
    |       Map Operator Tree:                           |
    |           TableScan                                |
    |             alias: cookie5                         |
    |             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |             Select Operator                        |
    |               expressions: day (type: string), cookieid (type: string) |
    |               outputColumnNames: day, cookieid     |
    |               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |               Group By Operator                    |
    |                 aggregations: count(cookieid)      |
    |                 keys: day (type: string)           |
    |                 mode: hash                         |
    |                 outputColumnNames: _col0, _col1    |
    |                 Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |                 Reduce Output Operator             |
    |                   key expressions: _col0 (type: string) |
    |                   sort order: +                    |
    |                   Map-reduce partition columns: _col0 (type: string) |
    |                   Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |                   value expressions: _col1 (type: bigint) |
    |       Reduce Operator Tree:                        |
    |         Group By Operator                          |
    |           aggregations: count(VALUE._col0)         |
    |           keys: KEY._col0 (type: string)           |
    |           mode: mergepartial                       |
    |           outputColumnNames: _col0, _col1          |
    |           Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    +----------------------------------------------------+--+
    |                      Explain                       |
    +----------------------------------------------------+--+
    |           Select Operator                          |
    |             expressions: UDFToString(null) (type: string), _col0 (type: string), _col1 (type: bigint) |
    |             outputColumnNames: _col0, _col1, _col2 |
    |             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
    |             File Output Operator                   |
    |               compressed: false                    |
    |               table:                               |
    |                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    |                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    |                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
    |                                                    |
    |   Stage: Stage-0                                   |
    |     Fetch Operator                                 |
    |       limit: -1                                    |
    |       Processor Tree:                              |
    |         ListSink                                   |
    |                                                    |
    +----------------------------------------------------+--+

    比较:通过explain ,当用grouping sets时,只有2个stage,只有一次reduce,而当用group by + union时,有4个stage,发生了两次reduce。那么肯定用grouping sets时,速度会快。

    GROUPING__ID的使用:

    来条sql语句:

    select 
      month,
      day,
      count(distinct cookieid) as uv,
      GROUPING__ID
    from cookie5 
    group by month,day 
    grouping sets (month,day) 
    order by GROUPING__ID;

    效果等价于:

    SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month
    UNION ALL 
    SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day;

    结果说明

    第一列是按照month进行分组

    第二列是按照day进行分组

    第三列是按照month或day分组是,统计这一组有几个不同的cookieid

    第四列grouping_id表示这一组结果属于哪个分组集合,根据grouping sets中的分组条件month,day,1是代表month,2是代表day

    再来个例子:

    SELECT  month, day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID 
    FROM cookie5 
    GROUP BY month,day 
    GROUPING SETS (month,day,(month,day)) 
    ORDER BY GROUPING__ID;

    等价于:

    SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month 
    UNION ALL 
    SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day
    UNION ALL 
    SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM cookie5 GROUP BY month,day;

    GROUPING SETS (month,day,(month,day)) 这个的意思是 按三个维度来进行统计 1.按月 2 按天 3 按 月和天 ,结果也能证明这一个观点。

    结束! 转自:https://www.cnblogs.com/qingyunzong/p/8798987.html#_label2
  • 相关阅读:
    在Win10中通过命令行打开UWP应用
    前端学习Docker
    Puppeteer的使用
    taro教程
    22种开源Vue模板和主题框架「干货」
    36种免费React模板和主题「干货」
    移动端1px显示异常解决方案
    前端性能优化(二)
    Vue.set()和this.$set()源码解析
    2018年8月7日 乐视2 X621 刷机包下载链接
  • 原文地址:https://www.cnblogs.com/erlou96/p/13564191.html
Copyright © 2011-2022 走看看