  • How Hive implements GROUP BY: it is essentially word count

    Preparing the data

    SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
    hive> SELECT * FROM logs;
    a	苹果	5
    a	橙子	3
    a	苹果	2
    b	烧鸡	1
     
    hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
    a	10
    b	1
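
The query above can be read as a word count in which each row contributes its count value rather than 1. A minimal Python sketch of that idea, assuming the four rows shown above are the entire table (the function name `group_by_sum` is illustrative and not part of Hive):

```python
from collections import defaultdict

# Rows of the logs table as shown above: (uid, item, count).
logs = [
    ("a", "苹果", 5),
    ("a", "橙子", 3),
    ("a", "苹果", 2),
    ("b", "烧鸡", 1),
]

def group_by_sum(rows):
    """Sum the count column per uid, the way word count sums 1s per word."""
    totals = defaultdict(int)
    for uid, _item, count in rows:
        totals[uid] += count
    return dict(totals)

result = group_by_sum(logs)
print(sorted(result.items()))  # [('a', 10), ('b', 1)]
```

The totals match the Hive output: 10 for a and 1 for b.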

    How the computation works

    [Figure: hive-groupby-cal]
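    hive.map.aggr=true is the default, so a first GROUP BY runs on the mapper side and the partial results are merged at the end; this reduces the amount of data the reducers have to process. Note that the mode shown by EXPLAIN differs between the two sides: the mapper uses hash and the reducer uses mergepartial. With hive.map.aggr=false, the GROUP BY is performed only in the reducer, and its mode is complete.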
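
The difference between the modes can be sketched in Python. This is only a simulation under stated assumptions, not Hive's code: the split of the four rows across two mappers is hypothetical, and the function names are invented for illustration.

```python
from collections import defaultdict

def map_hash_aggregate(split):
    """Mapper with hive.map.aggr=true: partial sums per key in a hash table."""
    partial = defaultdict(int)
    for uid, count in split:
        partial[uid] += count
    return list(partial.items())

def reduce_merge_partial(mapper_outputs):
    """Reducer in mergepartial mode: merge the partial sums from all mappers."""
    merged = defaultdict(int)
    for output in mapper_outputs:
        for uid, partial_sum in output:
            merged[uid] += partial_sum
    return dict(merged)

def reduce_complete(splits):
    """Reducer in complete mode (hive.map.aggr=false): aggregate raw rows."""
    totals = defaultdict(int)
    for split in splits:
        for uid, count in split:
            totals[uid] += count
    return dict(totals)

# Hypothetical assignment of the four rows to two mappers.
splits = [
    [("a", 5), ("a", 3)],
    [("a", 2), ("b", 1)],
]

mapper_outputs = [map_hash_aggregate(s) for s in splits]
shuffled_with_aggr = sum(len(o) for o in mapper_outputs)  # records sent to reducers
shuffled_without_aggr = sum(len(s) for s in splits)       # raw rows sent to reducers

print(reduce_merge_partial(mapper_outputs), shuffled_with_aggr)
print(reduce_complete(splits), shuffled_without_aggr)
```

Both paths give the same totals; in this toy split the map-side aggregation sends 3 records to the reducers instead of 4, and the saving grows as more rows share a key within one mapper.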

    Operator

    [Figure: hive-groupby-op]

    Explain

    hive> explain SELECT uid, sum(count) FROM logs group by uid;
    OK
    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))
     
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage
     
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            logs 
              TableScan // scan the table
                alias: logs
                Select Operator // select columns
                  expressions:
                        expr: uid
                        type: string
                        expr: count
                        type: int
                  outputColumnNames: uid, count
                  Group By Operator // hive.map.aggr=true by default, so the mapper aggregates first, reducing the data the reducers must process
                    aggregations:
                          expr: sum(count) // aggregate function
                    bucketGroup: false
                    keys: // group-by key
                          expr: uid
                          type: string
                    mode: hash // hash mode, processHashAggr()
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator // emit key/value pairs to the reducer
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: _col0
                            type: string
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: bigint
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: sum(VALUE._col0) // aggregate
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: string
              mode: mergepartial // merge the partial values
              outputColumnNames: _col0, _col1
              Select Operator // select columns
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                File Output Operator // write to file
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
     
      Stage: Stage-0
        Fetch Operator
          limit: -1
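
In the plan above, the Reduce Output Operator partitions on _col0 and sorts ascending (sort order: +), so every partial sum for a given uid reaches the same reducer in key order. A rough Python sketch of that shuffle step follows. The CRC32 hash is a stand-in chosen for determinism, not Hive's actual partitioner, and the mapper outputs and reducer count are hypothetical.

```python
import zlib
from itertools import groupby

def partition_for(key, num_reducers):
    # Stand-in hash (CRC32); an assumption, not Hive's actual partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route (_col0, _col1) pairs to reducers by key, then sort ascending."""
    partitions = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            partitions[partition_for(key, num_reducers)].append((key, value))
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]

def reduce_partition(partition):
    """Merge adjacent values that share a key (the input is sorted by key)."""
    return [(key, sum(v for _, v in group))
            for key, group in groupby(partition, key=lambda kv: kv[0])]

# Hypothetical partial sums emitted by two mappers.
mapper_outputs = [[("a", 8)], [("b", 1), ("a", 2)]]
partitions = shuffle(mapper_outputs, num_reducers=2)
final = sorted(row for p in partitions for row in reduce_partition(p))
print(final)  # [('a', 10), ('b', 1)]
```

Because rows with the same key always land in the same partition and arrive sorted, each reducer can merge values in a single pass over adjacent rows.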
  • Original article: https://www.cnblogs.com/bonelee/p/6359615.html