  • Hive GROUP BY implementation: essentially a word count

    Prepare the data

    SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
    hive> SELECT * FROM logs;
    a	苹果	5
    a	橙子	3
    a	苹果	2
    b	烧鸡	1
     
    hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
    a	10
    b	1
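
The query is the same computation as a word count, except that each row contributes its count value instead of 1. A minimal Python sketch of the result above, with the rows hard-coded from the logs table; the name group_by_sum is hypothetical and is not part of Hive:

```python
from collections import defaultdict

# Rows copied from the `logs` table above: (uid, item, count).
LOGS = [
    ("a", "苹果", 5),
    ("a", "橙子", 3),
    ("a", "苹果", 2),
    ("b", "烧鸡", 1),
]


def group_by_sum(rows):
    """Sum the count column per uid, like SELECT uid, SUM(count) ... GROUP BY uid."""
    totals = defaultdict(int)
    for uid, _item, count in rows:
        totals[uid] += count
    return dict(totals)


print(group_by_sum(LOGS))  # {'a': 10, 'b': 1}
```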

    Computation process

    [Figure: hive-groupby-cal]
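    hive.map.aggr=true is set by default, so a GROUP BY is first run on the mapper side and the partial results are merged at the end; this reduces the amount of data the reducer has to process. Note that the mode shown by explain differs between the two sides: hash on the mapper, mergepartial on the reducer. With hive.map.aggr=false, the GROUP BY is done only in the reducer, and its mode is complete.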
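
The two settings can be compared with a small simulation. This is only a sketch in Python, not Hive's implementation: the function names and the way the four rows are split across two mappers are assumptions made for illustration.

```python
from collections import defaultdict


def map_side_hash_aggregate(split):
    """Mapper side, mode hash: pre-aggregate one mapper's rows in a hash table."""
    partial = defaultdict(int)
    for uid, count in split:
        partial[uid] += count
    return list(partial.items())


def reduce_side_merge(pairs):
    """Reducer side: sum the values received per key.

    Fed with partial sums this plays the role of mergepartial;
    fed with raw rows it plays the role of complete.
    """
    merged = defaultdict(int)
    for uid, value in pairs:
        merged[uid] += value
    return dict(merged)


# Hypothetical split of the four (uid, count) rows across two mappers.
splits = [
    [("a", 5), ("a", 3)],
    [("a", 2), ("b", 1)],
]

# hive.map.aggr=true: each mapper emits one partial sum per key it saw.
shuffled_partial = [kv for split in splits for kv in map_side_hash_aggregate(split)]
# hive.map.aggr=false: every raw row is sent to the reducer.
shuffled_raw = [kv for split in splits for kv in split]

print(len(shuffled_partial), reduce_side_merge(shuffled_partial))
print(len(shuffled_raw), reduce_side_merge(shuffled_raw))
```

Both paths give the same totals, but with map-side aggregation only 3 records cross the shuffle instead of 4. For SUM, merging partial results is again a sum; other aggregate functions may need a different merge step.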

    Operator

    [Figure: hive-groupby-op]

    Explain

    hive> explain SELECT uid, sum(count) FROM logs group by uid;
    OK
    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))
     
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage
     
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            logs 
              TableScan // scan the table
                alias: logs
                Select Operator // select the columns
                  expressions:
                        expr: uid
                        type: string
                        expr: count
                        type: int
                  outputColumnNames: uid, count
                  Group By Operator // present because hive.map.aggr=true by default: the mapper aggregates first, reducing the data the reducer has to process
                    aggregations:
                          expr: sum(count) // aggregate function
                    bucketGroup: false
                    keys: // grouping keys
                          expr: uid
                          type: string
                    mode: hash // hash-based aggregation, processHashAggr()
                    outputColumnNames: _col0, _col1
                      Reduce Output Operator // emits key/value pairs to the reducer
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: _col0
                            type: string
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: bigint
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: sum(VALUE._col0) // aggregation
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: string
              mode: mergepartial // merge the partial values
              outputColumnNames: _col0, _col1
              Select Operator // select the columns
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                File Output Operator // write the output to a file
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
     
      Stage: Stage-0
        Fetch Operator
          limit: -1
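
The Reduce Output Operator in the plan partitions on _col0 and sorts on it in ascending order ("sort order: +"), so all values for one uid reach the same reducer next to each other. A rough Python sketch of that step follows. The reducer count, the crc32 hash, and the mapper output below are assumptions made for illustration and are not taken from Hive.

```python
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # hypothetical; the plan above does not state a reducer count


def partition_for(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer from the key. crc32 is a stand-in, not Hive's hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers=NUM_REDUCERS):
    """Route each (_col0, _col1) pair by key, then sort every partition ascending."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return [sorted(part, key=itemgetter(0)) for part in partitions]


def reduce_partition(sorted_pairs):
    """Sum each run of equal keys; sorting guarantees equal keys are adjacent."""
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    ]


# Partial sums as a mapper in hash mode might emit them (hypothetical values).
map_output = [("a", 8), ("a", 2), ("b", 1)]

result = sorted(row for part in shuffle(map_output) for row in reduce_partition(part))
print(result)  # [('a', 10), ('b', 1)]
```

Because the partition is chosen from the key alone, the final totals do not depend on how many reducers are used.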
  • Original article: https://www.cnblogs.com/bonelee/p/6359615.html