zoukankan      html  css  js  c++  java
  • hive left outer join的问题

    最近BA用户反馈有两句看似很像的语句返回的结果数不一样,比较奇怪,怀疑是不是Hive的Bug

    Query 1 返回结果数6071

    select count(distinct reviewid) as dis_reviewcnt
    from
    (select a.reviewid
    from bi.dpods_dp_reviewreport  a
    left outer join bi.dpods_dp_reviewlog  b
    on a.reviewid=b.reviewid and  b.hp_statdate='2013-07-24'  
    where to_date(a.feedadddate) >= '2013-07-01'   and a.hp_statdate='2013-07-24'
    ) a


    Query 2 返回结果数6443

    select count(distinct reviewid) as dis_reviewcnt
    from
    (select a.reviewid
    from bi.dpods_dp_reviewreport  a
    left outer join bi.dpods_dp_reviewlog  b
    on a.reviewid=b.reviewid and  b.hp_statdate='2013-07-24'   and a.hp_statdate='2013-07-24'
    where to_date(a.feedadddate) >= '2013-07-01'  
    ) a

    第二条query比第一条多了372条数据,而且在子查询的左表中并不存在

    两条语句唯一的区别是dpods_dp_reviewreport的分区过滤条件(hp_statdate是partition column)一个在where后面,另一个在on后面

    粗看感觉出来的数据应该是一样的,但是玄机其实就在where和on的区别。

    where 后面跟的是过滤条件,query 1 中的a.hp_statdate='2013-07-24', 在table scan之前就会Partition Pruner 过滤分区,所以只有'2013-07-24'下的数据会和dpods_dp_reviewlog进行join。

    而query 2中会读入所有partition下的数据,再和dpods_dp_reviewlog join,并且根据join的关联条件只有a.hp_statdate='2013-07-24'的时候才会真正执行join,其余情况下又由于是left outer join, join不上右面会留NULL,query 2中其实是取出了所有的reviewid,所以会和query 1 结果不一样

    可以做一个实验,query2去掉on后面的a.hp_statdate='2013-07-24',其余不动,执行语句,出来的distinct reviewcnt 也是 6443

    select count(distinct reviewid) as dis_reviewcnt
    from
    (select a.reviewid
    from bi.dpods_dp_reviewreport  a
    left outer join bi.dpods_dp_reviewlog  b
    on a.reviewid=b.reviewid and  b.hp_statdate='2013-07-24'
    where to_date(a.feedadddate) >= '2013-07-01'  
    ) a


    query 1的query plan

    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (and (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01') (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24'))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))
    
    STAGE DEPENDENCIES:
      Stage-5 is a root stage , consists of Stage-1
      Stage-1
      Stage-2 depends on stages: Stage-1
      Stage-0 is a root stage
    
    STAGE PLANS:
      Stage: Stage-5
        Conditional Operator
    
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            a:a 
              TableScan
                alias: a
                Filter Operator
                  predicate:
                      expr: (to_date(feedadddate) >= '2013-07-01')
                      type: boolean
                  Reduce Output Operator
                    key expressions:
                          expr: reviewid
                          type: int
                    sort order: +
                    Map-reduce partition columns:
                          expr: reviewid
                          type: int
                    tag: 0
                    value expressions:
                          expr: feedadddate
                          type: string
                          expr: reviewid
                          type: int
                          expr: hp_statdate
                          type: string
            a:b 
              TableScan
                alias: b
                Reduce Output Operator
                  key expressions:
                        expr: reviewid
                        type: int
                  sort order: +
                  Map-reduce partition columns:
                        expr: reviewid
                        type: int
                  tag: 1
          Reduce Operator Tree:
            Join Operator
              condition map:
                   Left Outer Join0 to 1
              condition expressions:
                0 {VALUE._col5} {VALUE._col8} {VALUE._col17}
                1 
              handleSkewJoin: false
              outputColumnNames: _col5, _col8, _col17
              Select Operator
                expressions:
                      expr: _col8
                      type: int
                outputColumnNames: _col0
                Select Operator
                  expressions:
                        expr: _col0
                        type: int
                  outputColumnNames: _col0
                  Group By Operator
                    aggregations:
                          expr: count(DISTINCT _col0)
                    bucketGroup: false
                    keys:
                          expr: _col0
                          type: int
                    mode: hash
                    outputColumnNames: _col0, _col1
                    File Output Operator
                      compressed: true
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
    
      Stage: Stage-2
        Map Reduce
          Alias -> Map Operator Tree:
            hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-10-59_408_7272696604651905662/-mr-10002 
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                  sort order: +
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: count(DISTINCT KEY._col0:0._col0)
              bucketGroup: false
              mode: mergepartial
              outputColumnNames: _col0
              Select Operator
                expressions:
                      expr: _col0
                      type: bigint
                outputColumnNames: _col0
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    
      Stage: Stage-0
        Fetch Operator
          limit: -1


    Query 2的query plan

    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')) (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01')))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))
    
    STAGE DEPENDENCIES:
      Stage-5 is a root stage , consists of Stage-1
      Stage-1
      Stage-2 depends on stages: Stage-1
      Stage-0 is a root stage
    
    STAGE PLANS:
      Stage: Stage-5
        Conditional Operator
    
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            a:a 
              TableScan
                alias: a
                Filter Operator
                  predicate:
                      expr: (to_date(feedadddate) >= '2013-07-01')
                      type: boolean
                  Reduce Output Operator
                    key expressions:
                          expr: reviewid
                          type: int
                    sort order: +
                    Map-reduce partition columns:
                          expr: reviewid
                          type: int
                    tag: 0
                    value expressions:
                          expr: feedadddate
                          type: string
                          expr: reviewid
                          type: int
                          expr: hp_statdate
                          type: string
            a:b 
              TableScan
                alias: b
                Reduce Output Operator
                  key expressions:
                        expr: reviewid
                        type: int
                  sort order: +
                  Map-reduce partition columns:
                        expr: reviewid
                        type: int
                  tag: 1
          Reduce Operator Tree:
            Join Operator
              condition map:
                   Left Outer Join0 to 1
              condition expressions:
                0 {VALUE._col5} {VALUE._col8}
                1 
              filter predicates:
                0 {(VALUE._col17 = '2013-07-24')}
                1 
              handleSkewJoin: false
              outputColumnNames: _col5, _col8
              Select Operator
                expressions:
                      expr: _col8
                      type: int
                outputColumnNames: _col0
                Select Operator
                  expressions:
                        expr: _col0
                        type: int
                  outputColumnNames: _col0
                  Group By Operator
                    aggregations:
                          expr: count(DISTINCT _col0)
                    bucketGroup: false
                    keys:
                          expr: _col0
                          type: int
                    mode: hash
                    outputColumnNames: _col0, _col1
                    File Output Operator
                      compressed: true
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
    
      Stage: Stage-2
        Map Reduce
          Alias -> Map Operator Tree:
            hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-13-32_879_3623450294049807419/-mr-10002 
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                  sort order: +
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                    expr: count(DISTINCT KEY._col0:0._col0)
              bucketGroup: false
              mode: mergepartial
              outputColumnNames: _col0
              Select Operator
                expressions:
                      expr: _col0
                      type: bigint
                outputColumnNames: _col0
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    
      Stage: Stage-0
        Fetch Operator
          limit: -1

    参考:

    http://blog.sina.com.cn/s/blog_6ff05a2c01010oxp.html


  • 相关阅读:
    json~
    ASP.NET经典60道面试题
    C#相关算法_1
    js各种事件
    WebService开发(一) 如何使用Soap头
    使用ASP.NET AJAX异步调用Web Service和页面中的类方法(1):调用Web Service、调用页面中的类方法
    javascriptの一些问题
    ASP.net的身份验证方式有哪些?分别是什么原理?
    如何下载jmeter旧版本
    转账和二维码转账功能测试点
  • 原文地址:https://www.cnblogs.com/jiangu66/p/3221690.html
Copyright © 2011-2022 走看看