zoukankan      html  css  js  c++  java
  • [Hive

    EXPLAIN Syntax

    Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:

    EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
    AUTHORIZATION is supported from HIVE 0.14.0 via HIVE-5961.

    The use of EXTENDED in the EXPLAIN statement produces extra information about the operators in the plan. This is typically physical information like file names.

    A Hive query gets converted into a sequence (it is more an Directed Acyclic Graph) of stages. These stages may be map/reduce stages or they may even be stages that do metastore or file system operations like move and rename. The explain output comprises of three parts:

    • The Abstract Syntax Tree for the query
    • The dependencies between the different stages of the plan
    • The description of each of the stages

    The description of the stages itself shows a sequence of operators with the metadata associated with the operators. The metadata may comprise of things like filter expressions for the FilterOperator or the select expressions for the SelectOperator or the output file names for the FileSinkOperator.

    As an example, consider the following EXPLAIN query:

    EXPLAIN
    FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;

    The output of this statement contains the following parts:

    • The Abstract Syntax Tree

      ABSTRACT SYNTAX TREE:
        (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) (TOK_GROUPBY (TOK_COLREF src key))))
    • The Dependency Graph

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-2 depends on stages: Stage-1
        Stage-0 depends on stages: Stage-2

      This shows that Stage-1 is the root stage, Stage-2 is executed after Stage-1 is done and Stage-0 is executed after Stage-2 is done.

    • The plans of each Stage

      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Alias -> Map Operator Tree:
              src
                  Reduce Output Operator
                    key expressions:
                          expr: key
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: rand()
                          type: double
                    tag: -1
                    value expressions:
                          expr: substr(value, 4)
                          type: string
            Reduce Operator Tree:
              Group By Operator
                aggregations:
                      expr: sum(UDFToDouble(VALUE.0))
                keys:
                      expr: KEY.0
                      type: string
                mode: partial1
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.mapred.SequenceFileOutputFormat
                      name: binary_table
       
        Stage: Stage-2
          Map Reduce
            Alias -> Map Operator Tree:
              /tmp/hive-zshao/67494501/106593589.10001
                Reduce Output Operator
                  key expressions:
                        expr: 0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: 0
                        type: string
                  tag: -1
                  value expressions:
                        expr: 1
                        type: double
            Reduce Operator Tree:
              Group By Operator
                aggregations:
                      expr: sum(VALUE.0)
                keys:
                      expr: KEY.0
                      type: string
                mode: final
                Select Operator
                  expressions:
                        expr: 0
                        type: string
                        expr: 1
                        type: double
                  Select Operator
                    expressions:
                          expr: UDFToInteger(0)
                          type: int
                          expr: 1
                          type: double
                    File Output Operator
                      compressed: false
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                          name: dest_g1
       
        Stage: Stage-0
          Move Operator
            tables:
                  replace: true
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                      name: dest_g1

      In this example there are 2 map/reduce stages (Stage-1 and Stage-2) and 1 File System related stage (Stage-0). Stage-0 basically moves the results from a temporary directory to the directory corresponding to the table dest_g1.

    A map/reduce stage itself comprises of 2 parts:

    • A mapping from table alias to Map Operator Tree - This mapping tells the mappers which operator tree to call in order to process the rows from a particular table or result of a previous map/reduce stage. In Stage-1 in the above example, the rows from src table are processed by the operator tree rooted at a Reduce Output Operator. Similarly, in Stage-2 the rows of the results of Stage-1 are processed by another operator tree rooted at another Reduce Output Operator. Each of these Reduce Output Operators partitions the data to the reducers according to the criteria shown in the metadata.
    • A Reduce Operator Tree - This is the operator tree which processes all the rows on the reducer of the map/reduce job. In Stage-1 for example, the Reducer Operator Tree is carrying out a partial aggregation where as the Reducer Operator Tree in Stage-2 computes the final aggregation from the partial aggregates computed in Stage-1

    The use of DEPENDENCY in the EXPLAIN statement produces extra information about the inputs in the plan. It shows various attributes for the inputs. For example, for a query like:

    EXPLAIN DEPENDENCY
      SELECT key, count(1) FROM srcpart WHERE ds IS NOT NULL GROUP BY key

    the following output is produced:

    {"input_partitions":[{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-08/hr=11"},{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-08/hr=12"},{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-09/hr=11"},{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-09/hr=12"}],"input_tables":[{"tablename":"default@srcpart","tabletype":"MANAGED_TABLE"}]}

    The inputs contain both the tables and the partitions. Note that the table is present even if none of the partitions is accessed in the query.

    The dependencies show the parents in case a table is accessed via a view. Consider the following queries:

    CREATE VIEW V1 AS SELECT key, value from src;
    EXPLAIN DEPENDENCY SELECT * FROM V1;

    The following output is produced:

    {"input_partitions":[],"input_tables":[{"tablename":"default@v1","tabletype":"VIRTUAL_VIEW"},{"tablename":"default@src","tabletype":"MANAGED_TABLE","tableParents":"[default@v1]"}]}

    As above, the inputs contain the view V1 and the table 'src' that the view V1 refers to.

    All the outputs are shown if a table is being accessed via multiple parents.

    CREATE VIEW V2 AS SELECT ds, key, value FROM srcpart WHERE ds IS NOT NULL;
    CREATE VIEW V4 AS
      SELECT src1.key, src2.value as value1, src3.value as value2
      FROM V1 src1 JOIN V2 src2 on src1.key = src2.key JOIN src src3 ON src2.key = src3.key;
    EXPLAIN DEPENDENCY SELECT * FROM V4;

    The following output is produced.

    {"input_partitions":[{"partitionParents":"[default@v2]","partitionName":"default<at:var at:name="srcpart" />ds=2008-04-08/hr=11"},{"partitionParents":"[default@v2]","partitionName":"default<at:var at:name="srcpart" />ds=2008-04-08/hr=12"},{"partitionParents":"[default@v2]","partitionName":"default<at:var at:name="srcpart" />ds=2008-04-09/hr=11"},{"partitionParents":"[default@v2]","partitionName":"default<at:var at:name="srcpart" />ds=2008-04-09/hr=12"}],"input_tables":[{"tablename":"default@v4","tabletype":"VIRTUAL_VIEW"},{"tablename":"default@v2","tabletype":"VIRTUAL_VIEW","tableParents":"[default@v4]"},{"tablename":"default@v1","tabletype":"VIRTUAL_VIEW","tableParents":"[default@v4]"},{"tablename":"default@src","tabletype":"MANAGED_TABLE","tableParents":"[default@v4, default@v1]"},{"tablename":"default@srcpart","tabletype":"MANAGED_TABLE","tableParents":"[default@v2]"}]}

    As can be seen, src is being accessed via parents v1 and v4.

    The use of AUTHORIZATION in the EXPLAIN statement shows all entities needed to be authorized to execute the query and authorization failures if exists. For example, for a query like:

    EXPLAIN AUTHORIZATION
      SELECT * FROM src JOIN srcpart;

    the following output is produced:

    INPUTS:
      default@srcpart
      default@src
      default@srcpart@ds=2008-04-08/hr=11
      default@srcpart@ds=2008-04-08/hr=12
      default@srcpart@ds=2008-04-09/hr=11
      default@srcpart@ds=2008-04-09/hr=12
    OUTPUTS:
      hdfs://localhost:9000/tmp/.../-mr-10000
    CURRENT_USER:
      navis
    OPERATION:
      QUERY
    AUTHORIZATION_FAILURES:
      Permission denied: Principal [name=navis, type=USER] does not have following privileges for operation QUERY [[SELECT] on Object [type=TABLE_OR_VIEW, name=default.src], [SELECT] on Object [type=TABLE_OR_VIEW, name=default.srcpart]]

    With the FORMATTED keyword, it will be returned in JSON format.

    "OUTPUTS":["hdfs://localhost:9000/tmp/.../-mr-10000"],"INPUTS":["default@srcpart","default@src","default@srcpart@ds=2008-04-08/hr=11","default@srcpart@ds=2008-04-08/hr=12","default@srcpart@ds=2008-04-09/hr=11","default@srcpart@ds=2008-04-09/hr=12"],"OPERATION":"QUERY","CURRENT_USER":"navis","AUTHORIZATION_FAILURES":["Permission denied: Principal [name=navis, type=USER] does not have following privileges for operation QUERY [[SELECT] on Object [type=TABLE_OR_VIEW, name=default.src], [SELECT] on Object [type=TABLE_OR_VIEW, name=default.srcpart]]"]}
     
    谨言慎行,专注思考 , 工作与生活同乐
  • 相关阅读:
    矩形法求积分sin cos exp
    约瑟夫环问题
    KMP算法详解
    找出float型数组的最大值和最小值,分别和第一个和最后一个元素互换
    二重指针应用
    C++学习笔记(一)
    Line学习笔记
    node2vec学习笔记
    deepwalk学习笔记
    如何保证消息不被重复消费?(如何保证消息消费时的幂等性)
  • 原文地址:https://www.cnblogs.com/tmeily/p/4250011.html
Copyright © 2011-2022 走看看