  Hive Study Notes (8): Hive Tuning [Important]

    Environment
      VM: VMware 10
      Linux: CentOS-6.5-x86_64
      Client: Xshell4
      FTP: Xftp4
      jdk8
      hadoop-3.1.1
      apache-hive-3.1.1

    1. Execution Plans
    Core idea: optimize Hive SQL by treating it as the MapReduce program it compiles into.
    The following kinds of SQL are not converted into a MapReduce job (see the hedged example after this list):
      - a SELECT that reads only this table's own columns
      - a WHERE that filters only on this table's own columns
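
    For instance, both statements below are typically served by a simple fetch task rather than a MapReduce job. This is a hedged illustration: it assumes Hive's default fetch-task conversion and reuses the psn2 table queried later in this section.

      select id, name from psn2;          -- reads only this table's columns: no MapReduce job
      select * from psn2 where id > 1;    -- filters only on this table's columns: no MapReduce job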

    EXPLAIN shows the execution plan: EXPLAIN [EXTENDED] query

    hive> explain select count(*) from psn2;
    OK
    Explain
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1
    
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: psn2
                Statistics: Num rows: 2 Data size: 7440 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  Statistics: Num rows: 2 Data size: 7440 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count()
                    mode: hash
                    outputColumnNames: _col0
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      sort order: 
                      Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                      value expressions: _col0 (type: bigint)
          Execution mode: vectorized
          Reduce Operator Tree:
            Group By Operator
              aggregations: count(VALUE._col0)
              mode: mergepartial
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    
      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            ListSink
    
    Time taken: 2.7 seconds, Fetched: 43 row(s)
    hive>
    hive> explain extended select count(*) from psn2;
    OK
    Explain
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1
    
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: psn2
                Statistics: Num rows: 2 Data size: 7440 Basic stats: COMPLETE Column stats: NONE
                GatherStats: false
                Select Operator
                  Statistics: Num rows: 2 Data size: 7440 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count()
                    mode: hash
                    outputColumnNames: _col0
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      null sort order: 
                      sort order: 
                      Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                      tag: -1
                      value expressions: _col0 (type: bigint)
                      auto parallelism: false
          Execution mode: vectorized
          Path -> Alias:
            hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=10 [psn2]
            hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=20 [psn2]
          Path -> Partition:
            hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=10 
              Partition
                base file name: age=10
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                partition values:
                  age 10
                properties:
                  bucket_count -1
                  collection.delim -
                  column.name.delimiter ,
                  columns id,name,likes,address
                  columns.comments 
                  columns.types int:string:array<string>:map<string,string>
                  field.delim ,
                  file.inputformat org.apache.hadoop.mapred.TextInputFormat
                  file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  line.delim 
    
                  location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=10
                  mapkey.delim :
                  name default.psn2
                  numFiles 1
                  numRows 0
                  partition_columns age
                  partition_columns.types int
                  rawDataSize 0
                  serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
                  serialization.format ,
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  totalSize 372
                  transient_lastDdlTime 1548986286
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    bucket_count -1
                    bucketing_version 2
                    collection.delim -
                    column.name.delimiter ,
                    columns id,name,likes,address
                    columns.comments 
                    columns.types int:string:array<string>:map<string,string>
                    field.delim ,
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    line.delim 
    
                    location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2
                    mapkey.delim :
                    name default.psn2
                    partition_columns age
                    partition_columns.types int
                    serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
                    serialization.format ,
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    transient_lastDdlTime 1548915997
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.psn2
                name: default.psn2
            hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=20 
              Partition
                base file name: age=20
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                partition values:
                  age 20
                properties:
                  bucket_count -1
                  collection.delim -
                  column.name.delimiter ,
                  columns id,name,likes,address
                  columns.comments 
                  columns.types int:string:array<string>:map<string,string>
                  field.delim ,
                  file.inputformat org.apache.hadoop.mapred.TextInputFormat
                  file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  line.delim 
    
                  location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=20
                  mapkey.delim :
                  name default.psn2
                  numFiles 1
                  numRows 0
                  partition_columns age
                  partition_columns.types int
                  rawDataSize 0
                  serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
                  serialization.format ,
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  totalSize 372
                  transient_lastDdlTime 1548986540
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    bucket_count -1
                    bucketing_version 2
                    collection.delim -
                    column.name.delimiter ,
                    columns id,name,likes,address
                    columns.comments 
                    columns.types int:string:array<string>:map<string,string>
                    field.delim ,
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    line.delim 
    
                    location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2
                    mapkey.delim :
                    name default.psn2
                    partition_columns age
                    partition_columns.types int
                    serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
                    serialization.format ,
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    transient_lastDdlTime 1548915997
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.psn2
                name: default.psn2
          Truncated Path -> Alias:
            /psn2/age=10 [psn2]
            /psn2/age=20 [psn2]
          Needs Tagging: false
          Reduce Operator Tree:
            Group By Operator
              aggregations: count(VALUE._col0)
              mode: mergepartial
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                GlobalTableId: 0
                directory: hdfs://PCS102:9820/tmp/hive/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_10-58-42_159_2637812497308639143-1/-mr-10001/.hive-staging_hive_2019-02-19_10-58-42_159_2637812497308639143-1/-ext-10002
                NumFilesPerFileSink: 1
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Stats Publishing Key Prefix: hdfs://PCS102:9820/tmp/hive/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_10-58-42_159_2637812497308639143-1/-mr-10001/.hive-staging_hive_2019-02-19_10-58-42_159_2637812497308639143-1/-ext-10002/
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    properties:
                      columns _col0
                      columns.types bigint
                      escape.delim 
                      hive.serialization.extend.additional.nesting.levels true
                      serialization.escape.crlf true
                      serialization.format 1
                      serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                TotalFiles: 1
                GatherStats: false
                MultiFileSpray: false
    
      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            ListSink
    
    Time taken: 0.142 seconds, Fetched: 192 row(s)
    hive> 

    2. Execution Modes
    (1) Hive queries run in either local mode or cluster mode.
    (2) Enable local mode (useful when a table holds little data):
      set hive.exec.mode.local.auto=true;
    Note:
      hive.exec.mode.local.auto.inputbytes.max defaults to 128M.
      It is the maximum input size local mode will accept; if the input is larger, the query still runs in cluster mode!
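
    A minimal sketch of the two settings together (the byte value is just the 128M default written out explicitly):

      set hive.exec.mode.local.auto=true;                      -- let Hive choose local mode automatically
      set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128M: larger inputs still run on the cluster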

    hive> set hive.exec.mode.local.auto=true;
    hive> select count(*)  from psn21;
    Automatically selecting local only mode for query
    Query ID = root_20190219144810_0bafff9e-1c40-45f6-b687-60c5d13c9f0c
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Job running in-process (local Hadoop)
    2019-02-19 14:48:11,839 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_local1827024396_0002
    MapReduce Jobs Launched: 
    Stage-Stage-1:  HDFS Read: 4702 HDFS Write: 623 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    _c0
    6
    Time taken: 1.376 seconds, Fetched: 1 row(s)
    hive> set hive.exec.mode.local.auto=false;
    hive> select count(*)  from psn21;
    Query ID = root_20190219144841_6fd11106-5db1-4335-8b0b-884697b558df
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1548397153910_0013, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0013/
    Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0013
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2019-02-19 14:48:49,046 Stage-1 map = 0%,  reduce = 0%
    2019-02-19 14:48:54,210 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.87 sec
    2019-02-19 14:48:59,328 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.28 sec
    MapReduce Total cumulative CPU time: 6 seconds 280 msec
    Ended Job = job_1548397153910_0013
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.28 sec   HDFS Read: 13976 HDFS Write: 101 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 280 msec
    OK
    _c0
    6
    Time taken: 18.923 seconds, Fetched: 1 row(s)
    hive> 

    3. Parallel Execution
    Enable parallel mode with the following parameter (local mode must be turned off):
    set hive.exec.parallel=true;

    Also: hive.exec.parallel.thread.number is the maximum number of jobs allowed to run in parallel within a single SQL query.
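
    A minimal sketch of both settings (the thread count of 8 is illustrative, not from the original session):

      set hive.exec.parallel=true;             -- run independent stages of one query concurrently
      set hive.exec.parallel.thread.number=8;  -- illustrative cap on concurrent jobs per query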

    hive> set hive.exec.parallel;
    hive.exec.parallel=false
    hive> select t1.cnt1,t2.cnt2 from 
        > (select count(id) cnt1 from psn21) t1,
        > (select count(name) cnt2 from psn21)t2;
    Warning: Map Join MAPJOIN[27][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
    Warning: Map Join MAPJOIN[34][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
    Warning: Shuffle Join JOIN[14][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
    Query ID = root_20190219145608_b4f3d4e9-b858-41be-9ddc-6eccff0ec9d9
    Total jobs = 5
    Launching Job 1 out of 5
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1548397153910_0014, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0014/
    Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0014
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2019-02-19 14:56:15,027 Stage-1 map = 0%,  reduce = 0%
    2019-02-19 14:56:20,148 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.85 sec
    2019-02-19 14:56:25,277 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.04 sec
    MapReduce Total cumulative CPU time: 6 seconds 40 msec
    Ended Job = job_1548397153910_0014
    Launching Job 2 out of 5
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1548397153910_0015, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0015/
    Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0015
    Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
    2019-02-19 14:56:37,661 Stage-3 map = 0%,  reduce = 0%
    2019-02-19 14:56:42,786 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.8 sec
    2019-02-19 14:56:46,875 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 5.92 sec
    MapReduce Total cumulative CPU time: 5 seconds 920 msec
    Ended Job = job_1548397153910_0015
    Stage-7 is selected by condition resolver.
    Stage-8 is filtered out by condition resolver.
    Stage-2 is filtered out by condition resolver.
    SLF4J: Found binding in [jar:file:/usr/local/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    2019-02-19 14:56:55    Starting to launch local task to process map join;    maximum memory = 239075328
    2019-02-19 14:56:56    Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_14-56-08_997_6748376838876035123-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
    2019-02-19 14:56:56    Uploaded 1 File to: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_14-56-08_997_6748376838876035123-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (278 bytes)
    
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 4 out of 5
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1548397153910_0016, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0016/
    Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0016
    Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
    2019-02-19 14:57:02,370 Stage-4 map = 0%,  reduce = 0%
    2019-02-19 14:57:07,478 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 3.07 sec
    MapReduce Total cumulative CPU time: 3 seconds 70 msec
    Ended Job = job_1548397153910_0016
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.04 sec   HDFS Read: 13637 HDFS Write: 114 SUCCESS
    Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 5.92 sec   HDFS Read: 13641 HDFS Write: 114 SUCCESS
    Stage-Stage-4: Map: 1   Cumulative CPU: 3.07 sec   HDFS Read: 6091 HDFS Write: 103 SUCCESS
    Total MapReduce CPU Time Spent: 15 seconds 30 msec
    OK
    t1.cnt1    t2.cnt2
    6    6
    Time taken: 59.527 seconds, Fetched: 1 row(s)
    hive> set hive.exec.parallel=true;
    hive> (select count(name) cnt2 from psn21)t2;
    FAILED: ParseException line 1:36 extraneous input 't2' expecting EOF near '<EOF>'
    hive> select t1.cnt1,t2.cnt2 from 
        > (select count(id) cnt1 from psn21) t1,
        > (select count(name) cnt2 from psn21)t2;
    Warning: Map Join MAPJOIN[27][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
    Warning: Map Join MAPJOIN[34][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
    Warning: Shuffle Join JOIN[14][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
    Query ID = root_20190219145918_2f98437b-7070-41a4-905b-4c6a3a160d46
    Total jobs = 5
    Launching Job 1 out of 5
    Launching Job 2 out of 5
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1548397153910_0018, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0018/
    Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0018
    Starting Job = job_1548397153910_0017, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0017/
    Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0017
    Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
    2019-02-19 14:59:25,322 Stage-3 map = 0%,  reduce = 0%
    2019-02-19 14:59:29,510 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.0 sec
    2019-02-19 14:59:34,612 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 6.25 sec
    MapReduce Total cumulative CPU time: 6 seconds 250 msec
    Ended Job = job_1548397153910_0018
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2019-02-19 14:59:46,581 Stage-1 map = 0%,  reduce = 0%
    2019-02-19 14:59:51,687 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.74 sec
    2019-02-19 14:59:56,796 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.9 sec
    MapReduce Total cumulative CPU time: 5 seconds 900 msec
    Ended Job = job_1548397153910_0017
    Stage-7 is selected by condition resolver.
    Stage-8 is filtered out by condition resolver.
    Stage-2 is filtered out by condition resolver.
    SLF4J: Found binding in [jar:file:/usr/local/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    2019-02-19 15:00:07    Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_14-59-18_586_8660726948780795909-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile21--.hashtable
    2019-02-19 15:00:07    Uploaded 1 File to: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_14-59-18_586_8660726948780795909-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile21--.hashtable (278 bytes)
    
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 4 out of 5
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1548397153910_0019, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0019/
    Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0019
    Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
    2019-02-19 15:00:16,324 Stage-4 map = 0%,  reduce = 0%
    2019-02-19 15:00:20,426 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 2.95 sec
    MapReduce Total cumulative CPU time: 2 seconds 950 msec
    Ended Job = job_1548397153910_0019
    MapReduce Jobs Launched: 
    Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 6.25 sec   HDFS Read: 13641 HDFS Write: 114 SUCCESS
    Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.9 sec   HDFS Read: 13637 HDFS Write: 114 SUCCESS
    Stage-Stage-4: Map: 1   Cumulative CPU: 2.95 sec   HDFS Read: 6091 HDFS Write: 103 SUCCESS
    Total MapReduce CPU Time Spent: 15 seconds 100 msec
    OK
    t1.cnt1    t2.cnt2
    6    6
    Time taken: 64.206 seconds, Fetched: 1 row(s)
    hive> 

    4. Strict Mode
    Enable strict mode with the following parameter:
    set hive.mapred.mode=strict;
    (the default is nonstrict)

    Query restrictions:
    1. For partitioned tables, the WHERE clause must filter on the partition column;

    hive> set hive.mapred.mode=nonstrict;
    hive> select * from psn22;
    OK
    psn22.id    psn22.name    psn22.likes    psn22.address    psn22.age    psn22.sex
    1    小明1    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    18    boy
    2    小明2    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    20    man
    5    小明5    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    21    boy
    3    小明3    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    21    boy
    6    小明6    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    21    man
    4    小明4    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    21    man
    Time taken: 0.186 seconds, Fetched: 6 row(s)
    hive> set hive.mapred.mode=strict;
    hive> select * from psn22;
    FAILED: SemanticException [Error 10056]: Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.no.partition.filter to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features. No partition predicate for Alias "psn22" Table "psn22"
    hive> select * from psn22 where age=18 and sex='boy';
    OK
    psn22.id    psn22.name    psn22.likes    psn22.address    psn22.age    psn22.sex
    1    小明1    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    18    boy
    Time taken: 0.282 seconds, Fetched: 1 row(s)
    hive> 

    2. An ORDER BY statement must include a LIMIT clause;

    hive> set hive.mapred.mode=strict;
    hive> select * from psn21 order by id;
    FAILED: SemanticException 1:29 Order by-s without limit are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'id'
    hive> select * from psn21 order by id limit 2;
    Automatically selecting local only mode for query
    Query ID = root_20190219143842_b465a76f-a890-4bdc-aa76-b713c3ea13c0
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Job running in-process (local Hadoop)
    2019-02-19 14:38:43,896 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_local1585589360_0001
    MapReduce Jobs Launched: 
    Stage-Stage-1:  HDFS Read: 1696 HDFS Write: 261 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    psn21.id    psn21.name    psn21.age    psn21.sex    psn21.likes    psn21.address
    1    小明1    18    boy    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}
    2    小明2    20    man    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}
    Time taken: 1.89 seconds, Fetched: 2 row(s)
    hive>

    3. Queries that perform a Cartesian product are restricted, as the sketch below shows.
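
    A hedged illustration of the third restriction, reusing the psn21 and psn22 tables from above; under strict mode a join without an ON condition is rejected, while an equi-join (plus the partition filter that psn22 also needs) goes through:

      set hive.mapred.mode=strict;
      -- rejected: no join condition, so this is a Cartesian product
      select * from psn21 t1 join psn22 t2 where t2.age = 18;
      -- accepted: an equi-join condition, plus the partition filter strict mode requires on psn22
      select * from psn21 t1 join psn22 t2 on t1.id = t2.id where t2.age = 18;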

    5. Hive Sorting
    - Order By - fully orders the query result; a single reducer processes all the data
    (use with caution on large data sets; in strict mode it must be combined with LIMIT)
    - Sort By - sorts the data within each individual reducer
    - Distribute By - controls which reducer each row goes to; often combined with Sort By
    - Cluster By - equivalent to Distribute By + Sort By
    (Cluster By cannot specify an asc/desc sort direction;
    use "distribute by column sort by column asc|desc" instead, as sketched below)
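
    A minimal sketch of that last point, reusing the psn21 table from above (the reducer count is illustrative):

      set mapreduce.job.reduces=2;  -- illustrative: spread the rows over two reducers
      -- cluster by id is shorthand for distribute by id sort by id, ascending only:
      select id, name from psn21 cluster by id;
      -- for a descending sort, spell the two clauses out instead:
      select id, name from psn21 distribute by id sort by id desc;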


    6. Hive Join (eliminating the MapReduce shuffle)
    When computing a join, put the small table (the driving table) on the left side of the join.
    Map Join: the join is completed entirely on the map side.
    Two ways to get one:
    1. The SQL way: add a MAPJOIN marker (mapjoin hint) to the statement.
    Syntax:
    SELECT /*+ MAPJOIN(smallTable) */ smallTable.key, bigTable.value
    FROM smallTable JOIN bigTable ON smallTable.key = bigTable.key;
    2. Enable automatic map join
    by changing the following setting:
    set hive.auto.convert.join = true;
    (when this parameter is true, Hive automatically checks the size of the left-hand table; if it is a small table it is loaded into memory, i.e., a map join is used for the small table)

    Related parameters (a configuration sketch follows the list):
    hive.mapjoin.smalltable.filesize
    (the size threshold separating small tables from large ones; a table below this value is loaded into memory)
    hive.ignore.mapjoin.hint
    (default: true; whether to ignore the MAPJOIN hint)
    hive.auto.convert.join.noconditionaltask
    (default: true; when converting common joins to map joins, whether to merge multiple map joins into a single one)
    hive.auto.convert.join.noconditionaltask.size
    (the maximum combined table size when merging multiple map joins into one)
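
    A minimal sketch of enabling automatic map join (the byte values are illustrative; the defaults vary by Hive version):

      set hive.auto.convert.join=true;                             -- convert eligible joins into map joins
      set hive.mapjoin.smalltable.filesize=25000000;               -- illustrative ~25MB small-table threshold
      set hive.auto.convert.join.noconditionaltask=true;           -- allow merging several map joins into one
      set hive.auto.convert.join.noconditionaltask.size=10000000;  -- illustrative cap on the merged tables' total size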


    7. Map-Side Aggregation (the equivalent of a combiner in MapReduce)
    Enable map-side aggregation with the following parameter:
    set hive.map.aggr=true;

    Related parameters:
    hive.groupby.mapaggr.checkinterval:
    the number of rows a map-side GROUP BY aggregates per check (default: 100000)
    hive.map.aggr.hash.min.reduction:
    the minimum aggregation ratio (the first 100000 rows are pre-aggregated; if rows-after-aggregation / 100000 is greater than this setting, default 0.5, map-side aggregation is abandoned)
    hive.map.aggr.hash.percentmemory:
    the maximum share of memory map-side aggregation may use
    hive.map.aggr.hash.force.flush.memory.threshold:
    the maximum memory available to the map-side aggregation hash table; exceeding it triggers a flush
    hive.groupby.skewindata:
    whether to optimize against data skew produced by GROUP BY (default: false)
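
    A minimal sketch tying these parameters together; the checkinterval and min.reduction values shown are the defaults quoted above, and the GROUP BY reuses the psn22 table from earlier:

      set hive.map.aggr=true;                         -- pre-aggregate on the map side, like a combiner
      set hive.groupby.mapaggr.checkinterval=100000;  -- rows sampled before deciding whether to keep aggregating
      set hive.map.aggr.hash.min.reduction=0.5;       -- give up map-side aggregation above this ratio
      set hive.groupby.skewindata=true;               -- optional: counter GROUP BY data skew
      select age, count(*) from psn22 group by age;   -- benefits from the map-side partial aggregation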


    8. Controlling the Number of Maps and Reduces in Hive
    Map-related parameters (a sketch follows the list):
    mapred.max.split.size
    the maximum size of a split, i.e., the maximum amount of data each map task processes
    mapred.min.split.size.per.node
    the minimum split size on a single node
    mapred.min.split.size.per.rack
    the minimum split size on a single rack

    Reduce-related parameters:
    mapred.reduce.tasks
    forces a specific number of reduce tasks
    hive.exec.reducers.bytes.per.reducer
    the amount of data each reduce task processes
    hive.exec.reducers.max
    the maximum number of reducers per job
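
    A minimal sketch (all values illustrative): a smaller maximum split size yields more map tasks, while the reduce-side settings bound or fix the reducer count:

      set mapred.max.split.size=67108864;                  -- illustrative 64MB per split, i.e. more mappers
      set hive.exec.reducers.bytes.per.reducer=268435456;  -- illustrative 256MB of input per reducer
      set hive.exec.reducers.max=32;                       -- illustrative ceiling on reducers per job
      -- or bypass the heuristics and force an exact count:
      set mapred.reduce.tasks=4;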

    9. Hive - JVM Reuse
    When it helps:
    1. very many small files
    2. very many tasks

    Enable it with set mapred.job.reuse.jvm.num.tasks=n; (n is the number of task slots), as sketched below.
    Drawback: once enabled, the task slots hold on to their resources whether or not a task is running,
    and nothing is released until every task, i.e. the whole job, has finished!
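
    A minimal sketch (the slot count of 10 is illustrative):

      set mapred.job.reuse.jvm.num.tasks=10;  -- each JVM is reused for up to 10 tasks before exiting
      set mapred.job.reuse.jvm.num.tasks=1;   -- restore the default of one task per JVM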
