  • ETL: HiveSQL Tuning (Setting the Number of Map and Reduce Tasks)

    Preface:

    Recently I noticed that some HiveSQL statements run particularly slowly. A previous post already covered optimizing left joins and unions; below we look at how increasing or decreasing the number of map and reduce tasks can speed up HiveSQL.

    Reference: http://www.cnblogs.com/liqiu/p/4873238.html

    Analysis:

    select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10'; 

    As shown in the previous post, this query uses 8 mappers and 1 reducer and takes 52 seconds. Detailed logs: http://www.cnblogs.com/liqiu/p/4873238.html

    Increasing the number of reducers:

    First, how Hive picks the default number of reducers: each reduce task handles hive.exec.reducers.bytes.per.reducer bytes of input (default 1000^3 = 1G), and a job may use at most hive.exec.reducers.max reducers (default 999).

    That is, if the total reduce input (the map output) is no larger than 1G, only one reduce task is launched.
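
    Put together, the estimate roughly follows this rule (a sketch; the exact behavior may differ slightly between Hive versions):

    estimated reducers = min( ceil(total input bytes / hive.exec.reducers.bytes.per.reducer),
                              hive.exec.reducers.max )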

    Since the table b2c_money_trace is about 2.4G, the reducer count comes out to 3, for example:

    hive> select count(1) from b2c_money_trace where operate_time = '2015-10-10' group by operate_time;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 3
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_1434099279301_3623421, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3623421/
    Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3623421
    Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 3
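
    If more reduce parallelism is wanted without hard-coding a task count, the per-reducer load can be lowered instead. A minimal sketch (the 500000000 value is just an illustration, not taken from the original run):

    -- lower the per-reducer load (illustrative value, roughly 500M)
    set hive.exec.reducers.bytes.per.reducer=500000000;
    -- Hive now estimates ceil(input size / 500M) reducers, still capped by hive.exec.reducers.max
    select count(1) from b2c_money_trace where operate_time = '2015-10-10' group by operate_time;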

    Back to the example at the beginning, explicitly set the reducer count, e.g.:

    set mapred.reduce.tasks = 8; 

    The result of running it:

    hive> set mapred.reduce.tasks = 8;                                                                                                    
    hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Defaulting to jobconf value of: 8
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Cannot run job locally: Input Size (= 380265495) is larger than hive.exec.mode.local.auto.inputbytes.max (= 50000000)
    Starting Job = job_1434099279301_3618454, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3618454/
    Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3618454
    Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 8
    2015-10-14 15:31:55,570 Stage-1 map = 0%,  reduce = 0%
    2015-10-14 15:32:01,734 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.63 sec
    2015-10-14 15:32:02,760 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 10.93 sec
    2015-10-14 15:32:03,786 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 10.93 sec
    2015-10-14 15:32:04,812 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
    2015-10-14 15:32:05,837 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
    2015-10-14 15:32:06,892 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
    2015-10-14 15:32:07,947 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
    2015-10-14 15:32:08,983 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
    2015-10-14 15:32:10,039 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
    2015-10-14 15:32:11,088 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
    2015-10-14 15:32:12,114 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
    2015-10-14 15:32:13,143 Stage-1 map = 75%,  reduce = 19%, Cumulative CPU 24.28 sec
    2015-10-14 15:32:14,170 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 27.94 sec
    2015-10-14 15:32:15,197 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 27.94 sec
    2015-10-14 15:32:16,224 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 28.58 sec
    2015-10-14 15:32:17,250 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 28.95 sec
    2015-10-14 15:32:18,277 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 37.02 sec
    2015-10-14 15:32:19,305 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 48.93 sec
    2015-10-14 15:32:20,332 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 49.31 sec
    2015-10-14 15:32:21,359 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 57.99 sec
    2015-10-14 15:32:22,385 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 61.88 sec
    2015-10-14 15:32:23,411 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 71.56 sec
    2015-10-14 15:32:24,435 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 71.56 sec
    MapReduce Total cumulative CPU time: 1 minutes 11 seconds 560 msec
    Ended Job = job_1434099279301_3618454
    MapReduce Jobs Launched: 
    Job 0: Map: 8  Reduce: 8   Cumulative CPU: 71.56 sec   HDFS Read: 380267639 HDFS Write: 330 SUCCESS
    Total MapReduce CPU Time Spent: 1 minutes 11 seconds 560 msec
    OK
    9718    210296076
    9718    210299105
    9718    210295344
    9718    210295277
    9718    210295586
    9718    210295050
    9718    210301363
    9718    210297733
    9718    210298066
    9718    210295566
    9718    210298219
    9718    210296438
    9718    210298328
    9718    210298008
    9718    210299712
    9718    210295239
    9718    210297567
    9718    210295525
    9718    210294949
    9718    210296318
    9718    210294421
    9718    210295840
    Time taken: 36.978 seconds, Fetched: 22 row(s)

    With 8 reducers the reduce phase is clearly faster: total time drops from 52 seconds to about 37 seconds.
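
    One practical note: once the experiment is finished, the fixed reducer count can be handed back to Hive's estimator; a sketch (-1 is the conventional "let Hive decide" value):

    -- restore automatic reducer estimation (-1 means "let Hive decide")
    set mapred.reduce.tasks=-1;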

    Increasing the number of mappers:

    Table size:

    The mapper count cannot be illustrated with the example above, so look at this table instead:

    hive> dfs -ls -h /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace;
    Found 4 items
    -rw-r--r--   3 ticketdev ticketdev    600.0 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/24f19a74-ca91-4fb2-9b79-1b1235f1c6f8
    -rw-r--r--   3 ticketdev ticketdev    597.2 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/34ca13a3-de44-402e-9548-e6b9f92fde67
    -rw-r--r--   3 ticketdev ticketdev    590.6 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/ac249f44-60eb-4bf7-9c1a-6f643873b823
    -rw-r--r--   3 ticketdev ticketdev    606.5 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/f587fec9-60da-4f18-8b47-406999d95fd1

    About 2.4G in total.

    HDFS block size:

    hive> set dfs.block.size;
    dfs.block.size=134217728

    Note: 134217728 bytes means 128M (134217728 / 1024 / 1024 = 128).

    Mapper count:

    The data is four files of roughly 600M each and each block is 128M, so ceil(600 / 128) = 5 splits per file, and 5 * 4 = 20 mappers:

    hive> select count(1) from b2c_money_trace;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_1434099279301_3620170, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3620170/
    Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3620170
    Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 1

    Note the job information above: the number of mappers is 20.

    Now set the split size used to divide the input for the mappers:

    set mapred.max.split.size=50000000;
    set mapred.min.split.size.per.node=50000000;
    set mapred.min.split.size.per.rack=50000000;
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

    A rough explanation:

    50000000 bytes is roughly 50M;

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; tells Hive to merge small files into combined splits before execution (not really exercised here, since these files are large).

    The other three parameters tell Hive to cut the input into splits of roughly 50M.

    Result:

    hive> select count(1) from b2c_money_trace;       
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_1434099279301_3620223, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3620223/
    Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3620223
    Hadoop job information for Stage-1: number of mappers: 36; number of reducers: 1

    Each file is about 600M, so a ~50M split size produces roughly 12 splits per file; after CombineHiveInputFormat merges splits the job ends up with 36 mappers instead of 20, as shown in the job information above.
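
    The same knobs work in the other direction as well: raising the split size makes CombineHiveInputFormat build fewer, larger splits and therefore fewer mappers. A hedged sketch (the 256000000 value, roughly 256M, is only an illustration):

    -- let CombineHiveInputFormat build larger splits (illustrative value, roughly 256M)
    set mapred.max.split.size=256000000;
    set mapred.min.split.size.per.node=256000000;
    set mapred.min.split.size.per.rack=256000000;
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- with ~256M splits, the same ~2.4G input should need only around 10 mappers
    select count(1) from b2c_money_trace;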

    Conclusion:

    More map and reduce tasks are not automatically better: more tasks consume more resources, and the overall runtime does not necessarily improve. Tune them to a reasonable number based on the actual workload.

    References

    http://lxw1234.com/archives/2015/04/15.htm
