  • Hive optimization

    prepare:
    ------------

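    CLUSTERED BY ... INTO num_buckets BUCKETS: groups rows so that they land in different buckets.
    SKEWED BY: for skewed data, names the values on which the data is skewed so that Hive can optimize for them.

    A fairly comprehensive analysis of Hive optimization.

    How to configure YARN memory:
    a script is provided that generates reference configuration values.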

    With the following options:

    Option       Description
    -c CORES     The number of cores on each host.
    -m MEMORY    The amount of memory on each host in GB.
    -d DISKS     The number of disks on each host.
    -k HBASE     "True" if HBase is installed, "False" if not.

    Note: You can also use the -h or --help option to display a Help message that describes the options.

    Running the following command:

    [root@jason3 scripts]# python yarn-utils.py -c 24 -m 64 -d 12 -k False
    Using cores=24 memory=64GB disks=12 hbase=False
    Profile: cores=24 memory=57344MB reserved=8GB usableMem=56GB disks=12
    Num Container=22
    Container Ram=2560MB
    Used Ram=55GB
    Unused Ram=8GB
    yarn.scheduler.minimum-allocation-mb=2560
    yarn.scheduler.maximum-allocation-mb=56320
    yarn.nodemanager.resource.memory-mb=56320
    mapreduce.map.memory.mb=2560
    mapreduce.map.java.opts=-Xmx2048m
    mapreduce.reduce.memory.mb=2560
    mapreduce.reduce.java.opts=-Xmx2048m
    yarn.app.mapreduce.am.resource.mb=2560
    yarn.app.mapreduce.am.command-opts=-Xmx2048m
    mapreduce.task.io.sort.mb=1024
    [root@jason3 scripts]#
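The numbers in this output can be reproduced with a few lines of arithmetic. The sketch below is not the yarn-utils.py source; it is a reconstruction that assumes the commonly documented sizing rule (containers limited by 2 x cores, 1.8 x disks, and usable memory divided by a minimum container size). The `reserved_gb=8` and `min_container_mb=2048` defaults are assumptions chosen to match the profile line above, and `yarn_sizing` is a hypothetical helper name.

```python
import math


def yarn_sizing(cores, memory_gb, disks, reserved_gb=8, min_container_mb=2048):
    """Approximate the container sizing printed by the script run above.

    reserved_gb and min_container_mb are assumptions; the real script
    derives them from the host's total memory.
    """
    usable_mb = (memory_gb - reserved_gb) * 1024
    # The container count is capped by cores, by disks, and by memory.
    containers = min(2 * cores,
                     math.ceil(1.8 * disks),
                     usable_mb // min_container_mb)
    # Memory per container, rounded down to a multiple of 512 MB.
    container_mb = max(min_container_mb, usable_mb // containers)
    container_mb = (container_mb // 512) * 512
    heap_mb = container_mb * 8 // 10  # JVM heap is 80% of the container
    return {
        "containers": containers,
        "yarn.scheduler.minimum-allocation-mb": container_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * container_mb,
        "yarn.nodemanager.resource.memory-mb": containers * container_mb,
        "mapreduce.map.memory.mb": container_mb,
        "mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
        "mapreduce.task.io.sort.mb": container_mb * 4 // 10,
    }


settings = yarn_sizing(cores=24, memory_gb=64, disks=12)
for name, value in settings.items():
    print(f"{name}={value}")
```

With 12 disks the disk term (22) is the binding limit, which is why the run above reports 22 containers of 2560 MB each.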
    mapjoin:
    • hive.auto.convert.join - If set to true, Hive automatically converts joins to mapjoins at runtime where possible. It should be used instead of the mapjoin hint.
    • hive.auto.convert.join.noconditionaltask - Whether Hive enables the optimization that converts a common join into a mapjoin based on input file size. If this parameter is on, and the combined size of n-1 of the tables/partitions in an n-way join is smaller than the specified size, the join is converted directly to a mapjoin (there is no conditional task).
    • hive.auto.convert.join.noconditionaltask.size - If hive.auto.convert.join.noconditionaltask is off, this parameter has no effect. If it is on, and the combined size of n-1 of the tables/partitions in an n-way join is smaller than this size, the join is converted directly to a mapjoin (there is no conditional task). The default is 10MB.
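The size rule described in these settings can be sketched as follows. This is only an illustration of the "n-1 inputs under the threshold" test, not Hive's optimizer code: `converts_to_mapjoin` is a hypothetical name, the 10 MB default is taken as 10,000,000 bytes by assumption, and the real planner also considers statistics and join types.

```python
def converts_to_mapjoin(input_sizes_bytes, threshold_bytes=10_000_000):
    """Return True if all inputs except the largest fit under the threshold.

    Mirrors the rule in the text: for an n-way join, the n-1 smaller
    inputs are held in memory and the largest one is streamed.
    """
    if len(input_sizes_bytes) < 2:
        return False  # not a join
    small_inputs = sorted(input_sizes_bytes)[:-1]  # drop the largest input
    return sum(small_inputs) < threshold_bytes


# A 50 GB fact table joined with two small dimension tables.
print(converts_to_mapjoin([50_000_000_000, 4_000_000, 3_000_000]))  # 7 MB of small inputs
print(converts_to_mapjoin([50_000_000_000, 8_000_000, 3_000_000]))  # 11 MB of small inputs
```

The first call passes the check because the two small tables total 7 MB; the second fails because they total 11 MB, so raising the threshold would be needed to get a mapjoin there.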



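      1. System resource statistics. Use tools such as top and sysstat to monitor resource usage across the whole system.
      2. Binary instrumentation. Many tools exist for this, such as hprof, jprof, and btrace. btrace in particular is worth a look: it can insert profiling code dynamically.
      3. JMX bean information provided by Hadoop. JMX is a Java standard for monitoring and management, and the Hadoop code exposes some key information through JMX interfaces.
      4. Hadoop logs. There are dedicated Hadoop analysis tools for these, such as Vaidya and Kahuna, and many general-purpose log analysis tools as well.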
         


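      Hive及Hadoop作业调优.pdf (Hive and Hadoop job tuning)
      -------------
      mapred.map.tasks: the desired number of map tasks. Default: 1. Raising it can increase the number of maps.
      mapred.min.split.size: the minimum size of a split. Default: 1. Raising it reduces the number of maps.
      mapred.max.split.size: the maximum size of a split. Default: Long.MAX_VALUE. Lowering it increases the number of maps.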
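To see how the two split-size settings move the map count, here is a rough sketch. It assumes the usual Hadoop FileInputFormat rule, split size = max(minSize, min(maxSize, blockSize)), applied to a single splittable file; `split_size` and `estimated_maps` are hypothetical helper names, and Hive input formats that combine small files will give different counts.

```python
LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE, the default upper bound


def split_size(block_size, min_split=1, max_split=LONG_MAX):
    """Split size under the assumed rule max(min, min(max, block))."""
    return max(min_split, min(max_split, block_size))


def estimated_maps(file_size, block_size, min_split=1, max_split=LONG_MAX):
    """Approximate number of map tasks for one splittable file."""
    size = split_size(block_size, min_split, max_split)
    return (file_size + size - 1) // size  # integer ceiling division


MIB = 1024 * 1024
one_gib_file = 1024 * MIB
block = 128 * MIB

print(estimated_maps(one_gib_file, block))                       # defaults: one map per block
print(estimated_maps(one_gib_file, block, max_split=1_000_000))  # smaller splits, more maps
print(estimated_maps(one_gib_file, block, min_split=256 * MIB))  # larger splits, fewer maps
```

Under these assumptions a 1 GiB file goes from 8 maps by default to over a thousand with a 1 MB maximum split, and down to 4 with a 256 MiB minimum split.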







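      -- Cap splits at about 1 MB, which raises the number of map tasks.
      set mapred.max.split.size=1000000;
      -- Join bucketed tables bucket by bucket on the map side.
      set hive.optimize.bucketmapjoin=true;
      -- Use a sort-merge join when the buckets are also sorted on the join key.
      set hive.optimize.bucketmapjoin.sortedmerge=true;
      -- Two-stage aggregation for skewed GROUP BY keys; may reduce performance for data that is not heavily skewed.
      set hive.groupby.skewindata=true;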

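      Storage File Format: the default noted here is SequenceFile (stock Hive defaults to TextFile unless hive.default.fileformat is changed); ORC is the columnar alternative. Compression is a trade-off between CPU and disk/IO.

      Partitioning: data is split ahead of time by a field, so WHERE filters on that field become faster.

      Bucketing: data is hash-partitioned ahead of time on chosen keys, so operations that need a reduce phase, such as GROUP BY and JOIN, become faster (reduce uses hash partitioning by default).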
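The reason bucketing helps joins can be shown with a toy model. The sketch below assumes integer keys whose hash is the value itself and a bucket rule of (hash & 0x7FFFFFFF) % num_buckets; `bucket_for` and `bucketize` are hypothetical helpers, and the rows are made-up sample data.

```python
def bucket_for(key_hash, num_buckets):
    """Map a hash to a bucket index; the mask keeps the result non-negative."""
    return (key_hash & 0x7FFFFFFF) % num_buckets


def bucketize(rows, key_index, num_buckets):
    """Place each row in the bucket chosen by its key column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key_index], num_buckets)].append(row)
    return buckets


# Two tables bucketed on the same key with the same bucket count.
orders = [(100, "book"), (201, "pen"), (302, "ink"), (-7, "cup")]
users = [(100, "ann"), (201, "bob"), (302, "cy"), (-7, "dee")]

order_buckets = bucketize(orders, key_index=0, num_buckets=4)
user_buckets = bucketize(users, key_index=0, num_buckets=4)

# Matching keys always share a bucket index, so bucket i of one table
# only ever needs to be joined with bucket i of the other.
for i in range(4):
    print(i, order_buckets[i], user_buckets[i])
```

Because both sides use the same rule and the same bucket count, a join can proceed one bucket pair at a time instead of shuffling every row.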



      Benchmarking Apache Hive 13 for Enterprise Hadoop http://zh.hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/ 

      Tez and MapReduce were tuned to process all queries using 4 GB containers at a target container-to-disk ratio of 2.0. The ratio is important because it minimizes disk thrash and maximizes throughput.

      Other Settings:

      • yarn.nodemanager.resource.memory-mb was set to 49152
      • Default virtual memory for a job’s map and reduce tasks was set to 4096 MB
      • hive.tez.container.size was set to 4096
      • hive.tez.java.opts was set to -Xmx3800m
      • Tez app masters were given 8 GB
      • mapreduce.map.java.opts and mapreduce.reduce.java.opts were set to -Xmx3800m. This is smaller than 4096 to allow for some garbage collection overhead
      • hive.auto.convert.join.noconditionaltask.size was set to 1252698795

      Note: this is roughly 1/3 of the Xmx value, about 1.2 GB (1252698795 bytes).
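As a quick sanity check of the size arithmetic, the sketch below converts the values quoted in the settings list. It only restates numbers from the list (-Xmx3800m and 1252698795); the variable names are illustrative.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

xmx_bytes = 3800 * MIB          # -Xmx3800m from the settings above
third_of_xmx = xmx_bytes // 3   # one third of the heap, in bytes
configured = 1252698795         # hive.auto.convert.join.noconditionaltask.size

print(third_of_xmx)                   # one third of 3800 MiB, in bytes
print(round(configured / GIB, 2))     # configured size in GiB
print(round(configured * 3 / MIB))    # heap size the configured value is exactly 1/3 of, in MiB
```

The configured value works out to about 1.17 GiB, which is one third of 3584 MiB (3.5 GiB) and slightly below one third of the 3800 MiB heap.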

      The following additional optimizations were used for Hive 0.13.0:

      • Vectorized Query enabled
      • ORCFile formatted data
      • Map-join auto conversion enabled


      Hive Performance Tuning

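      Use Hive’s Mapjoin: either add the hint comment or enable automatic conversion:
      SELECT /*+ MAPJOIN(tbl2) */ ... FROM tbl1 join tbl2 on tbl1.key = tbl2.key
      DISTRIBUTE BY…SORT BY vs. ORDER BY: ORDER BY produces a total order, which forces all rows through a single reducer and is slow; DISTRIBUTE BY…SORT BY sorts only within each reducer.
      Avoid “SELECT count(DISTINCT field) FROM tbl”. Use this instead: SELECT count(1) FROM ( SELECT DISTINCT field FROM tbl ) t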
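The count(DISTINCT) rewrite can be checked for equivalence on a small table. The sketch below uses Python's built-in sqlite3 purely as a stand-in SQL engine to compare results; it says nothing about Hive performance, and `distinct_counts` is a hypothetical helper. One caveat under standard SQL semantics: count(DISTINCT field) ignores NULLs, while the subquery form counts NULL as one distinct row, so the two agree only when the column has no NULLs.

```python
import sqlite3


def distinct_counts(values):
    """Return (direct, rewritten) distinct counts for a one-column table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (field TEXT)")
    conn.executemany("INSERT INTO tbl VALUES (?)", [(v,) for v in values])
    direct = conn.execute(
        "SELECT count(DISTINCT field) FROM tbl").fetchone()[0]
    rewritten = conn.execute(
        "SELECT count(1) FROM (SELECT DISTINCT field FROM tbl) t").fetchone()[0]
    conn.close()
    return direct, rewritten


print(distinct_counts(["a", "b", "a", "c", "b"]))  # both forms agree without NULLs
print(distinct_counts(["a", None, "a"]))           # the forms differ when NULLs are present
```

If the column can contain NULLs, adding WHERE field IS NOT NULL to the inner query keeps the rewritten form consistent with the original.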







  • Original article: https://www.cnblogs.com/zwCHAN/p/4494053.html