  • Learning Hive Together: A Summary of Common Hive Optimization Tips

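    Today I am summarizing some optimization tips I have picked up while using Hive, in the hope that they help. Hive tuning is a good measure of an engineer's technical skill, and it is one of the topics interviewers most like to ask about.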

    Tip 1. Control the number of reducers

    The following is printed whenever we run SQL from the Hive command line and the query launches a job with a reduce stage:

    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    

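    Many people wonder what these lines are for. Let's go through them one by one, starting with the first.

    set hive.exec.reducers.bytes.per.reducer=<number> is a Hive command that sets the maximum number of bytes each reducer processes while the SQL runs. It can be set in the configuration file or directly on the command line. Whenever the amount of data to process exceeds number, Hive adds another reducer. For example, with number = 100 KB and 1 MB of input, about 10 reducers are created. Let's check whether this holds: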

    1. Run set hive.exec.reducers.bytes.per.reducer=200000; to set the maximum number of bytes per reducer to 200000.
    2. Run the SQL:
    select user_id,count(1) as cnt 
      from orders group by user_id limit 20; 
    

    Running the SQL above prints the following to the console:

      Number of reduce tasks not specified. Estimated from input data size: 159
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
    Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0020
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 159
    

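    The first line of the console output reads: Number of reduce tasks not specified. Estimated from input data size: 159. In other words, no reducer count was specified, so Hive estimated 159 reduce tasks from the size of the input. The last line, number of mappers: 1; number of reducers: 159, confirms that the SQL ended up with 159 reducers. So if we know the size of the data, we can control the number of reducers simply by setting hive.exec.reducers.bytes.per.reducer.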
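
    As a rough sketch of how the estimate responds to this setting, assume the estimate is approximately the input size divided by the bytes per reducer, rounded up. That assumption is consistent with the log above but simplifies what Hive actually does. Under it, 159 reducers at 200000 bytes each implies somewhat under 32 MB of input, so a value ten times larger should bring the estimate down to about 16. The new value 2000000 is only an illustration.

    ```sql
    -- Sketch only: raise the per-reducer budget to 10x the value used above.
    set hive.exec.reducers.bytes.per.reducer=2000000;

    -- Same query as before; compare the "Estimated from input data size"
    -- line in the console output with the earlier run.
    select user_id, count(1) as cnt
      from orders group by user_id limit 20;
    ```

    Lowering the value has the opposite effect and produces more, smaller reducers.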

    Next comes
    set hive.exec.reducers.max=<number>. This is also a Hive command; it sets the maximum number of reducers Hive will use. If we set number to 50, at most 50 reducers are used.
    Let's check whether this holds:

    1. Run set hive.exec.reducers.max=8; to cap the number of reducers at 8.
    2. Run the same SQL again:
    select user_id,count(1) as cnt 
      from orders group by user_id limit 20; 
    

    The console prints the following:

    Number of reduce tasks not specified. Estimated from input data size: 8
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
    Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0020
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 8
    

    The first line of the console output reads: Number of reduce tasks not specified. Estimated from input data size: 8. The number of reducers is 8, which confirms the claim: set hive.exec.reducers.max=8; sets an upper bound on the number of reducers.

    Finally, consider set mapreduce.job.reduces=<number>. This Hive command sets the number of reducers directly, that is, how many reducers process the data when the SQL runs. We verify it the same way as above.

    1. Run set mapreduce.job.reduces=5; to set the number of reducers to 5.
    2. Run the same SQL again:
    select user_id,count(1) as cnt 
      from orders group by user_id limit 20; 
    

    The console prints the following:

    Number of reduce tasks not specified. Defaulting to jobconf value of: 5
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1538917788450_0026, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0026/
    Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0026
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
    

    From the two lines Number of reduce tasks not specified. Defaulting to jobconf value of: 5 and number of mappers: 1; number of reducers: 5, we can see that 5 reducers were created.

    If we change the number from 5 to 15 and run the same SQL again (select user_id,count(1) as cnt from orders group by user_id limit 20;), the console prints:

    Launching Job 1 out of 1
    Number of reduce tasks not specified. Defaulting to jobconf value of: 15
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1538917788450_0027, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0027/
    Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0027
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 15
    

    The number of reducers has changed from 5 to 15.

    To sum up, there are three ways to control the number of reducers in Hive:

    set hive.exec.reducers.bytes.per.reducer=<number> 
    set hive.exec.reducers.max=<number>
    set mapreduce.job.reduces=<number>
    

    Of these, set mapreduce.job.reduces=<number> has the highest priority, set hive.exec.reducers.max=<number> comes next, and set hive.exec.reducers.bytes.per.reducer=<number> has the lowest priority. Since Hive 0.14, the default amount of data handled by one reducer is 256 MB.
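
    The precedence can be sketched in a single session as follows. The values are the ones used in the experiments above, and the comments describe the behavior observed there, not captured output. Resetting mapreduce.job.reduces to -1, which Hive treats as "not specified", is an addition that the experiments above did not exercise.

    ```sql
    -- Lowest priority: only feeds the size-based estimate.
    set hive.exec.reducers.bytes.per.reducer=200000;

    -- Middle priority: caps whatever the estimate produces.
    set hive.exec.reducers.max=8;

    -- Highest priority: a fixed count, used as-is (15 was accepted above
    -- even though it is larger than the cap of 8).
    set mapreduce.job.reduces=15;

    -- Assumption: -1 means "not specified", so Hive goes back to estimating
    -- from the input size, limited by hive.exec.reducers.max.
    set mapreduce.job.reduces=-1;
    ```

    A fixed count left over in the session silently overrides the other two settings for every later query, so it is worth resetting once it is no longer needed.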

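    More reducers is not always better. Each reducer writes its own output file, so many reducers mean many files, and a large number of small files wastes HDFS resources, since every file takes up metadata on the NameNode. With too few reducers, on the other hand, a single reducer may end up processing a huge amount of data (this is also what happens under data skew), which fails to exploit Hadoop's divide-and-conquer model and can even cause out-of-memory (OOM) errors. How many reducers to use depends on the business scenario, and different scenarios call for different approaches.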

    Tip 2. Use map join

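    When a SQL statement joins several tables and one of them is smaller than 1 GB, a map join can noticeably speed up the query. If even the smallest table is larger than 1 GB, a map join tends to fail with an OOM error, because the small table has to be loaded into memory.
    Usage: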

    select /*+ MAPJOIN(a) */ a.*,b.* from table_a a join table_b b on a.id = b.id
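
    As an alternative to the hint, Hive can be asked to convert a join to a map join on its own when the small side is below a size threshold. The sketch below assumes a Hive version that supports these two properties. The threshold is in bytes, and 25000000 is believed to be the stock default, so treat the value as an assumption to tune, not a recommendation. On versions where hive.ignore.mapjoin.hint defaults to true, the hint form above may be ignored unless that property is set to false.

    ```sql
    -- Let the optimizer pick map joins automatically.
    set hive.auto.convert.join=true;

    -- Tables below this many bytes are treated as "small" (assumed default).
    set hive.mapjoin.smalltable.filesize=25000000;

    -- Same join as above, without the hint.
    select a.*, b.*
      from table_a a
      join table_b b on a.id = b.id;
    ```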
    

    Tip 3. Use distinct + union all instead of union

    When you need union to remove duplicates, distinct + union all tends to perform better than union.
    Usage of distinct + union all:

    select count(*)
    from (
    select distinct order_id,user_id,order_type
    from (
    select order_id,user_id,order_type from orders where order_type='0' union all
    select order_id,user_id,order_type from orders where order_type='1' union all
    select order_id,user_id,order_type from orders where order_type='1'
    )a
    )t;
    

    Usage of union:

    select count(*) 
    from(
    select order_id,user_id,order_type from orders where order_type='0' union
    select order_id,user_id,order_type from orders where order_type='0' union
    select order_id,user_id,order_type from orders where order_type='1')t;
    

    Tip 4. A general fix for data skew

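    Symptoms of data skew: the job sits at 99% for a long time, only a few reducer tasks have finished, and the unfinished tasks read and write a very large amount of data, more than 10 GB. It happens frequently in aggregations.
    General fix: set hive.groupby.skewindata=true;
    This splits one MapReduce job into two: the first spreads rows randomly across reducers for partial aggregation, and the second merges the partial results by key.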
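
    A minimal sketch of applying the setting to a skewed aggregation. The query is the user_log example used in this tip, and enabling hive.map.aggr alongside it (map-side partial aggregation) is an extra assumption that the text above does not require.

    ```sql
    -- Map-side partial aggregation (assumed to be helpful here; often on by default).
    set hive.map.aggr=true;

    -- Split the group-by into two jobs: random distribution, then merge by key.
    set hive.groupby.skewindata=true;

    select t.user_id, count(*) as cnt
      from user_log t
     group by t.user_id;
    ```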

    Here is a case I ran into: I needed to count each user's visits for a given day, with the following SQL:

    select t.user_id,count(*) from user_log t group by t.user_id
    

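    After running this statement, the job stayed at 99% for a full hour. When I looked into the user_log table, I found that user_id was null in a large share of the rows. All rows with a null user_id go to a single reducer, which causes the skew. There are two ways to fix it:
    1. Filter out the rows whose user_id is null with a where clause.
    2. Replace each null user_id with a random value, so that the data is spread evenly across all reducers.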
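
    The two fixes could look like the sketch below. The first query is the filter from option 1. The second is a variation on option 2, not the author's exact method: overwriting null ids with random values would split the null count across several made-up ids, so this version adds a random salt column, aggregates per (user_id, salt), and then sums the partial counts. The bucket count of 16 and the names salt, s and p are hypothetical.

    ```sql
    -- Option 1: drop rows with a null key before aggregating.
    select t.user_id, count(*) as cnt
      from user_log t
     where t.user_id is not null
     group by t.user_id;

    -- Option 2 (variation): two-stage aggregation on a salted key.
    select user_id, sum(cnt) as cnt
      from (
        -- Stage 1: each key, including null, is spread over up to 16 groups.
        select user_id, salt, count(*) as cnt
          from (
            select user_id, cast(floor(rand() * 16) as int) as salt
              from user_log
          ) s
         group by user_id, salt
      ) p
     -- Stage 2: merge the partial counts back into one row per user_id.
     group by user_id;
    ```

    Option 1 changes the result, because the null rows disappear from the output. Option 2 keeps them as a single row with a null user_id.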

  • Original article: https://www.cnblogs.com/airnew/p/9808514.html