  • Big Data Interview Questions: Knowledge Point Analysis (Part 7)

    Reposted from: https://blog.csdn.net/qq_26803795/article/details/79466592


    This post continues with Hive and works through everything related to Hive optimization:


    Hive optimization:

    1) Map optimization

        • Increase the number of map tasks:
            set mapred.map.tasks=10;
        • Decrease the number of map tasks (merge small files):
            set mapred.max.split.size=100000000;
            set mapred.min.split.size.per.node=100000000;
            set mapred.min.split.size.per.rack=100000000;
            set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

    Map-side aggregation (combiner):


            set hive.map.aggr=true;
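
        With map-side aggregation turned on, each mapper pre-aggregates its own rows before the shuffle, which shrinks the data sent to the reducers. A minimal sketch (the pv_log table and its columns are made up for illustration, not from the original post):

            set hive.map.aggr=true;
            -- rows to sample before Hive decides whether map-side aggregation is worthwhile
            set hive.groupby.mapaggr.checkinterval=100000;
            select url, count(1) as pv
            from pv_log
            group by url;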

    2) Reduce optimization

        • Set the number of reduce tasks:
            set mapred.reduce.tasks=10;
        • Set the amount of data each reduce task processes:
            set hive.exec.reducers.bytes.per.reducer=100000;

        • Avoid query constructs that may launch extra MapReduce stages:

            1) group by
            2) order by (use distribute by plus sort by instead; see the sketch after this list)
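
        As a sketch of the order by replacement above (the orders table and its columns are illustrative, not from the original post): order by forces a global sort through a single reducer, while distribute by plus sort by only sorts within each reducer's partition, so it scales but does not give a total order.

            -- avoid on large data: global sort, one reducer
            select uid, amount from orders order by amount desc;

            -- sketch of the rewrite: rows with the same uid go to the same reducer and are sorted there
            set mapred.reduce.tasks=10;
            select uid, amount
            from orders
            distribute by uid
            sort by amount desc;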

    3) Join optimization

        • Join ON conditions:

            SELECT a.val, b.val, c.val
            FROM a
            JOIN b ON (a.key = b.key1)
            JOIN c ON (a.key = c.key1)

        • Join order:
            /*+ STREAMTABLE(a) */ : a is treated as the large (streamed) table
            /*+ MAPJOIN(b) */ : b is treated as the small table

            SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val
            FROM a
            JOIN b ON (a.key = b.key1)
            JOIN c ON (c.key = b.key1);

    4) Data skew optimization

        • General-purpose switch (see the sketch below):
            hive.groupby.skewindata=true
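
        A minimal sketch of how the switch is used (click_log is a hypothetical skewed table, not from the original post). With hive.groupby.skewindata=true, Hive adds an extra MR stage that first spreads the skewed group-by keys randomly across reducers for partial aggregation and then merges the partial results:

            set hive.groupby.skewindata=true;
            select user_id, count(1) as clicks
            from click_log
            group by user_id;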
        • Small table joined with a large table (a map-join sketch follows):
            small_table join big_table
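
        One common way to handle the small-table / large-table case is a map join, so the skewed join key never reaches the reduce phase. The tables below are hypothetical:

            -- broadcast the small dimension table to every mapper (hint form, as in the later examples in this post)
            select /*+ MAPJOIN(d) */ f.order_id, d.city_name
            from big_orders f
            join small_city_dim d
            on f.city_id = d.city_id;

            -- or let Hive convert qualifying joins to map joins automatically
            set hive.auto.convert.join=true;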

        • Data with many 0 or NULL join keys:

            on case when (x.uid = '-' or x.uid = '0' or x.uid is null)
            then concat('dp_hive_search', rand()) else x.uid
            end = f.user_id;

        • Large table joined with another large table:

            select /*+ MAPJOIN(t12) */ *
            from dw_log t11
            join (
                select /*+ MAPJOIN(t) */ t1.*
                from (
                    select user_id from dw_log group by user_id
                ) t
                join dw_user t1
                on t.user_id = t1.user_id
            ) t12
            on t11.user_id = t12.user_id

        • count distinct with a large number of special values:

            select cast(count(distinct user_id) + 1 as bigint) as user_cnt
            from tab_a
            where user_id is not null and user_id <> ''

        • Trading space for time:

            select day,
                   count(case when type = 'session' then 1 else null end) as session_cnt,
                   count(case when type = 'user' then 1 else null end) as user_cnt
            from (
                select day, session_id, type
                from (
                    select day, session_id, 'session' as type
                    from log
                    union all
                    select day, user_id as session_id, 'user' as type
                    from log
                ) t0
                group by day, session_id, type
            ) t1
            group by day

    5) Other optimizations

        • Partition pruning:

            Partition conditions in the WHERE clause take effect early, so there is no need for a dedicated subquery; you can JOIN and GROUP BY directly.
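
        A small sketch of partition pruning (the event column is illustrative; ods_log and its dt partition reuse the naming from the Multi-Count Distinct example below): because dt is a partition column and appears directly in WHERE, Hive reads only the matching partition instead of scanning the whole table.

            select count(1)
            from ods_log
            where dt = 20170301
              and event = 'click';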

        • Cartesian product:

            When a JOIN has no ON condition (or an invalid one), Hive can only use a single reducer to compute the Cartesian product.
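
        For illustration (hypothetical tables): the first query below degenerates into a Cartesian product handled by one reducer, while the second keeps a real equi-join condition in the ON clause so the work can be spread across reducers.

            -- avoid: no usable join condition, a single reducer does all the work
            select a.uid, b.city_name from users a join cities b;

            -- prefer: a genuine join key in ON
            select a.uid, b.city_name from users a join cities b on a.city_id = b.city_id;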

        • Union all:

            Doing the UNION ALL first and then the JOIN or GROUP BY can effectively cut down the number of MR stages: several SELECTs end up needing only one MR job.
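
        A sketch of the point above (tables are hypothetical): pushing the UNION ALL into a subquery below the GROUP BY lets both branches feed one aggregation job, instead of aggregating each source separately and then combining the results.

            select uid, sum(amount) as total
            from (
                select uid, amount from orders_2016
                union all
                select uid, amount from orders_2017
            ) t
            group by uid;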

        • Multi-insert & multi-group by:

            Read one base table once and write out combinations along different dimensions in a single pass:

            FROM from_statement
            INSERT OVERWRITE TABLE table1 [PARTITION (partcol1=val1)] select_statement1 group by key1
            INSERT OVERWRITE TABLE table2 [PARTITION (partcol2=val2)] select_statement2 group by key2

        • Automatic merge:

            When output files are smaller than the threshold, Hive launches an extra MR job to merge them.

                hive.merge.mapfiles = true              whether to merge map-only output files; default true
                hive.merge.mapredfiles = false          whether to merge reduce output files; default false
                hive.merge.size.per.task = 256*1000*1000    target size of the merged files
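
        These are ordinary session settings; a minimal way to turn merging on for both map-only and map-reduce jobs, using the values listed above, would be:

            set hive.merge.mapfiles=true;
            set hive.merge.mapredfiles=true;
            set hive.merge.size.per.task=256000000;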

        • Multi-Count Distinct:

            count several distinct columns from one table (requires: set hive.groupby.skewindata=true;)

            select dt, count(distinct uniq_id), count(distinct ip)
            from ods_log where dt=20170301 group by dt
        • Parallel execution:

            Enable it in Hive with: set hive.exec.parallel=true
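
        A minimal sketch, assuming a query whose stages do not depend on each other (for example the UNION ALL branches shown earlier): with parallel execution enabled, Hive can run such stages at the same time. The thread-number property is a standard Hive setting, not something named in the original post.

            set hive.exec.parallel=true;
            set hive.exec.parallel.thread.number=8;  -- how many independent stages may run concurrently
            select uid, amount from orders_2016
            union all
            select uid, amount from orders_2017;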

     


