  • hive: a problem encountered with join

    I ran into a problem while joining tables:

    insert overwrite table BF_EVT_CRD_CRT_TRAD2
    select BF_EVT_CRD_CRT_TRAD.*, jjkdjk.CUST_NO, BF_AGT_CRD_CRT.OUT_CRD_INSTN_CD
    from BF_AGT_CRD_CRT
    join jjkdjk on (BF_AGT_CRD_CRT.CUST_NO = jjkdjk.pcust_no)
    join BF_EVT_CRD_CRT_TRAD on (BF_EVT_CRD_CRT_TRAD.CRD_NO = BF_AGT_CRD_CRT.CRD_NO);

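    Suppose that in a statement like this the large table has 3 billion rows and the small table only 100 rows, and that the large table is heavily skewed, with 1.5 billion rows on a single key. The job then runs extremely slowly, and the reduce phase fails with an out-of-memory error.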

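    Consider how a map join works:

    MAPJOIN reads the whole small table into memory and, in the map phase, matches the rows of the other table directly against that in-memory table. Because the join is done in the map phase, the reduce phase is skipped and the query runs much faster.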

    Approach:

    Row counts of the three tables:

    BF_AGT_CRD_CRT       count(*)  4031974
    jjkdjk               count(*)  3912676
    BF_EVT_CRD_CRT_TRAD  count(*)  251512826

    Use a hint to request the map join, for example:
    select f.a,f.b from A t join B f  on ( f.a=t.a and f.ftime=20110802)  
    becomes (the hint names the alias of the table to hold in memory, here t for A):
    select /*+ mapjoin(t) */ f.a,f.b from A t join B f  on ( f.a=t.a and f.ftime=20110802)
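
    One caveat worth checking, assuming Hive 0.11 or later: hive.ignore.mapjoin.hint defaults to true there, so a MAPJOIN hint is silently ignored and only automatic conversion decides whether a map join is used. A minimal sketch of making the hint take effect, reusing the toy tables A and B from the example above (the hint refers to the alias t, because A is aliased in the query):

    ```sql
    -- Assumption: Hive 0.11+, where hints are ignored unless this is set to false.
    set hive.ignore.mapjoin.hint=false;

    -- t (table A) is loaded into memory; B is streamed through the mappers.
    select /*+ mapjoin(t) */ f.a, f.b
    from A t
    join B f on (f.a = t.a and f.ftime = 20110802);
    ```

    The hinted table should be the one that really fits in memory; hinting a table with millions of wide rows only moves the out-of-memory failure into the local task.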
    

      

    insert overwrite table BF_EVT_CRD_CRT_TRAD2
    select /*+ mapjoin(BF_AGT_CRD_CRT) */ BF_EVT_CRD_CRT_TRAD.*, jjkdjk.CUST_NO, BF_AGT_CRD_CRT.OUT_CRD_INSTN_CD
    from BF_AGT_CRD_CRT
    join jjkdjk on (BF_AGT_CRD_CRT.CUST_NO = jjkdjk.pcust_no)
    join BF_EVT_CRD_CRT_TRAD on (BF_EVT_CRD_CRT_TRAD.CRD_NO = BF_AGT_CRD_CRT.CRD_NO);
    

    But it still fails:

    Total MapReduce jobs = 4
    2014-10-22 05:45:06     Starting to launch local task to process map join; maximum memory = 1065484288
    2014-10-22 05:45:42     Processing rows:        200000  Hashtable size: 199999      Memory usage:   82761296        percentage:     0.078
    2014-10-22 05:45:45     Processing rows:        300000  Hashtable size: 299999      Memory usage:   114515648       percentage:     0.107
    2014-10-22 05:45:47     Processing rows:        400000  Hashtable size: 399999      Memory usage:   148324312       percentage:     0.139
    .......
    2014-10-22 05:46:37     Processing rows:        2400000 Hashtable size: 2399999     Memory usage:   851355056       percentage:     0.799
    2014-10-22 05:46:46     Processing rows:        2500000 Hashtable size: 2499999     Memory usage:   888876848       percentage:     0.834
    2014-10-22 05:46:47     Processing rows:        2600000 Hashtable size: 2599999     Memory usage:   934695048       percentage:     0.877
    2014-10-22 05:46:48     Processing rows:        2700000 Hashtable size: 2699999     Memory usage:   973416544       percentage:     0.914
    Execution failed with exit status: 3
    Obtaining error information
    
    Task failed!
    Task ID:
      Stage-12
    
    Logs:
    
    /tmp/root/hive.log
    FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
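
    One way to read this log, offered as an interpretation rather than a confirmed diagnosis: the local task gives up right after the hash table passes 90% of its heap, which matches the default of hive.mapjoin.localtask.max.memory.usage (0.90, as far as I know). The arithmetic below uses only numbers from the log and the row counts listed earlier; the linear extrapolation is an assumption, and the log does not say which table was being loaded.

    ```sql
    -- Numbers copied from the last "Processing rows" line of the log above:
    --   973416544 bytes / 2700000 rows   = about 360 bytes per hash table row
    --   973416544 / 1065484288 max bytes = 0.914 of the local task heap
    -- Linear extrapolation (an assumption) to the full row counts:
    --   3912676 rows * 360 bytes = about 1.41 GB  (jjkdjk)
    --   4031974 rows * 360 bytes = about 1.45 GB  (BF_AGT_CRD_CRT)
    -- Both exceed the 1065484288-byte maximum reported by the local task.

    -- Print the abort threshold in effect for this session:
    set hive.mapjoin.localtask.max.memory.usage;
    ```

    Under these assumptions neither of the two smaller tables fits into the local task's memory, so the map join cannot succeed for this query without a larger local heap.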
    

      

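    Analysis of the cause:

    The job ran out of memory while Hive was automatically converting the join into a map join. The fix is to turn automatic conversion off. The default was false before Hive 0.11 and has been true since;

    so by default Hive runs with set hive.auto.convert.join = true;

    With automatic conversion on, Hive first tries to load the small table into memory and chooses between a common join and a map join based on the SQL. As a result, the number of MapReduce jobs and the memory they get are sized for the small table only, and the large table cannot be processed at all.

    The default value of hive.mapjoin.smalltable.filesize is about 25 MB (25000000 bytes).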
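
    Before changing anything it can help to look at what the session is actually using. A small sketch, assuming the Hive CLI or Beeline, where `set name;` without a value prints the current setting. Note that hive.mapjoin.smalltable.filesize is compared against the size of the table's files on disk, which can be far smaller than the in-memory hash table built from them, especially for compressed data.

    ```sql
    -- Print the current values instead of assuming the defaults.
    set hive.auto.convert.join;
    set hive.mapjoin.smalltable.filesize;

    -- Table Parameters in the output include totalSize (bytes on disk)
    -- when statistics have been collected for the table.
    describe formatted BF_AGT_CRD_CRT;
    ```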

    set mapreduce.map.memory.mb=2049;
    set mapreduce.reduce.memory.mb=20495;
    set hive.auto.convert.join=false;
    insert overwrite table BF_EVT_CRD_CRT_TRAD2
    select BF_EVT_CRD_CRT_TRAD.*, jjkdjk.CUST_NO, BF_AGT_CRD_CRT.OUT_CRD_INSTN_CD
    from BF_AGT_CRD_CRT
    join jjkdjk on (BF_AGT_CRD_CRT.CUST_NO = jjkdjk.pcust_no)
    join BF_EVT_CRD_CRT_TRAD on (BF_EVT_CRD_CRT_TRAD.CRD_NO = BF_AGT_CRD_CRT.CRD_NO);
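
    The scenario described at the start also involves heavy key skew, which turning off automatic conversion does not address by itself. As a hedged sketch, and not something tried in this post: Hive has a runtime skew-join optimization that can be enabled for a common join. The threshold below is the documented default, not a value tuned for these tables.

    ```sql
    -- Keep the common (reduce-side) join, as in the statement above.
    set hive.auto.convert.join=false;

    -- Handle keys with very many rows in a separate follow-up job.
    set hive.optimize.skewjoin=true;
    -- A key counts as skewed once it has more rows than this (default 100000).
    set hive.skewjoin.key=100000;

    -- ...then run the same insert overwrite statement as above.
    ```

    Whether this helps depends on how the skewed key is distributed across the joined tables, so it is worth comparing run times with and without it.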
    

      

  • Original post: https://www.cnblogs.com/kxdblog/p/4043242.html