  • [Original] Uncle's Troubleshooting Notes (17): Spark occasionally throws NullPointerException when querying ORC data

    Spark sometimes throws this error when querying ORC-format data:

    Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
    ... 47 more

    Following the stack trace into the code:

    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

      static enum SplitStrategyKind {
        HYBRID,
        BI,
        ETL
      }
    ...
    
        Context(Configuration conf) {
          this.conf = conf;
          minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
          maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
          String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
          if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
            splitStrategyKind = SplitStrategyKind.HYBRID;
          } else {
            LOG.info("Enforcing " + ss + " ORC split strategy");
            splitStrategyKind = SplitStrategyKind.valueOf(ss);
          }
    
    ...
            switch(context.splitStrategyKind) {
              case BI:
                // BI strategy requested through config
                splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
                    deltas, covered);
                break;
              case ETL:
                // ETL strategy requested through config
                splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
                    deltas, covered);
                break;
              default:
                // HYBRID strategy
                if (avgFileSize > context.maxSize) {
                  splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
                      covered);
                } else {
                  splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
                      covered);
                }
                break;
            }
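
    The two excerpts above (reading the setting in the Context constructor, then the switch that picks a strategy) can be condensed into a small standalone sketch. This is only an illustration, not Hive code: the class SplitStrategySketch, the enum Kind, and the methods parseKind and effectiveStrategy are hypothetical names, and the sizes used in main are made-up values.

```java
// Standalone sketch; every name here is a hypothetical stand-in, not a Hive class.
final class SplitStrategySketch {

    enum Kind { HYBRID, BI, ETL }

    /** Mirrors the Context constructor: null or "HYBRID" means HYBRID, anything else goes through valueOf. */
    static Kind parseKind(String configured) {
        if (configured == null || configured.equals(Kind.HYBRID.name())) {
            return Kind.HYBRID;
        }
        // Enum.valueOf is case-sensitive and throws IllegalArgumentException for unknown names.
        return Kind.valueOf(configured);
    }

    /** Mirrors the switch: returns the strategy that would actually generate the splits. */
    static Kind effectiveStrategy(Kind configured, long avgFileSize, long maxSize) {
        switch (configured) {
            case BI:
                return Kind.BI;
            case ETL:
                return Kind.ETL;
            default:
                // HYBRID: only files larger than maxSize on average go to ETL, the rest go to BI.
                return avgFileSize > maxSize ? Kind.ETL : Kind.BI;
        }
    }

    public static void main(String[] args) {
        long maxSize = 256L * 1024 * 1024;  // illustrative value, not taken from Hive
        long smallAvg = 10L * 1024 * 1024;  // illustrative average file size
        System.out.println(effectiveStrategy(parseKind(null), smallAvg, maxSize));
        System.out.println(effectiveStrategy(parseKind("ETL"), smallAvg, maxSize));
    }
}
```

    Under these assumptions, an unset strategy with small files ends up on the BI path, while an explicit ETL setting bypasses it regardless of file size. That would be consistent with the error showing up only for some queries: which path runs under HYBRID depends on the average file size of the directory being read.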

    org.apache.hadoop.hive.conf.HiveConf.ConfVars

        HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
            "This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
            " as opposed to query execution (split generation does not read or cache file footers)." +
            " ETL strategy is used when spending little more time in split generation is acceptable" +
            " (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
            " based on heuristics."),

    The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.

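    So hive.exec.orc.split.strategy defaults to HYBRID. Under HYBRID, when the following condition is not met

    if (avgFileSize > context.maxSize) {

    the else branch runs:

    splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
    covered);

    BISplitStrategy is the class that throws the exception. I have not yet looked closely at why it fails, but changing the setting avoids the problem: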

    set hive.exec.orc.split.strategy=ETL

    This works around the problem for now; to be continued.

  • Original post: https://www.cnblogs.com/barneywill/p/10142244.html