  • [Original] Troubleshooting notes (2): Spark jobs intermittently fail with java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT

    Recently, Spark jobs submitted in yarn-cluster mode have been failing intermittently, roughly 40% of the time, with the following error:

    18/03/15 21:50:36 116 ERROR ApplicationMaster91: User class threw exception: org.apache.spark.sql.AnalysisException: java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT;
    org.apache.spark.sql.AnalysisException: java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT;
             at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
             at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
             at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
             at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
             at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
             at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
             at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
             at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
             at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
             at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
             at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
             at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
             at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
             at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
             at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
             at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
             at scala.util.control.Breaks.breakable(Breaks.scala:38)
             at app.package.APPClass$.main(APPClass.scala:177)
             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
             at java.lang.reflect.Method.invoke(Method.java:497)
             at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
    Caused by: java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT
             at org.apache.hadoop.hive.ql.metadata.Hive.trashFilesUnderDir(Hive.java:1389)
             at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2873)
             at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1621)
             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
             at java.lang.reflect.Method.invoke(Method.java:497)
             at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:728)
             at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:676)
             at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:676)
             at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:676)
             at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
             at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
             at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
             at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
             at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:675)
             at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:768)
             at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:766)
             at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:766)
             at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
             ... 25 more

    The rough flow: when executing InsertIntoHiveTable, Spark SQL calls loadTable, which ultimately invokes Hive's loadTable method via reflection:

    1. org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
    2. org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
    3. org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:728)
    4. java.lang.reflect.Method.invoke(Method.java:497)
    5. org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1621)
    6. org.apache.hadoop.hive.ql.metadata.Hive.trashFilesUnderDir(Hive.java:1389)

    Step 6 is where java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT is thrown.
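
    As context for steps 3 through 5: anything thrown by a method invoked through java.lang.reflect.Method.invoke arrives wrapped in an InvocationTargetException, and callers unwrap it with getCause(), which is how the NoSuchFieldError surfaces through the reflective frames in the trace above. A minimal, self-contained sketch of that dispatch pattern (FakeHive is a hypothetical stand-in for illustration, not Spark's actual shim code):

        import java.lang.reflect.InvocationTargetException;
        import java.lang.reflect.Method;

        class FakeHive {
            public void loadTable() {
                // Simulate Hive.trashFilesUnderDir touching a field the runtime jar lacks:
                throw new NoSuchFieldError("HIVE_MOVE_FILES_THREAD_COUNT");
            }
        }

        public class ReflectiveDispatchDemo {
            public static void main(String[] args) throws Exception {
                Method loadTable = FakeHive.class.getMethod("loadTable");
                try {
                    loadTable.invoke(new FakeHive());
                } catch (InvocationTargetException e) {
                    // Method.invoke wraps whatever the target throws, errors included;
                    // the cause here is the NoSuchFieldError itself.
                    System.out.println("cause: " + e.getCause());
                }
            }
        }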

    This error is commonly blamed on a missing setting in hive-site.xml:

      <property>
        <name>hive.mv.files.thread</name>
        <value>15</value>
      </property>

    But reading the code shows that Spark 2.1.1 depends on Hive 1.2.1, and hive.mv.files.thread does not exist in Hive 1.2.1 at all; the setting first appeared in Hive 2. Furthermore, the failing class, org.apache.hadoop.hive.ql.metadata.Hive, differs completely between Hive 1.2.1 and Hive 2 in the relevant code. In detail:

    In Hive 1.2.1 (where trashFilesUnderDir is a method of the FileUtils class):

            if (FileUtils.isSubDir(oldPath, destf, fs2)) {
              FileUtils.trashFilesUnderDir(fs2, oldPath, conf);
            }

    In Hive 2 (where trashFilesUnderDir is a method of the Hive class itself):

      private boolean trashFilesUnderDir(final FileSystem fs, Path f, final Configuration conf)
          throws IOException {
        FileStatus[] statuses = fs.listStatus(f, FileUtils.HIDDEN_FILES_PATH_FILTER);
        boolean result = true;
        final List<Future<Boolean>> futures = new LinkedList<>();
        final ExecutorService pool = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25) > 0 ?
            Executors.newFixedThreadPool(conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25),
            new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Delete-Thread-%d").build()) : null;
        // (method continues)
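
    Since the two Hive generations differ exactly in this field, a quick way to tell which ConfVars class a given classpath provides is to probe for the field reflectively; unlike a direct reference, a reflective lookup reports absence as a catchable NoSuchFieldException. A sketch of my own (not from the original diagnosis), assuming the Hive jars in question are on the classpath:

        // Probe for the Hive 2-only field on whatever ConfVars class the classpath provides.
        public class ConfVarsProbe {
            public static void main(String[] args) throws Exception {
                Class<?> confVars = Class.forName("org.apache.hadoop.hive.conf.HiveConf$ConfVars");
                try {
                    confVars.getField("HIVE_MOVE_FILES_THREAD_COUNT");
                    System.out.println("field present: this is a Hive 2.x ConfVars");
                } catch (NoSuchFieldException e) {
                    System.out.println("field absent: this is a Hive 1.2.1 ConfVars");
                }
            }
        }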

    So the error in step 6 means the Hive 2 version of this code was executing. Two initial hypotheses:

    1) Jar pollution: the JVM classpath contains both Hive 1 and Hive 2 jars, so class loading sometimes picks the Hive 1 jar (runs fine) and sometimes the Hive 2 jar (fails);

    2) Server configuration drift: some servers in the cluster have no Hive 2 jar on the classpath (runs fine) while others do (may fail).

    Comparing environment configuration and launch commands between healthy and failing servers showed them to be identical, with no Hive 2 jar anywhere.

    Launching the job with -verbose:class showed that in both the healthy and the failing case the Hive class was loaded from the Hive 1.2.1 jar:

    [Loaded org.apache.hadoop.hive.ql.metadata.Hive from file:/export/Data/tmp/hadoop-tmp/nm-local-dir/filecache/98/hive-exec-1.2.1.spark2.jar]

    This ruled out both hypotheses.
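
    The same fact -verbose:class prints can also be confirmed from inside the application: every loaded class can report the jar it came from. A minimal sketch, assuming the class resolves on the current classpath:

        // Print the jar org.apache.hadoop.hive.ql.metadata.Hive was actually loaded from,
        // the programmatic equivalent of the [Loaded ... from ...] line above.
        public class ClassOriginCheck {
            public static void main(String[] args) throws Exception {
                Class<?> hive = Class.forName("org.apache.hadoop.hive.ql.metadata.Hive");
                System.out.println(hive.getProtectionDomain().getCodeSource().getLocation());
            }
        }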

    Examining the submit command revealed that spark.yarn.jars was in use (to avoid re-uploading Spark's jars on every submission); these jars are cached on each nodemanager as a filecache under yarn.nodemanager.local-dirs.

    Decompiling hive-exec-1.2.1.spark2.jar from the filecache of a healthy server and of a failing server finally exposed the problem.

    On healthy servers the Hive class reads:

        if (FileUtils.isSubDir(oldPath, destf, fs2))
            FileUtils.trashFilesUnderDir(fs2, oldPath, conf);

    On failing servers it reads:

        private static boolean trashFilesUnderDir(final FileSystem fs, Path f, final Configuration conf) throws IOException {
            FileStatus[] statuses = fs.listStatus(f, FileUtils.HIDDEN_FILES_PATH_FILTER);
            boolean result = true;
            List<Future<Boolean>> futures = new LinkedList();
            ExecutorService pool = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25) > 0
                ? Executors.newFixedThreadPool(conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25),
                    (new ThreadFactoryBuilder()).setDaemon(true).setNameFormat("Delete-Thread-%d").build())
                : null;
            // (decompiled output continues)

    The Hive class on the failing servers references ConfVars.HIVE_MOVE_FILES_THREAD_COUNT, but the ConfVars class in hive-common-1.2.1.jar has no such field, hence java.lang.NoSuchFieldError.
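
    This is the textbook trigger for NoSuchFieldError: bytecode compiled against a class that declares a field, executed against a version of that class without it. A minimal sketch with a hypothetical ConfVars class (not Hive's real one); note the field must not be a compile-time constant, or javac would inline its value and mask the error:

        // ConfVars.java, version A (compile Main against this):
        //     class ConfVars { static int HIVE_MOVE_FILES_THREAD_COUNT = 15; }
        // ConfVars.java, version B (place this on the runtime classpath instead):
        //     class ConfVars { }
        // Running Main against version B fails with:
        //     java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT
        public class Main {
            public static void main(String[] args) {
                System.out.println(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT);
            }
        }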

    So the sequence of events must have been: the hive-exec-1.2.1.spark2.jar on HDFS was originally correct, and every nodemanager at the time downloaded it into its local filecache; the jar on HDFS was later replaced with a broken build (Spark compiled against Hive 2), so nodemanagers added afterwards cached the broken jar. The net result: the job succeeds on some servers and fails on others.
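
    Short of decompiling, a lighter way to confirm that two nodemanagers hold different copies of the same filecache jar is to compare checksums. A sketch of my own (not part of the original diagnosis); the path argument would be the filecache location printed by -verbose:class above:

        import java.io.InputStream;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.security.DigestInputStream;
        import java.security.MessageDigest;

        // Print the MD5 of the jar given as args[0]; differing digests across
        // servers confirm the filecache copies have diverged.
        public class JarChecksum {
            public static void main(String[] args) throws Exception {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                try (InputStream in = new DigestInputStream(
                        Files.newInputStream(Paths.get(args[0])), md5)) {
                    byte[] buf = new byte[8192];
                    while (in.read(buf) != -1) { /* digest updates as we read */ }
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : md5.digest()) hex.append(String.format("%02x", b));
                System.out.println(hex);
            }
        }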

    Filecache cleanup in YARN is governed by two settings:

    yarn.nodemanager.localizer.cache.cleanup.interval-ms (default 600000): interval between cache cleanups.

    yarn.nodemanager.localizer.cache.target-size-mb (default 10240): target size of the localizer cache in MB, per local directory.

    Every cleanup.interval-ms the local filecache size is checked against target-size-mb; entries are evicted only if the cache exceeds the target, otherwise cached files are reused indefinitely, which is why the stale jars were never refreshed.

  • Original article: https://www.cnblogs.com/barneywill/p/9896294.html