  • Spark fails with java.lang.RuntimeException: serious problem at OrcInputFormat.generateSplitsInfo when reading an empty ORC file

    Reproducing the problem:

    G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell
    2020-12-26 10:20:48 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://DESKTOP-01KN1P4:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608949256544).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show
    ++
    ||
    ++
    ++
    
    (In another window, create an empty file) touch /tmp/empty_orc/zero.orc
    
    scala> sql("select * from empty_orc").show
    
    java.lang.RuntimeException: serious problem
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
      at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:340)
      at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
      at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
      at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
      at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
      at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
      at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
      at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
      ... 49 elided
    Caused by: java.lang.NullPointerException
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
      ... 99 more

    The root cause is that reading an ORC table fails when the table directory contains an empty (zero-byte) file. The bug is tracked in:

    SPARK-19809: NullPointerException on zero-size ORC file (https://issues.apache.org/jira/browse/SPARK-19809)

    SPARK-29773: Unable to process empty ORC files in Hive Table using Spark SQL (https://issues.apache.org/jira/browse/SPARK-29773)
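
    To confirm that zero-byte files are actually present under a table location, a quick check can be run from the same spark-shell session. This is only a minimal sketch: it assumes the `spark` session that spark-shell provides and the /tmp/empty_orc location from the reproduction above, and the helper name `zeroLengthFiles` is made up for illustration. It looks only at files directly under the directory, so a partitioned table would need a recursive listing.

    ```scala
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical helper: return the zero-length regular files directly under `dir`.
    def zeroLengthFiles(fs: FileSystem, dir: Path): Seq[Path] =
      fs.listStatus(dir).toSeq
        .filter(status => status.isFile && status.getLen == 0L)
        .map(_.getPath)

    // `spark` is the session created by spark-shell; the path is the table
    // location used in the reproduction (an assumption for any other setup).
    val tableDir = new Path("/tmp/empty_orc")
    val fs = tableDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

    if (fs.exists(tableDir)) {
      zeroLengthFiles(fs, tableDir).foreach(p => println(s"zero-length file: $p"))
    } else {
      println(s"$tableDir does not exist")
    }
    ```

    Any path printed here is a candidate for triggering the NullPointerException when the table is read through the Hive ORC input format.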

    Solution: set the parameter spark.sql.hive.convertMetastoreOrc=true

    G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell --conf spark.sql.hive.convertMetastoreOrc=true
    2020-12-26 10:29:06 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://DESKTOP-01KN1P4:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608949754291).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sql("select * from empty_orc").show
    
    +---+
    |  a|
    +---+
    +---+
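
    The setting can also be applied inside a session that is already running, instead of on the command line. A minimal sketch, assuming the `spark` session provided by spark-shell and the `empty_orc` table created earlier:

    ```scala
    // Read Hive ORC serde tables through Spark's own ORC data source instead of
    // the Hive OrcInputFormat that appears in the stack trace above.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

    // The same setting expressed as a SQL statement:
    // spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")

    // Re-run the query that previously failed.
    spark.sql("select * from empty_orc").show()
    ```

    In a standalone application the same key could be passed through `SparkSession.builder().config(...)` before the session is created.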

    The Spark documentation describes this as follows:

    ORC Files

    Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true. For the Hive ORC serde tables (e.g., the ones created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is also set to true.

    https://spark.apache.org/docs/2.3.3/sql-programming-guide.html#orc-files
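
    The three settings named in the quoted passage can be set together and then read back to see what the session is using. This is a sketch under the assumption that it runs in spark-shell on Spark 2.3 or later, where `spark` is already defined; whether the vectorized reader is then actually used for a given table still depends on how that table was created, as the passage explains.

    ```scala
    // Configuration keys taken from the documentation passage quoted above.
    val orcSettings = Seq(
      "spark.sql.orc.impl" -> "native",                 // use the native ORC implementation
      "spark.sql.orc.enableVectorizedReader" -> "true", // enable the vectorized reader
      "spark.sql.hive.convertMetastoreOrc" -> "true"    // also apply it to Hive ORC serde tables
    )

    // Apply each setting to the current session.
    orcSettings.foreach { case (key, value) => spark.conf.set(key, value) }

    // Print the effective values for confirmation.
    orcSettings.foreach { case (key, _) => println(s"$key = ${spark.conf.get(key)}") }
    ```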

  • Original article: https://www.cnblogs.com/flowerbirds/p/14191707.html