  • Spark fails with java.lang.RuntimeException: serious problem at OrcInputFormat.generateSplitsInfo when reading an empty ORC file

    Reproducing the problem:

    G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell
    2020-12-26 10:20:48 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://DESKTOP-01KN1P4:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608949256544).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show
    ++
    ||
    ++
    ++
    
    (in another window, create an empty file) touch /tmp/empty_orc/zero.orc
    
    scala> sql("select * from empty_orc").show
    
    java.lang.RuntimeException: serious problem
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
      at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:340)
      at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
      at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
      at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
      at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
      at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
      at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
      at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
      ... 49 elided
    Caused by: java.lang.NullPointerException
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
      ... 99 more

    The root cause is that reading an ORC table fails when the table directory contains an empty (zero-byte) file. The related bug reports are:

    SPARK-19809: NullPointerException on zero-size ORC file (https://issues.apache.org/jira/browse/SPARK-19809)

    SPARK-29773: Unable to process empty ORC files in Hive Table using Spark SQL (https://issues.apache.org/jira/browse/SPARK-29773)
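Since the failure is triggered by zero-byte files in the table directory, it can help to check for them before querying. The following is a minimal sketch, not part of the original post: `zeroByteFiles` is a hypothetical helper name, and it assumes the table location is on the local file system (as in the reproduction above). It uses only `java.nio.file`, so it does not cover HDFS or other Hadoop file systems.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

// Hypothetical helper: returns the regular files of size 0 located
// directly under a local directory (no recursion into subdirectories).
def zeroByteFiles(dir: String): Seq[Path] = {
  val stream = Files.list(Paths.get(dir))
  try {
    stream.iterator().asScala
      .filter(p => Files.isRegularFile(p) && Files.size(p) == 0L)
      .toList // materialize the result before the stream is closed
  } finally {
    stream.close()
  }
}

// Example call for the table location used above:
// zeroByteFiles("/tmp/empty_orc").foreach(println)
```

For the reproduction above, the call would be expected to list `zero.orc`, the file created with `touch`.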

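    Solution: set the parameter spark.sql.hive.convertMetastoreOrc=true, so that Spark reads the table with its built-in ORC support instead of the Hive SerDe path (OrcInputFormat) that throws the exception.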

    G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell --conf spark.sql.hive.convertMetastoreOrc=true
    2020-12-26 10:29:06 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://DESKTOP-01KN1P4:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608949754291).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sql("select * from empty_orc").show
    
    +---+
    |  a|
    +---+
    +---+
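The same setting can also be supplied from code rather than on the command line. The sketch below shows one way this might look in a standalone application. The application name and the `local[*]` master are illustrative assumptions, and it assumes the Spark SQL and Spark Hive modules are on the classpath and that the `empty_orc` table from above exists in the metastore.

```scala
import org.apache.spark.sql.SparkSession

// Build a Hive-enabled session with the ORC conversion switched on.
// "read-empty-orc" and local[*] are placeholder values for illustration.
val spark = SparkSession.builder()
  .appName("read-empty-orc")
  .master("local[*]")
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .enableHiveSupport()
  .getOrCreate()

// Same query as in the shell session above.
spark.sql("select * from empty_orc").show()
```

In an already running session, `spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")` should have the same effect, on the assumption that this option is read as a runtime SQL configuration.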

    The Spark documentation describes this as follows:

    ORC Files

    Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true. For the Hive ORC serde tables (e.g., the ones created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is also set to true.

    https://spark.apache.org/docs/2.3.3/sql-programming-guide.html#orc-files

  • Original article: https://www.cnblogs.com/flowerbirds/p/14191707.html