  • Spark fails with "java.lang.RuntimeException: serious problem" at OrcInputFormat.generateSplitsInfo when reading an empty ORC file

    Reproducing the problem:

    G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell
    2020-12-26 10:20:48 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://DESKTOP-01KN1P4:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608949256544).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show
    ++
    ||
    ++
    ++
    
    (in another terminal, create an empty file) touch /tmp/empty_orc/zero.orc
    
    scala> sql("select * from empty_orc").show
    
    java.lang.RuntimeException: serious problem
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
      at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
      at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:340)
      at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
      at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
      at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
      at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
      at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
      at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
      at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
      at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
      at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
      ... 49 elided
    Caused by: java.lang.NullPointerException
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
      at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
      ... 99 more

    The root cause: when Spark reads an ORC table whose directory contains an empty (zero-byte) file, split generation fails. The bug is tracked in:

    SPARK-19809: NullPointerException on zero-size ORC file (https://issues.apache.org/jira/browse/SPARK-19809)

    SPARK-29773: Unable to process empty ORC files in Hive Table using Spark SQL (https://issues.apache.org/jira/browse/SPARK-29773)
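
    Besides the configuration fix shown below, the most direct workaround is to remove the zero-byte files themselves. Here is a minimal sketch of my own (not taken from the tickets), using the standard Hadoop FileSystem API and assuming the /tmp/empty_orc location from the reproduction above:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // List the table directory and delete every zero-length file.
    val tableDir = new Path("/tmp/empty_orc")  // path from the reproduction above
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.listStatus(tableDir)
      .filter(s => s.isFile && s.getLen == 0)
      .foreach { s =>
        println(s"deleting zero-byte file: ${s.getPath}")
        fs.delete(s.getPath, false)  // false = non-recursive delete
      }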

    The fix: set spark.sql.hive.convertMetastoreOrc=true.

    G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell --conf spark.sql.hive.convertMetastoreOrc=true
    2020-12-26 10:29:06 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://DESKTOP-01KN1P4:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608949754291).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sql("select * from empty_orc").show
    
    +---+
    |  a|
    +---+
    +---+
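
    The same flag can also be baked into a standalone application instead of being passed on the command line. A minimal sketch, assuming a Hive-enabled session (the app name is hypothetical):

    import org.apache.spark.sql.SparkSession

    // Equivalent to spark-shell --conf spark.sql.hive.convertMetastoreOrc=true:
    // set the flag when the session is built.
    val spark = SparkSession.builder()
      .appName("empty-orc-demo")  // hypothetical app name
      .enableHiveSupport()
      .config("spark.sql.hive.convertMetastoreOrc", "true")
      .getOrCreate()

    spark.sql("select * from empty_orc").show()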

    The Spark documentation describes this behavior as follows:

    ORC Files

    Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true. For the Hive ORC serde tables (e.g., the ones created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is also set to true.

    https://spark.apache.org/docs/2.3.3/sql-programming-guide.html#orc-files
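
    As a sketch of what the excerpt describes (my own example, not from the docs): in Spark 2.3 spark.sql.orc.impl still defaults to hive, so the native, vectorized path has to be switched on explicitly before reading. The path below is the one from the reproduction:

    // Read the same directory through the native ORC data source, which
    // bypasses org.apache.hadoop.hive.ql.io.orc.OrcInputFormat entirely.
    spark.conf.set("spark.sql.orc.impl", "native")
    spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

    spark.read.orc("/tmp/empty_orc").show()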

  • Original article: https://www.cnblogs.com/flowerbirds/p/14191707.html