  • ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs:...

    0: jdbc:hive2://master01.hadoop.dtmobile.cn:1> select * from cell_random_grid_tmp2 limit 1;
    INFO : Compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
    INFO : Semantic Analysis Completed
    INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:grid_row_id, type:int, comment:null), FieldSchema(name:grid_col_id, type:int, comment:null), FieldSchema(name:google_gri, type:int, comment:null), FieldSchema(name:google_gci, type:int, comment:null), FieldSchema(name:user_lon, type:double, comment:null), FieldSchema(name:user_lat, type:double, comment:null), FieldSchema(name:grid_type, type:int, comment:null), FieldSchema(name:grid_height, type:int, comment:null), FieldSchema(name:compute_region_name, type:string, comment:null), FieldSchema(name:antenna_0, type:string, comment:null), FieldSchema(name:antenna_1, type:string, comment:null), FieldSchema(name:antenna_2, type:string, comment:null), FieldSchema(name:antenna_3, type:string, comment:null), FieldSchema(name:antenna_4, type:string, comment:null), FieldSchema(name:antenna_5, type:string, comment:null), FieldSchema(name:antenna_6, type:string, comment:null), FieldSchema(name:scene, type:string, comment:null), FieldSchema(name:base_lon, type:double, comment:null), FieldSchema(name:base_lat, type:double, comment:null), FieldSchema(name:ssb_send_power, type:double, comment:null), FieldSchema(name:base_h_angle, type:double, comment:null), FieldSchema(name:antenna_height, type:double, comment:null), FieldSchema(name:m_vertical_angle, type:double, comment:null), FieldSchema(name:h_beam_precision, type:int, comment:null), FieldSchema(name:v_beam_precision, type:int, comment:null), FieldSchema(name:simu_spectrum, type:decimal(2,1), comment:null)], properties:null)
    INFO : Completed compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.045 seconds
    INFO : Executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
    INFO : Completed executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.001 seconds
    INFO : OK
    Error: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet (state=,code=0)
    0: jdbc:hive2://master01.hadoop.dtmobile.cn:1>

    The data was written to Hive with Spark 2.3 via Spark SQL's saveAsTable(). When Spark SQL writes to Hive this way, it saves the data as Parquet with Snappy compression by default. After the write completed, querying the table through Hive beeline failed with the error above, while the same query through Spark worked fine.
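
    For reference, the failing write can be reproduced with a sketch along the following lines. This is only an illustration, not the original job: a SparkSession named spark is assumed, and only two of the columns from the schema above are kept (the decimal(2,1) column simu_spectrum is the one whose encoding causes the trouble):

      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types._

      // Cut-down schema: an int column plus the decimal(2,1) column whose Parquet
      // encoding differs between the standard and the legacy (Hive) convention.
      val schema = StructType(Seq(
        StructField("grid_row_id", IntegerType),
        StructField("simu_spectrum", DecimalType(2, 1))
      ))

      val rows = Seq(Row(1, new java.math.BigDecimal("1.5")))
      val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

      // Spark 2.3 default: standard Parquet format + Snappy. The decimal(2,1) column
      // is written as int32, which the Hive Parquet reader here cannot decode.
      df.write.mode("overwrite").saveAsTable("capacity.cell_random_grid_tmp2")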

    The same problem was found on Stack Overflow:

    The root cause is as follows:

    This issue is caused by the different Parquet conventions used by Hive and Spark. In Hive, the decimal datatype is represented as fixed-length bytes (FIXED_LEN_BYTE_ARRAY). In Spark 1.4 and later, the default convention is the standard Parquet representation for the decimal datatype, in which the underlying physical type changes with the precision of the column.
    e.g. DECIMAL can be used to annotate the following types: int32 for 1 <= precision <= 9; int64 for 1 <= precision <= 18 (precision < 10 will produce a warning); fixed-length byte arrays for larger precisions.

    Hence the issue only occurs for datatypes that have different representations in the two Parquet conventions. When the representations happen to coincide, there is no problem; but decimal(2,1), as in the table above, is written as int32 under the standard convention while Hive expects fixed-length bytes, hence the decoding error. If you are not aware of the internal representation of the datatypes, it is safest to read with the same convention that was used for writing. With Hive you do not have the flexibility to choose the Parquet convention, but with Spark you do.

    Solution: the convention Spark uses to write Parquet data is configurable. It is determined by the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If set to "true", Spark will use the same convention as Hive for writing Parquet data, which resolves the issue.

    So we tried setting the parameter spark.sql.parquet.writeLegacyFormat = true, and the problem was solved.
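
    In practice this just means flipping the flag before the write, either with --conf spark.sql.parquet.writeLegacyFormat=true on spark-submit or at runtime as in the minimal sketch below (assuming the same SparkSession spark and DataFrame df as in the sketch above):

      // Write Parquet in the legacy, Hive-compatible format: decimal columns are then
      // stored as fixed_len_byte_array instead of int32/int64, so Hive can decode them.
      spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

      df.write.mode("overwrite").saveAsTable("capacity.cell_random_grid_tmp2")

      // After rewriting the table, the same "select * ... limit 1" from beeline succeeds.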

    Looking this parameter (spark.sql.parquet.writeLegacyFormat) up in the Spark 2.3 source code:

    In package org.apache.spark.sql.internal, the Spark SQL default configuration file SQLConf.scala describes it as follows:

      val PARQUET_WRITE_LEGACY_FORMAT = buildConf("spark.sql.parquet.writeLegacyFormat")
        .doc("Whether to be compatible with the legacy Parquet format adopted by Spark 1.4 and prior " +
          "versions, when converting Parquet schema to Spark SQL schema and vice versa.")
        .booleanConf
        .createWithDefault(false)

    As shown, the default value is false.

    In package org.apache.spark.sql.execution.datasources.parquet, ParquetWriteSupport.scala describes the two write modes as follows:

    /**
     * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
     * messages.  This class can write Parquet data in two modes:
     *
     *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
     *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
     *
     * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`.  The value
     * of this option is propagated to this class by the `init()` method and its Hadoop configuration
     * argument.
     */
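
    To see the difference concretely, the physical schema of a written file can be dumped with the parquet-mr API bundled with Spark. This is only an illustrative sketch; the file path is a placeholder, and readFooter is the (deprecated but still present) footer-reading API in the Parquet version shipped with Spark 2.3:

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.parquet.format.converter.ParquetMetadataConverter
      import org.apache.parquet.hadoop.ParquetFileReader

      // Print the physical Parquet schema of one output file. With writeLegacyFormat=false
      // the decimal(2,1) column appears as "optional int32 simu_spectrum (DECIMAL(2,1))";
      // with writeLegacyFormat=true it becomes a fixed_len_byte_array, which Hive can read.
      val footer = ParquetFileReader.readFooter(
        new Configuration(),
        new Path("hdfs:///user/hive/warehouse/capacity.db/cell_random_grid_tmp2/<part-file>.snappy.parquet"),
        ParquetMetadataConverter.NO_FILTER)
      println(footer.getFileMetaData.getSchema)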