ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs:...

    0: jdbc:hive2://master01.hadoop.dtmobile.cn:1> select * from cell_random_grid_tmp2 limit 1;
    INFO : Compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
    INFO : Semantic Analysis Completed
    INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:grid_row_id, type:int, comment:null), FieldSchema(name:grid_col_id, type:int, comment:null), FieldSchema(name:google_gri, type:int, comment:null), FieldSchema(name:google_gci, type:int, comment:null), FieldSchema(name:user_lon, type:double, comment:null), FieldSchema(name:user_lat, type:double, comment:null), FieldSchema(name:grid_type, type:int, comment:null), FieldSchema(name:grid_height, type:int, comment:null), FieldSchema(name:compute_region_name, type:string, comment:null), FieldSchema(name:antenna_0, type:string, comment:null), FieldSchema(name:antenna_1, type:string, comment:null), FieldSchema(name:antenna_2, type:string, comment:null), FieldSchema(name:antenna_3, type:string, comment:null), FieldSchema(name:antenna_4, type:string, comment:null), FieldSchema(name:antenna_5, type:string, comment:null), FieldSchema(name:antenna_6, type:string, comment:null), FieldSchema(name:scene, type:string, comment:null), FieldSchema(name:base_lon, type:double, comment:null), FieldSchema(name:base_lat, type:double, comment:null), FieldSchema(name:ssb_send_power, type:double, comment:null), FieldSchema(name:base_h_angle, type:double, comment:null), FieldSchema(name:antenna_height, type:double, comment:null), FieldSchema(name:m_vertical_angle, type:double, comment:null), FieldSchema(name:h_beam_precision, type:int, comment:null), FieldSchema(name:v_beam_precision, type:int, comment:null), FieldSchema(name:simu_spectrum, type:decimal(2,1), comment:null)], properties:null)
    INFO : Completed compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.045 seconds
    INFO : Executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
    INFO : Completed executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.001 seconds
    INFO : OK
    Error: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet (state=,code=0)
    0: jdbc:hive2://master01.hadoop.dtmobile.cn:1>

    The data was written to Hive with Spark 2.3 via Spark SQL's saveAsTable(); when Spark SQL writes to Hive this way, it saves the data as Parquet with Snappy compression by default. After the write finished, querying the table through Hive beeline failed with the error above, but the same query through Spark worked fine.
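    The write path looks roughly like the sketch below. It is a minimal sketch, not the original job: only the simu_spectrum decimal(2,1) column from the schema above is kept, and the DataFrame contents are made up.

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.col
      import org.apache.spark.sql.types.DecimalType

      val spark = SparkSession.builder()
        .appName("parquet-decimal-repro")
        .enableHiveSupport()
        .getOrCreate()

      // With spark.sql.parquet.writeLegacyFormat left at its default (false), the
      // decimal(2,1) column is written in the standard Parquet representation (int32).
      val df = spark.range(10)
        .withColumn("simu_spectrum", (col("id") % 10 / 10.0).cast(DecimalType(2, 1)))
        .drop("id")

      // saveAsTable() stores the data as Parquet with Snappy compression by default;
      // Hive beeline later fails to decode the file, while Spark reads it fine.
      df.write.mode("overwrite").saveAsTable("capacity.cell_random_grid_tmp2")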

    The same problem turns up on Stack Overflow:

    The root cause is as follows:

    This issue is caused by the different Parquet conventions used in Hive and Spark. In Hive, the decimal datatype is stored as a fixed-length byte array. In Spark 1.4 and later, the default convention is the standard Parquet representation for the decimal data type, in which the underlying physical type changes with the precision of the column.
    e.g. DECIMAL can be used to annotate the following types: int32 for 1 <= precision <= 9; int64 for 1 <= precision <= 18 (storing a precision < 10 as int64 will produce a warning).

    Hence this issue only occurs for data types that have different representations in the two Parquet conventions. The simu_spectrum column here is decimal(2,1): the standard convention stores it as int32, while the Hive/legacy convention expects a fixed-length byte array, so Hive's reader fails on the Spark-written files. If you are not aware of the internal representation of a datatype, it is safest to read with the same convention that was used for writing. With Hive you cannot choose the Parquet convention, but with Spark you can.
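    One way to confirm which representation a given file actually uses is to read its footer with the parquet-mr classes that ship with Spark. A sketch, assuming the file from the error message above (only the printed schema matters here):

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.parquet.format.converter.ParquetMetadataConverter
      import org.apache.parquet.hadoop.ParquetFileReader

      // Print the Parquet schema stored in the file footer. A legacy/Hive-style file
      // shows the decimal column as fixed_len_byte_array(1) (DECIMAL(2,1)); a
      // standard-format file written by Spark shows it as int32 (DECIMAL(2,1)).
      val path = new Path("hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/" +
        "capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet")
      val footer = ParquetFileReader.readFooter(new Configuration(), path, ParquetMetadataConverter.NO_FILTER)
      println(footer.getFileMetaData.getSchema)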

    Solution: the convention Spark uses to write Parquet data is configurable, via the property spark.sql.parquet.writeLegacyFormat. Its default value is false. If it is set to true, Spark writes Parquet data with the same convention as Hive, which resolves the issue.

    So I tried setting spark.sql.parquet.writeLegacyFormat = true, and the problem was solved.
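    A minimal sketch of the change, reusing the SparkSession and DataFrame from the reproduction sketch above; note that the flag only affects new writes, so the data has to be rewritten for Hive to pick up the legacy layout:

      // Switch Spark's Parquet writer to the Hive-compatible (legacy) layout. The option
      // is a runtime SQL conf, so it can also be passed to spark-submit with
      // --conf spark.sql.parquet.writeLegacyFormat=true
      spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

      // Existing files keep the layout they were written with, so rewrite the table.
      df.write.mode("overwrite").saveAsTable("capacity.cell_random_grid_tmp2")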

    Searching the Spark 2.3 source code for this parameter (spark.sql.parquet.writeLegacyFormat):

    In package org.apache.spark.sql.internal, the Spark SQL default configuration file SQLConf.scala describes it as follows:

      val PARQUET_WRITE_LEGACY_FORMAT = buildConf("spark.sql.parquet.writeLegacyFormat")
        .doc("Whether to be compatible with the legacy Parquet format adopted by Spark 1.4 and prior " +
          "versions, when converting Parquet schema to Spark SQL schema and vice versa.")
        .booleanConf
        .createWithDefault(false)

    You can see that the default value is false.

    In package org.apache.spark.sql.execution.datasources.parquet, the class comment of ParquetWriteSupport.scala reads:

    /**
     * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
     * messages.  This class can write Parquet data in two modes:
     *
     *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
     *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
     *
     * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`.  The value
     * of this option is propagated to this class by the `init()` method and its Hadoop configuration
     * argument.
     */
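    The comment above confirms that both layouts live in the same writer and are toggled only by the SQL option. Since Spark can read either layout, an already-written standard-format table can also be repaired by rewriting it from Spark with the flag switched on; a sketch, where the _legacy table name is made up for illustration:

      // Spark reads both the standard and the legacy layout, so it can copy the
      // problematic table into one that Hive's Parquet reader accepts.
      spark.sql("SET spark.sql.parquet.writeLegacyFormat=true")
      spark.sql(
        """CREATE TABLE capacity.cell_random_grid_tmp2_legacy
          |USING parquet
          |AS SELECT * FROM capacity.cell_random_grid_tmp2""".stripMargin)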

    Original post: https://www.cnblogs.com/zz-ksw/p/11458121.html