zoukankan      html  css  js  c++  java
  • 【spark】spark2升级到spark3,spark3中的包变动记录

    背景:

    spark3新增动态裁剪。现尝试将spark2升级到spark3

    当前版本:spark 2.4.1,scala 2.11.12

    目标版本:spark 3.1.1,  scala 2.12.13

    异常记录:

    • 异常1
    java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport

    出问题的包

     <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
        <version>2.4.1</version>
    </dependency>

    修正后

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>

    异常原因:

    spark3.0中的org.apache.spark.sql.sources.DataSourceRegister中serviceLoader加载的类为

    org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2
    org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
    org.apache.spark.sql.execution.datasources.v2.json.JsonDataSourceV2
    org.apache.spark.sql.execution.datasources.noop.NoopDataSource
    org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
    org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2
    org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
    org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
    org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
    org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider
    org.apache.spark.sql.execution.datasources.binaryfile.BinaryFileFormat

    对比之前spark2中

    org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
    org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
    org.apache.spark.sql.execution.datasources.json.JsonFileFormat
    org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
    org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
    org.apache.spark.sql.execution.datasources.text.TextFileFormat
    org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
    org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
    org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider

    发现部分的Source已发生改变。追踪下来 org/apache/spark/sql/sources 下的v2包都没了

    spark2中的KafkaSourceProvider

    private[kafka010] class KafkaSourceProvider extends DataSourceRegister
        with StreamSourceProvider
        with StreamSinkProvider
        with RelationProvider
        with CreatableRelationProvider
        with StreamWriteSupport
        with ContinuousReadSupport
        with MicroBatchReadSupport
        with Logging {
      import KafkaSourceProvider._

    spark3中的KafkaSourceProvider

    private[kafka010] class KafkaSourceProvider extends DataSourceRegister
        with StreamSourceProvider
        with StreamSinkProvider
        with RelationProvider
        with CreatableRelationProvider
        with SimpleTableProvider
        with Logging {
      import KafkaSourceProvider._

    • 异常2

    目前vertica提供的spark暂不支持3.0,需要通过jdbc方式重新实现一版

    • 异常3
    java.lang.String cannot be cast to java.time.ZonedDateTime

    异常源:

    <dependency>
        <groupId>com.github.housepower</groupId>
        <artifactId>clickhouse-integration-spark_2.12</artifactId>
        <version>2.5.4</version>
    </dependency>

    建表语句:

    create table default.zwy_test (time DateTime,AMP Float64,NOZP Int32,value Int32,reason String ) ENGINE = MergeTree order by time

    写入数据的schema:

    root
     |-- time: string (nullable = true)
     |-- AMP: double (nullable = true)
     |-- NOZP: integer (nullable = true)
     |-- value: integer (nullable = true)
     |-- reason: string (nullable = true)

    异常原因:

    在Spark 3.0中,将值插入具有不同数据类型的表列中时,将根据ANSI SQL标准执行类型强制转换。标准SQL的转换规则参考,其中String转日期已经不属于隐式转换,而且spark2中String会自动转换为日期类型。因此spark2升级到spark3中,需要对String类型通过from_utc_timestamp等函数显式地转换

    • 变动1

      jdbc spark3增加keytab,principal参数,支持kerberos了

    spark2到3的变更记录 https://spark.apache.org/docs/3.0.0/core-migration-guide.html

  • 相关阅读:
    Python包中__init__.py作用
    获取web页面xpath
    Selenium学习(Python)
    C++构造函数的选择
    分布式实时处理系统——C++高性能编程
    构建之法(邹欣)
    分布式实时处理系统——通信基础
    go语言-csp模型-并发通道
    redis.conf 配置说明
    Linux fork()一个进程内核态的变化
  • 原文地址:https://www.cnblogs.com/zhouwenyang/p/14654552.html
Copyright © 2011-2022 走看看