zoukankan html css js c++ java

【spark】spark2升级到spark3，spark3中的包变动记录

背景:

spark3新增动态裁剪。现尝试将spark2升级到spark3

当前版本：spark 2.4.1，scala 2.11.12

目标版本：spark 3.1.1, scala 2.12.13

异常记录:

异常1

java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport

出问题的包

 <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
    <version>2.4.1</version>
</dependency>

修正后

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
    <version>3.0.0</version>
</dependency>

异常原因:

spark3.0中的org.apache.spark.sql.sources.DataSourceRegister中serviceLoader加载的类为

org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.v2.json.JsonDataSourceV2
org.apache.spark.sql.execution.datasources.noop.NoopDataSource
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2
org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider
org.apache.spark.sql.execution.datasources.binaryfile.BinaryFileFormat

对比之前spark2中

org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.sources.RateStreamProvider
org.apache.spark.sql.execution.streaming.sources.TextSocketSourceProvider

发现部分的Source已发生改变。追踪下来 org/apache/spark/sql/sources 下的v2包都没了

spark2中的KafkaSourceProvider

private[kafka010] class KafkaSourceProvider extends DataSourceRegister
    with StreamSourceProvider
    with StreamSinkProvider
    with RelationProvider
    with CreatableRelationProvider
    with StreamWriteSupport
    with ContinuousReadSupport
    with MicroBatchReadSupport
    with Logging {
  import KafkaSourceProvider._

spark3中的KafkaSourceProvider

private[kafka010] class KafkaSourceProvider extends DataSourceRegister
    with StreamSourceProvider
    with StreamSinkProvider
    with RelationProvider
    with CreatableRelationProvider
    with SimpleTableProvider
    with Logging {
  import KafkaSourceProvider._

异常2

目前vertica提供的spark暂不支持3.0，需要通过jdbc方式重新实现一版

异常3

java.lang.String cannot be cast to java.time.ZonedDateTime

异常源:

<dependency>
    <groupId>com.github.housepower</groupId>
    <artifactId>clickhouse-integration-spark_2.12</artifactId>
    <version>2.5.4</version>
</dependency>

建表语句:

create table default.zwy_test (time DateTime,AMP Float64,NOZP Int32,value Int32,reason String ) ENGINE = MergeTree order by time

写入数据的schema:

root
 |-- time: string (nullable = true)
 |-- AMP: double (nullable = true)
 |-- NOZP: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- reason: string (nullable = true)

异常原因:

在Spark 3.0中，将值插入具有不同数据类型的表列中时，将根据ANSI SQL标准执行类型强制转换。标准SQL的转换规则参考，其中String转日期已经不属于隐式转换，而且spark2中String会自动转换为日期类型。因此spark2升级到spark3中，需要对String类型通过from_utc_timestamp等函数显式地转换

变动1

　　jdbc spark3增加keytab，principal参数，支持kerberos了

spark2到3的变更记录 https://spark.apache.org/docs/3.0.0/core-migration-guide.html

查看全文

相关阅读:
文件的类型
 读取文件，并按原格式输出文件内容的三种方式
 react hook代码框架
 器具的行为模式
 设计模式
 cpu 内存机器语言汇编高级语言平台之间的关系
 操作系统之内存
 操作系统之文件
 操作系统之IO
七层模型之应用层

原文地址：https://www.cnblogs.com/zhouwenyang/p/14654552.html