
The difference between spark.sql.shuffle.partitions and spark.default.parallelism

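When tuning the parallelism of a Spark job, two parameters come up again and again: spark.sql.shuffle.partitions and spark.default.parallelism. So what exactly is the difference between them?

First, let's look at how each one is defined.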

Property Name: spark.sql.shuffle.partitions
Default: 200
Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.

Property Name: spark.default.parallelism
Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
- Local mode: number of cores on the local machine
- Mesos fine grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
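The default rules for spark.default.parallelism quoted above can be written out as a small function. This is only a model of the documented wording, not Spark's own code; the function name documented_default_parallelism and the mode labels ("local", "mesos-fine-grained", and anything else for "Others") are assumptions made for illustration.

```python
def documented_default_parallelism(mode, cores=0, parent_partitions=()):
    """Model the documented default of spark.default.parallelism.

    mode: hypothetical label for the cluster manager ("local",
          "mesos-fine-grained", or any other string for "Others").
    cores: local cores in local mode, otherwise total executor cores.
    parent_partitions: partition counts of the parent RDDs, if any.
    """
    # Shuffle operations such as reduceByKey and join: the largest
    # number of partitions in a parent RDD.
    if parent_partitions:
        return max(parent_partitions)
    # Operations with no parent RDDs, such as parallelize.
    if mode == "local":
        return cores
    if mode == "mesos-fine-grained":
        return 8
    # Others: total executor cores or 2, whichever is larger.
    return max(cores, 2)


print(documented_default_parallelism("local", cores=4))
print(documented_default_parallelism("yarn", cores=1))
print(documented_default_parallelism("yarn", cores=48, parent_partitions=[10, 30]))
```

Setting spark.default.parallelism explicitly replaces these defaults, which is why the function only describes the "not set by user" case.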

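Going by the definitions alone, the two look very similar. In actual testing, however:

spark.default.parallelism only takes effect when working with RDDs; it has no effect on Spark SQL.
spark.sql.shuffle.partitions is a setting specific to Spark SQL.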
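This split can be checked with a short PySpark sketch. It assumes pyspark and a local JVM are installed. The function name compare_partition_counts, the local[2] master, and the sample data are illustrative assumptions, not part of the original test. Adaptive query execution is switched off because, on Spark versions where it is enabled by default, it may coalesce shuffle partitions and hide the configured value.

```python
def compare_partition_counts(shuffle_partitions=20, default_parallelism=10):
    """Run one shuffle through each API and return (rdd_partitions, sql_partitions)."""
    # Imported here so the module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[2]")  # assumption: a small local run
        .appName("partition-settings-demo")
        .config("spark.sql.shuffle.partitions", str(shuffle_partitions))
        .config("spark.default.parallelism", str(default_parallelism))
        # Keep the SQL shuffle at its configured size instead of letting
        # adaptive execution coalesce small partitions.
        .config("spark.sql.adaptive.enabled", "false")
        .getOrCreate()
    )
    try:
        sc = spark.sparkContext

        # RDD API: reduceByKey with no explicit partition count.
        pairs = sc.parallelize([(i % 5, 1) for i in range(1000)])
        rdd_partitions = pairs.reduceByKey(lambda a, b: a + b).getNumPartitions()

        # Spark SQL / DataFrame API: an aggregation that needs a shuffle.
        df = spark.range(1000).selectExpr("id % 5 AS key")
        sql_partitions = df.groupBy("key").count().rdd.getNumPartitions()

        return rdd_partitions, sql_partitions
    finally:
        spark.stop()
```

With the values above, compare_partition_counts(20, 10) should return (10, 20): the RDD shuffle follows spark.default.parallelism and the DataFrame shuffle follows spark.sql.shuffle.partitions. Using two different values makes it obvious which setting drove which result.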

Both values can be changed when submitting a job by passing --conf, as follows:

spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
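When submissions are scripted, the same command can be assembled programmatically. This is a minimal sketch, assuming a hypothetical application file my_job.py (the command above does not name one) and a hypothetical helper build_spark_submit.

```python
import shlex


def build_spark_submit(app_path, shuffle_partitions, default_parallelism):
    """Return the spark-submit argument list with both settings passed via --conf."""
    return [
        "spark-submit",
        "--conf", f"spark.sql.shuffle.partitions={shuffle_partitions}",
        "--conf", f"spark.default.parallelism={default_parallelism}",
        app_path,  # hypothetical application file, not named in the original command
    ]


command = build_spark_submit("my_job.py", shuffle_partitions=20, default_parallelism=20)
# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

On a machine where spark-submit is on the PATH, the list can be handed to subprocess.run(command, check=True). Keeping the arguments as a list avoids shell quoting problems.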


Original article: https://www.cnblogs.com/itboys/p/10960614.html