• The difference between spark.sql.shuffle.partitions and spark.default.parallelism

When tuning the parallelism of Spark jobs, two parameters come up frequently: spark.sql.shuffle.partitions and spark.default.parallelism. What exactly is the difference between them?

First, let's look at their definitions:

Property Name: spark.sql.shuffle.partitions
Default: 200
Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.

Property Name: spark.default.parallelism
Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
- Local mode: number of cores on the local machine
- Mesos fine-grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
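To make the two settings concrete, here is a minimal sketch of configuring both on a SparkSession. The app name, the local[4] master, and the values 8 and 50 are illustrative assumptions, not recommendations; AQE is switched off only so that the raw shuffle partition counts stay observable later.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: the master URL, app name, and the values 8 and 50
// are arbitrary choices for illustration.
val spark = SparkSession.builder()
  .appName("parallelism-demo")
  .master("local[4]")
  .config("spark.default.parallelism", "8")      // used by RDD shuffle operations
  .config("spark.sql.shuffle.partitions", "50")  // used by Spark SQL shuffles
  .config("spark.sql.adaptive.enabled", "false") // keep Spark 3.x AQE from coalescing shuffle partitions
  .getOrCreate()

// Both values are visible at runtime:
println(spark.sparkContext.defaultParallelism)          // 8
println(spark.conf.get("spark.sql.shuffle.partitions")) // 50
```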

Their definitions look quite similar, but in actual testing:

• spark.default.parallelism only takes effect when working with RDDs; it has no effect on Spark SQL.
• spark.sql.shuffle.partitions, on the other hand, applies only to Spark SQL (see the sketch below).
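Continuing with the SparkSession configured above, a small sketch of both code paths illustrates the split. The expected counts assume the illustrative values 8 and 50 and that adaptive query execution remains disabled; with AQE on, Spark 3.x may coalesce the SQL shuffle partitions to a smaller number.

```scala
// RDD path: reduceByKey with no explicit partition count falls back to
// spark.default.parallelism.
val counts = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)
println(counts.getNumPartitions)   // 8 -- from spark.default.parallelism

// Spark SQL path: groupBy triggers a shuffle whose partition count comes
// from spark.sql.shuffle.partitions; spark.default.parallelism is ignored.
import spark.implicits._
val agg = Seq(("a", 1), ("b", 2), ("a", 3))
  .toDF("key", "value")
  .groupBy("key")
  .sum("value")
println(agg.rdd.getNumPartitions)  // 50 -- from spark.sql.shuffle.partitions
```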
Original article: https://www.cnblogs.com/lestatzhang/p/10611324.html