zoukankan      html  css  js  c++  java
  • spark coalesce和repartition的区别和使用场景

    区别:

    repartition底层调用的是coalesce方法,默认shuffle

    def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
    }

    coalesce方法的shuffle参数默认为false,默认不shuffle

    def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
        : RDD[T] = withScope {
      if (shuffle) {
        /** Distributes elements evenly across output partitions, starting from a random partition. */
        val distributePartition = (index: Int, items: Iterator[T]) => {
          var position = (new Random(index)).nextInt(numPartitions)
          items.map { t =>
            // Note that the hash code of the key will just be the key itself. The HashPartitioner
            // will mod it with the number of total partitions.
            position = position + 1
            (position, t)
          }
        } : Iterator[(Int, T)]
     
        // include a shuffle step so that our upstream tasks are still distributed
        new CoalescedRDD(
          new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
          new HashPartitioner(numPartitions)),
          numPartitions).values
      } else {
        new CoalescedRDD(this, numPartitions)
      }
    }

     

    使用场景:

    如果你减少分区数,考虑使用coalesce,这样可以避免执行shuffle。但是假如内存不够用,可能会引起内存溢出。

  • 相关阅读:
    mina中的发送延时
    微服务理论之五:微服务架构 vs. SOA架构
    同步
    JAVA中线程同步的方法(7种)汇总
    http连接管理
    MySQL存储引擎比较
    ZAB与Paxos算法的联系与区别
    syslog之二:syslog协议及rsyslog服务全解析
    微服务理论之六:ESB与SOA的关系
    DBCP连接池原理分析及配置用法
  • 原文地址:https://www.cnblogs.com/Alcesttt/p/11386049.html
Copyright © 2011-2022 走看看