  • Apache Spark 1.0.0 Analysis (Part 3): Resource Scheduling - Job Submission

    Basic concepts: Job, Stage, Task, DAGScheduler, TaskScheduler...

    RDD operations fall into two categories: Transformations and Actions. Transformations are lazy and are not executed immediately, while Actions trigger the submission and execution of a job. The foreach in this example is such an Action:

    def foreach(f: T => Unit) {
      sc.runJob(this, (iter: Iterator[T]) => iter.foreach(f))
    }

    In short, Actions call sc.runJob to trigger job execution.
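
    To make the lazy/eager split concrete, here is a minimal sketch (assuming an existing SparkContext named sc); only the final foreach, an Action, actually submits a job:

    val nums = sc.parallelize(1 to 4)  // lazy: just builds an RDD
    val doubled = nums.map(_ * 2)      // lazy: a Transformation, only records lineage
    doubled.foreach(x => println(x))   // eager: an Action, calls sc.runJob and runs the job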

    runJob in SparkContext has several overloaded versions.

    The version called by foreach takes the rdd and func as parameters and returns the results of the execution:

    /**
     * Run a job on all partitions in an RDD and return the results in an array.
     */
    def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
      runJob(rdd, func, 0 until rdd.partitions.size, false)
    }
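
    As an illustration, a hypothetical direct call to this overload (assuming an RDD[Int] named nums) computes one result per partition, for example per-partition sums:

    // One element per partition: the scheduler runs func over every partition's iterator.
    val perPartitionSums: Array[Int] =
      sc.runJob(nums, (iter: Iterator[Int]) => iter.sum)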

    This in turn calls the next runJob, adding the partitions and allowLocal parameters:

    /**
     * Run a job on a given set of partitions of an RDD, but take a function of type
     * `Iterator[T] => U` instead of `(TaskContext, Iterator[T]) => U`.
     */
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: Iterator[T] => U,
        partitions: Seq[Int],
        allowLocal: Boolean
        ): Array[U] = {
      runJob(rdd, (context: TaskContext, iter: Iterator[T]) => func(iter), partitions, allowLocal)
    }
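
    These extra parameters let a caller restrict the job to a subset of partitions. As a sketch (again assuming the hypothetical nums), a short action in the spirit of first() only needs partition 0 and may run locally on the driver:

    // Sketch: run on partition 0 only, allowing driver-local execution,
    // which is how short actions avoid a full cluster round trip.
    val headPartition: Array[Array[Int]] =
      sc.runJob(nums, (iter: Iterator[Int]) => iter.toArray, Seq(0), allowLocal = true)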

    That version then calls the next runJob, which collects the results into the results array:

    /**
     * Run a function on a given set of partitions in an RDD and return the results as an array. The
     * allowLocal flag specifies whether the scheduler can run the computation on the driver rather
     * than shipping it out to the cluster, for short actions like first().
     */
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        allowLocal: Boolean
        ): Array[U] = {
      val results = new Array[U](partitions.size)
      runJob[T, U](rdd, func, partitions, allowLocal, (index, res) => results(index) = res)
      results
    }
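
    The handler (index, res) => results(index) = res writes each partition's result into its own slot, so partial results can arrive in any order. A plain-Scala sketch of the same pattern (hypothetical names, no Spark involved):

    // demoRunJob stands in for the scheduler: it computes one result per
    // partition index and hands each (index, result) pair to the handler.
    def demoRunJob[U](numPartitions: Int, compute: Int => U,
                      resultHandler: (Int, U) => Unit): Unit =
      (0 until numPartitions).foreach(i => resultHandler(i, compute(i)))

    val results = new Array[Int](3)
    demoRunJob[Int](3, i => i * 10, (i, r) => results(i) = r)
    // results is now Array(0, 10, 20)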

    Finally, the version that adds the resultHandler callback to the parameter list is called:

    /**
     * Run a function on a given set of partitions in an RDD and pass the results to the given
     * handler function. This is the main entry point for all actions in Spark. The allowLocal
     * flag specifies whether the scheduler can run the computation on the driver rather than
     * shipping it out to the cluster, for short actions like first().
     */
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        allowLocal: Boolean,
        resultHandler: (Int, U) => Unit) {
      if (dagScheduler == null) {
        throw new SparkException("SparkContext has been shutdown")
      }
      val callSite = getCallSite
      val cleanedFunc = clean(func)
      logInfo("Starting job: " + callSite)
      val start = System.nanoTime
      dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
        resultHandler, localProperties.get)
      logInfo("Job finished: " + callSite + ", took " + (System.nanoTime - start) / 1e9 + " s")
      rdd.doCheckpoint()
    }

    So sc.runJob ultimately calls dagScheduler.runJob.

    One detail worth mentioning is

    val cleanedFunc = clean(func)

    Its purpose is described in the comment:

    /**
     * Clean a closure to make it ready to be serialized and sent to tasks
     * (removes unreferenced variables in $outer's, updates REPL variables)
     */
    private[spark] def clean[F <: AnyRef](f: F): F = {
      ClosureCleaner.clean(f)
      f
    }
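
    To see why cleaning matters, note that in Scala 2.x a closure defined inside a class can capture the enclosing instance through its $outer reference, forcing the whole object to be serialized. A sketch with hypothetical Helper classes:

    class Helper(factor: Int) {
      // The closure body reads this.factor, so it captures `this`; without
      // cleaning, the entire Helper instance would be shipped to the tasks.
      def scale(rdd: org.apache.spark.rdd.RDD[Int]) = rdd.map(_ * factor)
    }

    class Helper2(factor: Int) {
      // Common workaround: copy the field into a local val so the closure
      // captures only the small local value, not the enclosing object.
      def scale(rdd: org.apache.spark.rdd.RDD[Int]) = {
        val f = factor
        rdd.map(_ * f)
      }
    }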

    END
