zoukankan      html  css  js  c++  java
  • Apache Spark-1.0.0浅析(三 ):资源调度——Job提交

    基本概念:Job,Stage,Task,DagScheduler,TaskScheduler……

    RDD的操作可以分为Transformations和Actions,Transformations是lazy的不立即执行,Action则会触发作业的提交和执行。例如本例中的foreach

    def foreach(f: T => Unit) {
      sc.runJob(this, (iter: Iterator[T]) => iter.foreach(f))
    }

    一句话,Actions调用sc.runJob触发作业运行。

    SparkContext中的runJob有多个版本的重载

    foreach调用的版本,以rdd和func为参数,返回执行的结果

    /**
       * Run a job on all partitions in an RDD and return the results in an array.
       */
      def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
        runJob(rdd, func, 0 until rdd.partitions.size, false)
      }

    然后,进入下一个runJob,加入参数partitions和allowLocal

    /**
       * Run a job on a given set of partitions of an RDD, but take a function of type
       * `Iterator[T] => U` instead of `(TaskContext, Iterator[T]) => U`.
       */
      def runJob[T, U: ClassTag](
          rdd: RDD[T],
          func: Iterator[T] => U,
          partitions: Seq[Int],
          allowLocal: Boolean
          ): Array[U] = {
        runJob(rdd, (context: TaskContext, iter: Iterator[T]) => func(iter), partitions, allowLocal)
      }

    之后调用下一个runJob,将结果返回到result数组中

    /**
       * Run a function on a given set of partitions in an RDD and return the results as an array. The
       * allowLocal flag specifies whether the scheduler can run the computation on the driver rather
       * than shipping it out to the cluster, for short actions like first().
       */
      def runJob[T, U: ClassTag](
          rdd: RDD[T],
          func: (TaskContext, Iterator[T]) => U,
          partitions: Seq[Int],
          allowLocal: Boolean
          ): Array[U] = {
        val results = new Array[U](partitions.size)
        runJob[T, U](rdd, func, partitions, allowLocal, (index, res) => results(index) = res)
        results
      }

    最后调用,参数中加入resultHandler句柄

    /**
       * Run a function on a given set of partitions in an RDD and pass the results to the given
       * handler function. This is the main entry point for all actions in Spark. The allowLocal
       * flag specifies whether the scheduler can run the computation on the driver rather than
       * shipping it out to the cluster, for short actions like first().
       */
      def runJob[T, U: ClassTag](
          rdd: RDD[T],
          func: (TaskContext, Iterator[T]) => U,
          partitions: Seq[Int],
          allowLocal: Boolean,
          resultHandler: (Int, U) => Unit) {
        if (dagScheduler == null) {
          throw new SparkException("SparkContext has been shutdown")
        }
        val callSite = getCallSite
        val cleanedFunc = clean(func)
        logInfo("Starting job: " + callSite)
        val start = System.nanoTime
        dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
          resultHandler, localProperties.get)
        logInfo("Job finished: " + callSite + ", took " + (System.nanoTime - start) / 1e9 + " s")
        rdd.doCheckpoint()
      }

    sc.runJob最终调用dagScheduler.runJob。

    需要提到的一点是

    val cleanedFunc = clean(func)

    其作用在注释中

    /**
       * Clean a closure to make it ready to serialized and send to tasks
       * (removes unreferenced variables in $outer's, updates REPL variables)
       */
      private[spark] def clean[F <: AnyRef](f: F): F = {
        ClosureCleaner.clean(f)
        f
      }

    END

  • 相关阅读:
    spin_lock &amp; mutex_lock的差别?
    Java拾遗(一):浅析Java子类和父类的实例化顺序 及 陷阱
    Android ViewPager使用具体解释
    大数运算
    fragment 中利用spinner实现省市联动
    秒杀多线程第四篇 一个经典的多线程同步问题
    Ewebeditor最新漏洞及漏洞大全
    轻松设置百度搜索手写输入
    Rational Rose 2007 &amp;Rational Rose 2003 下载及破解方法和汉化文件下载
    svm中的数学和算法
  • 原文地址:https://www.cnblogs.com/kevingu/p/4677196.html
Copyright © 2011-2022 走看看