1. The spark-submit script
The last line of spark-submit hands control to spark-class, with org.apache.spark.deploy.SparkSubmit as the main class and the original command-line arguments passed through:
exec $SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit "${ORIG_ARGS[@]}"
2. The main function of SparkSubmit
def main(args: Array[String]): Unit = {
  val appArgs = new SparkSubmitArguments(args)
  if (appArgs.verbose) {
    printStream.println(appArgs)
  }
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
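SparkSubmitArguments parses the command line and decides the action: SUBMIT is the default, while the --kill and --status options select KILL and REQUEST_STATUS. In the normal SUBMIT case, submit() prepares the launch environment and eventually invokes the application's main class, which is where a SparkContext is created and the RDDs described next come into play.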
Internally, each RDD is characterized by five main properties; each is listed below with the corresponding method from RDD.scala, and a minimal custom RDD sketch follows the list:
- A list of partitions
// Implemented by subclasses to return the set of partitions in this RDD. This method will only be called once, so it is safe to implement a time-consuming computation in it.
protected def getPartitions: Array[Partition]
- A function for computing each split
// Implemented by subclasses to compute a given partition.
def compute(split: Partition, context: TaskContext): Iterator[T]
- A list of dependencies on other RDDs
// Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only be called once, so it is safe to implement a time-consuming computation in it.
protected def getDependencies: Seq[Dependency[_]] = deps
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
// Optionally overridden by subclasses to specify how they are partitioned.
@transient val partitioner: Option[Partitioner] = None
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
// Optionally overridden by subclasses to specify placement preferences.
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
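To see how these five properties fit together, here is a minimal custom RDD sketch in Scala. The names LocalSeqRDD and LocalSeqPartition are illustrative and not part of Spark: the class splits a local Seq[Int] into a fixed number of partitions, implements only the two required methods (getPartitions and compute), and leaves the three optional properties at their defaults.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition that remembers its index and the element range it covers.
case class LocalSeqPartition(index: Int, start: Int, end: Int) extends Partition

// Illustrative example: split a local Seq[Int] into numSlices partitions.
// Passing Nil as the parent dependencies means getDependencies returns an empty Seq.
class LocalSeqRDD(sc: SparkContext, data: Seq[Int], numSlices: Int)
  extends RDD[Int](sc, Nil) {

  // Property 1: the list of partitions, built once when first requested.
  override protected def getPartitions: Array[Partition] = {
    val sliceSize = math.max(1, math.ceil(data.length.toDouble / numSlices).toInt)
    Array.tabulate[Partition](numSlices) { i =>
      LocalSeqPartition(i, i * sliceSize, math.min((i + 1) * sliceSize, data.length))
    }
  }

  // Property 2: compute one partition by returning an iterator over its slice of the data.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[LocalSeqPartition]
    data.slice(p.start, p.end).iterator
  }

  // Properties 3-5 keep their defaults: no parent dependencies (the Nil above),
  // partitioner = None, and getPreferredLocations returns Nil.
}

Assuming an active SparkContext named sc, new LocalSeqRDD(sc, 1 to 10, 3).collect() returns the original ten elements; Spark's built-in ParallelCollectionRDD (what sc.parallelize produces) follows the same pattern.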