SparkContext.setCheckpointDir()

class SparkContext extends Logging with ExecutorAllocationClient

Main entry point for Spark functionality.

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]

Distribute a local Scala collection to form an RDD.

Note

parallelize is lazy. If seq is a mutable collection and it is modified after the call to parallelize but before the first action on the RDD, the resulting RDD reflects the modified collection, which can produce unexpected results. Pass a copy of the argument to avoid this.
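The advice to pass a copy can be sketched as follows. This is a minimal illustration, not part of the original API notes: the object name, the app name, and the local[2] master are assumptions chosen so the snippet can run on a single machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

// Hypothetical example object; names and settings are for illustration only.
object ParallelizeCopySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("parallelize-copy-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val buffer = ArrayBuffer(1, 2, 3)

      // toList builds an immutable snapshot of the buffer, so the RDD
      // no longer shares state with the mutable collection.
      val rdd = sc.parallelize(buffer.toList, numSlices = 2)

      // Mutation after parallelize but before the first action.
      buffer += 4

      // collect() is the first action; it evaluates the snapshot, not the buffer.
      println(rdd.collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Because the RDD is built from the snapshot, the element appended to the buffer afterwards does not appear in the collected result.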

checkpoint(self)

Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir(), and all references to its parent RDDs will be removed. In other words, the RDD's data is written out as one of the files under that directory, and its lineage is truncated. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD be persisted in memory; otherwise, saving it to a file will require recomputing it.
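A rough sketch of the order of calls described above, written against the Scala API (the checkpoint(self) signature itself comes from the Python API). The checkpoint path, object name, and local master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; names and paths are for illustration only.
object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumed local directory; on a cluster this would be an HDFS path.
      sc.setCheckpointDir("/tmp/spark-checkpoints")

      val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

      // Persist first, so that writing the checkpoint can reuse the cached
      // partitions instead of recomputing the RDD.
      squares.cache()

      // Only marks the RDD; nothing is written yet, and this must happen
      // before any job has run on this RDD.
      squares.checkpoint()

      // The first action computes the RDD and then saves the checkpoint files.
      println(s"count = ${squares.count()}")
      println(s"isCheckpointed = ${squares.isCheckpointed}")
      println(s"checkpoint file = ${squares.getCheckpointFile.getOrElse("none")}")
    } finally {
      sc.stop()
    }
  }
}
```

Calling cache() before checkpoint() follows the recommendation in the text: without it, the data would be computed once for the action and again when the checkpoint is saved.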

scala:

def setCheckpointDir(directory: String): Unit

Set the directory under which RDDs are going to be checkpointed. The directory must be an HDFS path if running on a cluster.
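As a small sketch of how the directory might be chosen, the snippet below takes the path from the first command-line argument and falls back to a local directory. Both the fallback path and the example HDFS URI in the comment are assumptions, not values from the original notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; names and paths are for illustration only.
object SetCheckpointDirSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("set-checkpoint-dir-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // On a cluster, pass an HDFS path such as
      // hdfs://namenode:8020/user/spark/checkpoints (assumed host and path).
      // The local fallback is only suitable for single-machine runs.
      val directory = if (args.nonEmpty) args(0) else "/tmp/spark-checkpoints"
      sc.setCheckpointDir(directory)

      // getCheckpointDir returns an Option[String] with the directory in use.
      println(sc.getCheckpointDir.getOrElse("checkpoint directory not set"))
    } finally {
      sc.stop()
    }
  }
}
```

A local path works only when the driver and executors share a file system, which is why the HDFS requirement applies on a cluster.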

Original article: https://www.cnblogs.com/suanec/p/4769768.html
