Spark Programming Guide 2.2.0 Notes

Documentation

http://spark.apache.org/docs/latest/rdd-programming-guide.html

Initializing Spark

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.

RDD Partitions

Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster.

By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
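The partition arithmetic above can be sketched in plain Python. This is a simplified model, not Spark or Hadoop code: the function names are hypothetical, and the real split computation in the Hadoop input format has more detail than a ceiling division.

```python
import math

HDFS_BLOCK_BYTES = 128 * 1024 * 1024  # default HDFS block size quoted in the notes

def default_partitions(file_bytes, block_bytes=HDFS_BLOCK_BYTES):
    """Approximate default: one partition per block, and at least one."""
    return max(1, math.ceil(file_bytes / block_bytes))

def effective_partitions(file_bytes, requested, block_bytes=HDFS_BLOCK_BYTES):
    """A requested count can raise the partition count but not push it below the block count."""
    return max(requested, default_partitions(file_bytes, block_bytes))

def suggested_range(total_cores):
    """The 2-4 partitions per CPU rule of thumb, as a (low, high) pair."""
    return 2 * total_cores, 4 * total_cores

one_gib = 1024 ** 3
print(default_partitions(one_gib))        # 8 blocks of 128 MB
print(effective_partitions(one_gib, 4))   # 8, a smaller request is ignored
print(effective_partitions(one_gib, 32))  # 32
print(suggested_range(16))                # (32, 64)
```

For a 1 GiB file on a 16-core cluster, the default of 8 partitions falls well short of the suggested 32 to 64, which is the kind of case where passing a larger value makes sense.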

Saving RDDs as Object Files

RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
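As a rough analogy only, the same "serialize anything, read it back later" idea looks like this with Python's pickle module. The two helper functions are hypothetical stand-ins named after the Spark methods; they do not call Spark and do not reproduce its on-disk format.

```python
import os
import pickle
import tempfile

def save_as_object_file(records, path):
    """Write each record as one serialized object, back to back."""
    with open(path, "wb") as f:
        for record in records:
            pickle.dump(record, f)

def object_file(path):
    """Read records back until the file is exhausted."""
    records = []
    with open(path, "rb") as f:
        while True:
            try:
                records.append(pickle.load(f))
            except EOFError:
                return records

# Records of mixed shape: no schema is needed, unlike a format such as Avro.
records = [("a", 1), ("b", [2, 3]), {"c": 4.0}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part-00000")
    save_as_object_file(records, path)
    restored = object_file(path)
```

In PySpark the closest counterparts are RDD.saveAsPickleFile and SparkContext.pickleFile.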

Passing Functions to Spark

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:
- Anonymous function syntax, which can be used for short pieces of code.
- Static methods in a global singleton object.
Although it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method. (Note: avoid passing fields or methods of a class instance, because doing so ships the whole object; copy the field into a local variable first.)
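The effect of copying a field into a local variable can be seen without Spark by inspecting what a Python function captures. The sketch below is an illustration under that assumption: the Scaler class and the captured_values helper are hypothetical, and what Spark actually serializes depends on its own closure serializer.

```python
class Scaler:
    """Hypothetical helper holding a small setting next to a large payload."""

    def __init__(self, factor):
        self.factor = factor
        self.lookup = list(range(100_000))  # large field the function never needs

    def scale_via_self(self):
        # The lambda reads self.factor, so it captures the whole instance.
        return lambda x: x * self.factor

    def scale_via_local(self):
        factor = self.factor  # copy the field into a local variable first
        return lambda x: x * factor

def captured_values(fn):
    """Return the objects a function closes over."""
    return [cell.cell_contents for cell in (fn.__closure__ or ())]

scaler = Scaler(3)
via_self = scaler.scale_via_self()
via_local = scaler.scale_via_local()

print([type(v).__name__ for v in captured_values(via_self)])   # ['Scaler']
print([type(v).__name__ for v in captured_values(via_local)])  # ['int']
```

Both functions compute the same result, but only the second can be shipped without dragging the 100,000-element lookup list along.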

Spark Closures

To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor. (Note: the closure holds copies of the original objects, not the objects themselves.)
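A minimal plain-Python model of that copy semantics, assuming pickle as a stand-in for Spark's closure serialization. The run_on_executor function is hypothetical; it only mimics an executor that deserializes its own copy of the captured state.

```python
import pickle

def run_on_executor(serialized_closure, partition):
    # Each executor deserializes its own private copy of the captured variables.
    state = pickle.loads(serialized_closure)
    for x in partition:
        state["counter"] += x
    return state["counter"]

driver_state = {"counter": 0}
payload = pickle.dumps(driver_state)  # what would be sent to each executor

partitions = [[1, 2], [3, 4]]
executor_totals = [run_on_executor(payload, p) for p in partitions]

print(executor_totals)          # [3, 7]
print(driver_state["counter"])  # 0, the driver's variable was never touched
```

This is why incrementing a driver-side counter inside foreach() does not behave as expected on a cluster: every update lands on a copy.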

Shuffle

Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
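The write and read paths described above can be sketched as a toy model. Everything here is an assumption made for illustration: the CRC-based partitioner, the function names, and the use of an in-memory list plus an offset table in place of a real data file and index.

```python
import zlib

def target_partition(key, num_partitions):
    # Deterministic stand-in for a hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def write_map_output(records, num_partitions):
    """Sort one map task's records by target partition; return (data, offsets)."""
    ordered = sorted(records, key=lambda kv: target_partition(kv[0], num_partitions))
    offsets = [0] * (num_partitions + 1)
    for key, _ in ordered:
        offsets[target_partition(key, num_partitions) + 1] += 1
    for p in range(num_partitions):
        offsets[p + 1] += offsets[p]  # prefix sums mark where each block starts
    return ordered, offsets

def read_reduce_input(map_outputs, partition):
    """Collect the block belonging to one reduce partition from every map output."""
    block = []
    for data, offsets in map_outputs:
        block.extend(data[offsets[partition]:offsets[partition + 1]])
    return block

map_outputs = [
    write_map_output([("a", 1), ("b", 2), ("c", 3)], 2),
    write_map_output([("a", 4), ("d", 5)], 2),
]
reduce_inputs = [read_reduce_input(map_outputs, p) for p in range(2)]
```

Because each map output is sorted by target partition, a reduce task can fetch its share as one contiguous slice per map task instead of scanning the whole output.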

Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and the ByKey operations generate these on the reduce side. When data does not fit in memory, Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
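To make the map-side and reduce-side structures concrete, here is a toy reduceByKey in plain Python. It assumes ordinary dicts stand in for the in-memory tables and a single reduce partition; the function names are hypothetical and spilling to disk is not modeled.

```python
from operator import add

def merge_into(table, records, func):
    """Fold (key, value) records into an in-memory hash table, one entry per key."""
    for key, value in records:
        table[key] = func(table[key], value) if key in table else value
    return table

def reduce_by_key(map_inputs, func):
    # Map side: each task combines its own records before anything is transferred.
    map_outputs = [list(merge_into({}, records, func).items()) for records in map_inputs]
    shuffled = sum(len(out) for out in map_outputs)
    # Reduce side: merge the partial results arriving from every map task.
    result = {}
    for out in map_outputs:
        merge_into(result, out, func)
    return result, shuffled

map_inputs = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5), ("b", 6)],
]
totals, shuffled_records = reduce_by_key(map_inputs, add)

print(totals)            # {'a': 9, 'b': 12}
print(shuffled_records)  # 4 records cross the shuffle instead of 6
```

The tables are what cost heap memory: each one grows with the number of distinct keys a task sees, which is the situation in which spilling comes into play.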

Accumulator

For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware that each task’s update may be applied more than once if tasks or job stages are re-executed. (Note: accumulator updates made in actions are guaranteed to be applied exactly once; updates made in transformations may be applied more than once.)
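The difference between the two guarantees can be illustrated with a toy model. ToyAccumulator and its methods are hypothetical and do not reflect how Spark tracks task completions internally; the sketch only shows the observable outcome when one task is re-executed.

```python
class ToyAccumulator:
    """Toy model of the two update semantics; not Spark's implementation."""

    def __init__(self):
        self.value = 0
        self._seen_tasks = set()

    def add_from_action(self, task_id, update):
        # Actions: a task's update counts once, even if the task is restarted.
        if task_id not in self._seen_tasks:
            self._seen_tasks.add(task_id)
            self.value += update

    def add_from_transformation(self, update):
        # Transformations: every (re-)execution applies its update again.
        self.value += update

# (task_id, update) pairs; task 1 appears twice because it was re-executed.
executions = [(0, 5), (1, 7), (1, 7)]

in_action = ToyAccumulator()
in_transformation = ToyAccumulator()
for task_id, update in executions:
    in_action.add_from_action(task_id, update)
    in_transformation.add_from_transformation(update)

print(in_action.value)          # 12
print(in_transformation.value)  # 19, the retried task was counted twice
```

When an exact count matters, this suggests updating the accumulator inside an action such as foreach() rather than inside a transformation such as map().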

Original post: https://www.cnblogs.com/zcy-backend/p/7545341.html