zoukankan      html  css  js  c++  java
  • Spark Programming Guide 2.2.0 笔记

    文档

    http://spark.apache.org/docs/latest/rdd-programming-guide.html

    初始化spark

    Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.

    RDD分区

    Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster.

    By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.

    RDD保存Object

    RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.

    Spark传递函数

    Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:

    Although it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method.(不要传递类变量和类方法,这样会导致整个Object的传递,可以先把变量拷贝为本地变量)

    Spark闭包

    To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.(注意闭包是原有对象的拷贝)

    Shuffle

    Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.

    Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.

    Accumulator

    For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.(行动操作对于accumulator的操作可以被保证有且只有一次,转化操作可能有多次)

  • 相关阅读:
    P1536 村村通 题解
    P1551 亲戚题解
    P1185 绘制二叉树 题解
    P3884 [JLOI2009]二叉树问题
    P1087 [NOIP2004 普及组] FBI 树
    P1305 新二叉树题解
    P1229 遍历问题
    P1030 [NOIP2001 普及组] 求先序排列题解
    P1827 [USACO3.4]美国血统 American Heritage 题解
    深度优先搜索dfs 讲解教程
  • 原文地址:https://www.cnblogs.com/zcy-backend/p/7545341.html
Copyright © 2011-2022 走看看