  • Spark Programming Guide 2.2.0 Notes

    Documentation

    http://spark.apache.org/docs/latest/rdd-programming-guide.html

    Initializing Spark

    Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
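
    A minimal sketch of that lifecycle (the app name and master URL are placeholders):

        import org.apache.spark.{SparkConf, SparkContext}

        // Describe the application, then create the one active context for this JVM.
        val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // ... build and run RDD operations with sc ...

        // Stop the active context before another one may be created.
        sc.stop()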

    RDD Partitions

    Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster.

    By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
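
    A sketch of both ways of controlling the partition count (the HDFS path is a placeholder):

        // Distribute a local collection across an explicit number of partitions.
        val data = Array(1, 2, 3, 4, 5)
        val distData = sc.parallelize(data, 10)

        // For files, the second argument is a *minimum* partition count;
        // Spark will never create fewer partitions than HDFS blocks.
        val lines = sc.textFile("hdfs://namenode:9000/user/data.txt", 16)

        println(distData.getNumPartitions)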

    Saving RDDs as Object Files

    RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
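
    For example (the output path is a placeholder; the element type must be supplied when reading back):

        // Write an RDD as a file of serialized Java objects.
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
        pairs.saveAsObjectFile("/tmp/pairs-object-file")

        // Read it back; objectFile is typed by the caller.
        val restored = sc.objectFile[(String, Int)]("/tmp/pairs-object-file")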

    Passing Functions to Spark

    Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this: anonymous function syntax, for short pieces of code, and static methods in a global singleton object.

    Although it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method. (Do not reference instance fields or methods directly, since that ships the entire enclosing object to the executors; copy the field into a local variable first.)
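
    A sketch contrasting the safe and unsafe patterns (MyFunctions and Helper are illustrative names):

        import org.apache.spark.rdd.RDD

        object MyFunctions {
          // A function in a global singleton object: only the function is shipped.
          def addOne(x: Int): Int = x + 1
        }

        class Helper(val increment: Int) {
          def addAll(rdd: RDD[Int]): RDD[Int] = {
            // Referencing `increment` directly would capture `this` and
            // serialize the whole Helper instance to every executor.
            // Copying the field to a local variable avoids that:
            val inc = increment
            rdd.map(x => x + inc)
          }
        }

        // Usage: rdd.map(MyFunctions.addOne)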

    Spark Closures

    To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor. (Note that the closure contains copies of the original variables, not the originals themselves.)
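
    The guide illustrates this with a counter that is silently copied into each task’s closure; a condensed version:

        var counter = 0
        val rdd = sc.parallelize(1 to 10)

        // Wrong: each executor mutates its own deserialized copy of `counter`;
        // the driver's variable is never updated (prints 0 in cluster mode).
        rdd.foreach(x => counter += x)
        println(s"Counter value: $counter")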

    Shuffle

    Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.

    Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
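
    For instance, reduceByKey builds those map-side structures to combine values per key within each partition before shuffling the partial results (a minimal sketch):

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

        // Map-side structures hold per-partition partial sums; the shuffle
        // then routes each key's partials to one reduce-side partition.
        val counts = pairs.reduceByKey(_ + _)

        // The number of shuffle partitions can also be set explicitly.
        val counts8 = pairs.reduceByKey(_ + _, 8)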

    Accumulator

    For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed. (Accumulator updates inside actions are guaranteed to be applied exactly once; inside transformations they may be applied multiple times.)
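
    A sketch using the built-in long accumulator from the Spark 2.x API (the accumulator name is a placeholder):

        // Create a named accumulator on the driver.
        val acc = sc.longAccumulator("sum")

        // Inside an action: each task's update is applied exactly once,
        // even if the task is restarted.
        sc.parallelize(1 to 4).foreach(x => acc.add(x))
        println(acc.value)  // 10

        // Inside a transformation (e.g. map), updates may run more than once
        // if a stage is re-executed, so use them there only for debugging.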
