zoukankan      html  css  js  c++  java
  • Spark Programming Guide 2.2.0 笔记

    文档

    http://spark.apache.org/docs/latest/rdd-programming-guide.html

    初始化spark

    Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.

    RDD分区

    Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster.

    By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.

    RDD保存Object

    RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.

    Spark传递函数

    Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:

    Although it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method.(不要传递类变量和类方法,这样会导致整个Object的传递,可以先把变量拷贝为本地变量)

    Spark闭包

    To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.(注意闭包是原有对象的拷贝)

    Shuffle

    Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.

    Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.

    Accumulator

    For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.(行动操作对于accumulator的操作可以被保证有且只有一次,转化操作可能有多次)

  • 相关阅读:
    Servant:基于Web的IIS管理工具
    mono-3.4.0 源码安装时出现的问题 [do-install] Error 2 [install-pcl-targets] Error 1 解决方法
    使用 OWIN Self-Host ASP.NET Web API 2
    Xamarin和微软发起.NET基金会
    SQLite vs MySQL vs PostgreSQL:关系型数据库比较
    Mono 3.2.7发布,JIT和GC进一步改进
    如何使用Microsoft技术栈
    c#开源消息队列中间件EQueue 教程
    通过一组RESTful API暴露CQRS系统功能
    NEsper Nuget包
  • 原文地址:https://www.cnblogs.com/zcy-backend/p/7545341.html
Copyright © 2011-2022 走看看