  • Spark vs. Hadoop: Resilient Distributed Datasets

    Hadoop: iteration is expensive, because each iteration launches a complete MapReduce job.

    Spark: the primary goal is to avoid excessive network and disk I/O overhead during computation.

     Resilient Distributed Datasets

    http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/slides/spark.pdf

    Resilient Distributed Datasets
    Presented by Henggang Cui
    15799b Talk
    Why not MapReduce
    • Provides fault tolerance, but:
    • Hard to reuse intermediate results across multiple computations
    – data must be written to stable storage to be shared across jobs
    • Hard to support interactive ad-hoc queries
    Why not Other In-Memory Storage
    • Examples: Piccolo
    – Apply fine-grained updates to shared states
    • Efficient, but:
    • Hard to provide fault-tolerance
    – need replication or checkpointing
    Resilient Distributed Datasets (RDDs)
    • Restricted form of distributed shared memory
    – read-only, partitioned collection of records
    – can only be built through coarse-grained deterministic transformations on either
    • data in stable storage, or
    • other RDDs
    • Express computation by
    – defining RDDs
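The definition above can be illustrated with a toy model. This is only a sketch in plain Scala, not Spark's implementation: `ToyRDD`, its methods, and the sample data are hypothetical names chosen for illustration. It shows a read-only, partitioned collection where every transformation applies one function to all records and returns a new dataset.

```scala
// Toy model of an RDD: an immutable, partitioned collection.
// ToyRDD and its methods are hypothetical names, not Spark's API.
final case class ToyRDD[T](partitions: Vector[Vector[T]]) {
  // Coarse-grained: the same function is applied to every record,
  // and the result is a new dataset; the original is never modified.
  def map[U](f: T => U): ToyRDD[U] = ToyRDD(partitions.map(_.map(f)))
  def filter(p: T => Boolean): ToyRDD[T] = ToyRDD(partitions.map(_.filter(p)))
  // Gather all partitions into one local collection.
  def collect(): Vector[T] = partitions.flatten
}

object ToyRDDDemo {
  // Stand-in for data in stable storage, split into two partitions.
  val base: ToyRDD[Int] = ToyRDD(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
  // Each step derives a new ToyRDD from an existing one.
  val evens: ToyRDD[Int] = base.filter(_ % 2 == 0)
  val scaled: ToyRDD[Int] = evens.map(_ * 10)
}
```

Because nothing is updated in place, `ToyRDDDemo.base` still holds its original records after `evens` and `scaled` are defined.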
    Fault Recovery
    • Efficient fault recovery using lineage
    – log one operation to apply to many elements (lineage)
    – recompute lost partitions on failure
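To make lineage-based recovery concrete, here is a minimal sketch in plain Scala. It assumes a single logged operation and in-memory stand-ins for stable storage; `Lineage`, `recompute`, and the sample data are hypothetical names, not Spark's API. Only the lost partition is rebuilt.

```scala
object LineageRecovery {
  // A lineage entry: the input partitions in stable storage plus the
  // single function that was logged for the whole dataset.
  final case class Lineage[A, B](source: Vector[Vector[A]], op: A => B) {
    // Recompute only the partition at `index`; other partitions are untouched.
    def recompute(index: Int): Vector[B] = source(index).map(op)
  }

  // Hypothetical input data, two partitions.
  val source: Vector[Vector[String]] =
    Vector(Vector("ERROR a", "INFO b"), Vector("ERROR c"))
  val lineage: Lineage[String, Int] = Lineage(source, (s: String) => s.length)

  // Materialized result, one entry per partition.
  val computed: Map[Int, Vector[Int]] =
    source.indices.map(i => i -> lineage.recompute(i)).toMap

  // Simulate a failure that loses partition 1, then rebuild it from lineage.
  val afterFailure: Map[Int, Vector[Int]] = computed - 1
  val recovered: Map[Int, Vector[Int]] =
    afterFailure + (1 -> lineage.recompute(1))
}
```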
    Example
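    // base RDD backed by a file in HDFS (path elided in the source)
    val lines = spark.textFile("hdfs://...")
    // keep only the lines that start with "ERROR"
    val errors = lines.filter(_.startsWith("ERROR"))
    // narrow the errors down to those that mention HDFS
    val hdfs_errors = errors.filter(_.contains("HDFS"))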
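To experiment with the same filter chain without a cluster or HDFS, the following sketch uses plain Scala collections. The sample log lines are made up for illustration, and `LogFilterDemo` is a hypothetical name.

```scala
object LogFilterDemo {
  // Hypothetical sample log lines standing in for the HDFS file.
  val lines: Seq[String] = Seq(
    "ERROR HDFS: block missing",
    "INFO job started",
    "ERROR MapReduce: task failed",
    "WARN HDFS: slow datanode"
  )
  // Same two predicates as in the slide example.
  val errors: Seq[String] = lines.filter(_.startsWith("ERROR"))
  val hdfsErrors: Seq[String] = errors.filter(_.contains("HDFS"))
}
```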
    Advantages of the RDD Model
    • Efficient fault recovery
    – fine-grained and low-overhead using lineage
    • Immutability helps mitigate stragglers
    – slow tasks can be rerun as backup tasks
    • Graceful degradation when there is not enough RAM
    Spark
    • Implementation of the RDD abstraction
    – Scala interface
    • Two components
    – Driver
    – Workers
    Spark Runtime
    • Driver
    – defines and invokes actions on RDDs
    – tracks the RDDs’ lineage
    • Workers
    – store RDD partitions
    – perform RDD transformations
    Supported RDD Operations
    • Transformations
    – map(f: T => U)
    – filter(f: T => Boolean)
    – join()
    – ... (and lots of others)
    • Actions
    – count()
    – save()
    – ... (and lots of others)
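In Spark, transformations are lazy and only an action triggers computation. The sketch below imitates that split with a Scala collection view, which is an analogy rather than Spark's mechanism. `LazyOpsDemo` and the counter are hypothetical names used for illustration.

```scala
object LazyOpsDemo {
  // Counts how many records the map function has actually processed.
  var evaluated: Int = 0

  // A transformation on a view only records what to do; nothing runs yet.
  val mapped = (1 to 5).view.map { x => evaluated += 1; x * 2 }
  val evaluatedBeforeAction: Int = evaluated

  // An "action" (here: counting matching elements) forces the computation.
  val bigCount: Int = mapped.count(_ > 4)
  val evaluatedAfterAction: Int = evaluated
}
```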
    Representing RDDs
    • A graph-based representation for RDDs
    • Pieces of information for each RDD
    – a set of partitions
    – a set of dependencies on parent RDDs
    – a function for computing it from its parents
    – metadata about its partitioning scheme and data placement
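The four pieces of information listed above map naturally onto a small interface. This is a sketch under assumptions: `RDDNode`, `SourceNode`, `MappedNode`, and the `node-N` location strings are hypothetical and do not mirror Spark's actual classes.

```scala
// Hypothetical names throughout; one member per piece of information.
trait RDDNode[+T] {
  def partitions: Vector[Int]                            // a set of partition ids
  def parents: Vector[RDDNode[Any]]                      // dependencies on parent RDDs
  def compute(partition: Int): Vector[T]                 // function deriving a partition
  def preferredLocations(partition: Int): Vector[String] // data placement metadata
}

// A leaf of the graph: data that already exists in storage.
final class SourceNode(data: Vector[Vector[Int]]) extends RDDNode[Int] {
  val partitions: Vector[Int] = data.indices.toVector
  val parents: Vector[RDDNode[Any]] = Vector.empty
  def compute(partition: Int): Vector[Int] = data(partition)
  def preferredLocations(partition: Int): Vector[String] =
    Vector(s"node-$partition")
}

// A node derived from one parent by applying f to every record.
final class MappedNode[T, U](parent: RDDNode[T], f: T => U) extends RDDNode[U] {
  val partitions: Vector[Int] = parent.partitions
  val parents: Vector[RDDNode[Any]] = Vector(parent)
  def compute(partition: Int): Vector[U] = parent.compute(partition).map(f)
  def preferredLocations(partition: Int): Vector[String] =
    parent.preferredLocations(partition)
}

object RDDGraphDemo {
  val source = new SourceNode(Vector(Vector(1, 2), Vector(3)))
  val doubled = new MappedNode[Int, Int](source, _ * 2)
}
```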
    RDD Dependencies
    • Narrow dependencies
    – each partition of the parent RDD is used by at most one partition of the child RDD
    • Wide dependencies
    – multiple child partitions may depend on a single parent partition
    RDD Dependencies
    • Narrow dependencies
    – allow for pipelined execution on one cluster node
    – easy fault recovery
    • Wide dependencies
    – require data from all parent partitions to be available and to be shuffled across the nodes
    – a single failed node might cause a complete re-execution
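The narrow/wide distinction can be checked mechanically from a partition-level dependency map. The sketch below assumes such a map is available; `Deps`, `isNarrow`, and the two example maps are hypothetical names for illustration.

```scala
object DependencyKinds {
  // For each child partition, the parent partitions it reads from.
  type Deps = Map[Int, Set[Int]]

  // Narrow: every parent partition is used by at most one child partition.
  def isNarrow(deps: Deps): Boolean = {
    val uses = deps.values.toSeq.flatten // one entry per (child, parent) edge
    uses.distinct.size == uses.size
  }

  // map-like: child i reads only parent i.
  val mapLike: Deps = Map(0 -> Set(0), 1 -> Set(1))
  // Shuffle-like: every child reads every parent.
  val shuffleLike: Deps = Map(0 -> Set(0, 1), 1 -> Set(0, 1))
}
```

With these inputs, `isNarrow(mapLike)` holds and `isNarrow(shuffleLike)` does not, which matches the definitions on the slides.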
    Job Scheduling
    • To execute an action on an RDD
    – the scheduler determines the stages from the RDD’s lineage graph
    – each stage contains as many pipelined transformations with narrow dependencies as possible
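The stage-splitting rule can be sketched for the simplest case. As a simplifying assumption, the lineage here is a linear chain rather than a general graph. `Op`, `stages`, and the operation names in the example are hypothetical labels.

```scala
object StagePlanner {
  // One step of the lineage; `wide` marks a wide (shuffle) dependency.
  final case class Op(name: String, wide: Boolean)

  // Walk a linear lineage chain and start a new stage at every wide
  // dependency, so each stage is a maximal run of pipelined narrow steps.
  def stages(lineage: List[Op]): List[List[String]] = {
    val grouped = lineage.foldLeft(List(List.empty[String])) { (acc, op) =>
      if (op.wide) List(op.name) :: acc        // shuffle boundary opens a new stage
      else (op.name :: acc.head) :: acc.tail   // narrow op joins the current stage
    }
    // Stages and their contents were built in reverse; restore the order.
    grouped.map(_.reverse).reverse.filter(_.nonEmpty)
  }

  val plan: List[List[String]] = stages(List(
    Op("textFile", wide = false),
    Op("map", wide = false),
    Op("groupByKey", wide = true),
    Op("filter", wide = false)
  ))
}
```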
    Memory Management
    • Three options for persistent RDDs
    – in-memory storage as deserialized Java objects
    – in-memory storage as serialized data
    – on-disk storage
    • LRU eviction policy at the level of RDDs
    – when there’s not enough memory, evict a partition from the least recently accessed RDD
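The eviction rule above can be sketched as a small cache that orders RDDs by last access. This is a toy under assumptions: capacity is counted in partitions rather than bytes, and `RddCache`, `touch`, and `put` are hypothetical names, not Spark's memory manager.

```scala
import scala.collection.mutable

object LruCacheDemo {
  // Tracks RDD ids in access order (oldest first) with their cached partitions.
  final class RddCache(capacity: Int) {
    require(capacity >= 0, "capacity must be non-negative")
    private val partitionsByRdd = mutable.LinkedHashMap.empty[String, List[Int]]
    private def size: Int = partitionsByRdd.values.map(_.size).sum

    // Mark an RDD as most recently used by moving it to the end.
    def touch(rdd: String): Unit =
      partitionsByRdd.remove(rdd).foreach(ps => partitionsByRdd.put(rdd, ps))

    def put(rdd: String, partition: Int): Unit = {
      val existing = partitionsByRdd.remove(rdd).getOrElse(Nil)
      partitionsByRdd.put(rdd, partition :: existing)
      // Evict one partition at a time from the least recently accessed RDD.
      while (size > capacity) {
        val (victim, ps) = partitionsByRdd.head
        if (ps.tail.isEmpty) partitionsByRdd.remove(victim)
        else partitionsByRdd.update(victim, ps.tail)
      }
    }

    def cached: Map[String, List[Int]] = partitionsByRdd.toMap
  }

  val cache = new RddCache(capacity = 3)
  cache.put("a", 0); cache.put("a", 1); cache.put("b", 0)
  cache.touch("a")  // "a" is now more recently accessed than "b"
  cache.put("c", 0) // over capacity: a partition of "b" is evicted
  val result: Map[String, List[Int]] = cache.cached
}
```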
    Checkpointing
    • Checkpoint RDDs to prevent long lineage chains during fault recovery
    • Simpler to checkpoint than shared memory
    – Read-only nature of RDDs
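A small sketch shows how a checkpoint shortens the chain that must be replayed on recovery. It assumes a linear lineage and in-memory data. `Node`, `Stored`, `Derived`, `lineageLength`, and `checkpoint` are hypothetical names for illustration.

```scala
object CheckpointDemo {
  // A dataset is either loaded from stable storage or derived from a parent.
  sealed trait Node { def data: Vector[Int] }
  final case class Stored(data: Vector[Int]) extends Node
  final case class Derived(parent: Node, f: Int => Int) extends Node {
    def data: Vector[Int] = parent.data.map(f)
  }

  // Number of transformations that must be replayed to rebuild a node.
  def lineageLength(n: Node): Int = n match {
    case Stored(_)     => 0
    case Derived(p, _) => 1 + lineageLength(p)
  }

  // Checkpointing materializes the data, so recovery no longer needs the
  // parents. The data is read-only, so a plain copy is consistent.
  def checkpoint(n: Node): Node = Stored(n.data)

  val base: Node = Stored(Vector(1, 2, 3))
  // Ten chained transformations, each adding 1 to every record.
  val afterTenSteps: Node =
    (1 to 10).foldLeft(base)((n, _) => Derived(n, x => x + 1))
  val checkpointed: Node = checkpoint(afterTenSteps)
}
```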
    Discussions
    Checkpointing or Versioning?
    • Frequent checkpointing, or keep all versions of ranks?

  • Original post: https://www.cnblogs.com/rsapaper/p/9059148.html