  • Spark Terminology

    1.resilient distributed dataset (RDD)

    The core programming abstraction in Spark, consisting of a fault-tolerant collection of elements that can be operated on in parallel.

    2.partition

    A subset of the elements in an RDD. Partitions define the unit of parallelism:

    Spark processes the elements within a partition in sequence, and processes multiple partitions in parallel.

    When Spark reads a file from HDFS, it creates one partition per input split.

    For an uncompressed text file this generally means one partition per HDFS block, although partition boundaries follow line boundaries rather than falling exactly on block boundaries.

    Compressed text files are not splittable, so a compressed file yields a single partition for the entire file.
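    The partitioning model above can be sketched in plain Python (an illustration only, not the Spark API): each partition is a contiguous chunk whose elements are processed in order, while the chunks themselves are independent units that could run in parallel.

```python
# Illustration of partitions (not the Spark API): a dataset split into
# chunks; elements within a chunk are processed in sequence, and the
# chunks are independent units that Spark could process in parallel.
def make_partitions(elements, num_partitions):
    """Split a list into roughly equal contiguous chunks."""
    size = -(-len(elements) // num_partitions)  # ceiling division
    return [elements[i:i + size] for i in range(0, len(elements), size)]

def process_partition(partition):
    """A per-partition computation, applied element by element."""
    return [x * x for x in partition]

data = list(range(10))
partitions = make_partitions(data, 3)      # [[0,1,2,3], [4,5,6,7], [8,9]]
results = [process_partition(p) for p in partitions]
squared = [x for part in results for x in part]
```

    Real Spark decides partition counts from the input (e.g. HDFS splits) or from parameters such as the `numSlices` argument to `parallelize`; the chunking rule here is only for illustration.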

    3.application

    A job, a sequence of jobs, a long-running service issuing new commands as needed, or an interactive exploration session.

    4.application JAR

    A JAR containing a Spark application. In some cases you can use an "uber" JAR containing your application along with its dependencies.

    The JAR should never include Hadoop or Spark libraries; these are added at runtime.

    5.cluster manager

    An external service for acquiring resources on the cluster: Spark Standalone or YARN.

    6.job

    A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action.

    7.task

    A unit of work on a single partition of a distributed dataset. A task is not the same as a stage: a stage is a set of tasks that run the same computation on different partitions.

    8.driver

    Process that represents the application session.

    The driver is responsible for converting the application into a directed graph of individual steps to execute on the cluster.

    There is one driver per application.

    9.executor

    A process that serves a Spark application.

    An executor runs multiple tasks over its lifetime, and can run multiple tasks concurrently.

    A host may run several Spark executors, and a single application typically has executors running on many hosts.
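    The relationship between partitions, tasks, and executors can be modeled in plain Python (a conceptual sketch, not the Spark API): a thread pool stands in for an executor's task slots, with one task per partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual sketch (not the Spark API): each task processes one
# partition, and an "executor" with several task slots runs tasks
# concurrently, reusing the slots across its lifetime.
partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

def run_task(partition):
    """One task: sum the elements of a single partition."""
    return sum(partition)

# A pool with 2 workers stands in for an executor with 2 task slots:
# up to 2 tasks run at once, so 3 tasks take two "waves".
with ThreadPoolExecutor(max_workers=2) as executor_slots:
    partial_sums = list(executor_slots.map(run_task, partitions))

total = sum(partial_sums)  # the driver would combine task results
```

    In real Spark the analogous knob is the number of cores per executor, and task results flow back to the driver (or into a shuffle) rather than into a local list.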

    10.deploy mode

    Identifies where the driver process runs.

    In client mode, the submitter launches the driver outside of the cluster.

    In cluster mode, the framework launches the driver inside the cluster.

    Client mode is simpler, but in cluster mode you can log out after starting a Spark application without terminating it.
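    Deploy mode is selected with the `--deploy-mode` flag of `spark-submit`; the main class and JAR names below are placeholders, and this is a config fragment rather than a runnable script.

```shell
# Cluster mode: the driver runs inside the cluster (e.g. in a YARN
# container), so logging out of the submitting machine does not
# terminate the application. com.example.MyApp / my-app.jar are
# placeholder names.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar

# Client mode (the default): the driver runs on the submitting machine.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  my-app.jar
```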

    11.Spark Standalone

    A model of running Spark applications in which a Master daemon coordinates the efforts of Worker daemons, which run the executors.

    12.Spark on YARN

    A model of running Spark applications in which the YARN ResourceManager performs the functions of the Spark Master.

    The functions of the Workers are performed by the YARN NodeManagers, which run the executors.

    13.ApplicationMaster

    A YARN role responsible for negotiating resource requests made by the driver and finding a set of containers in which to run the Spark application.

    There is one ApplicationMaster per application.

  • Original post: https://www.cnblogs.com/liugh/p/6958528.html