  • Spark Big Data Platform

    Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

    BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.

    Berkeley Data Analytics Stack

    Vision of Spark

    Spark components vs. Hadoop components
    Spark Core <------> Apache Hadoop MapReduce
    Spark Streaming <------> Apache Storm
    Spark SQL <------> Apache Hive
    Spark GraphX <------> MPI (Taobao)
    Spark MLlib <------> Apache Mahout

    BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars.
    BlinkDB rests on two key ideas:

    • An adaptive optimization framework that builds and maintains a set of multi-dimensional samples from original data over time.
    • A dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy and/or response time requirements.
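
    The idea of answering from a sample and attaching an error bar can be sketched in plain Scala. This is only a toy model under simplifying assumptions, not BlinkDB's implementation: it draws a uniform random sample at query time (BlinkDB maintains precomputed samples), uses a normal-approximation 95% interval, and the names ApproxQuery, Estimate, approxMean and answer are made up for illustration.

```scala
import scala.util.Random

object ApproxQuery {
  // An approximate answer plus the half-width of its ~95% confidence interval.
  final case class Estimate(value: Double, errorBar: Double)

  // Approximate AVG over `data` from a uniform random sample drawn with replacement.
  def approxMean(data: IndexedSeq[Double], sampleSize: Int, rng: Random): Estimate = {
    require(data.nonEmpty && sampleSize > 1, "need data and at least two sampled rows")
    val sample = IndexedSeq.fill(sampleSize)(data(rng.nextInt(data.length)))
    val mean = sample.sum / sampleSize
    val variance = sample.map(x => (x - mean) * (x - mean)).sum / (sampleSize - 1)
    // 1.96 standard errors: normal approximation of a 95% interval (an assumption).
    Estimate(mean, 1.96 * math.sqrt(variance / sampleSize))
  }

  // Hypothetical sample selection: try the candidate sizes from smallest to largest
  // and stop at the first one whose error bar meets the accuracy requirement.
  def answer(data: IndexedSeq[Double], sampleSizes: Seq[Int], maxError: Double, rng: Random): Estimate = {
    require(sampleSizes.nonEmpty, "need at least one candidate sample size")
    val sorted = sampleSizes.sorted
    sorted.iterator
      .map(n => approxMean(data, n, rng))
      .find(_.errorBar <= maxError)
      // No size was accurate enough: fall back to the largest sample.
      .getOrElse(approxMean(data, sorted.last, rng))
  }
}
```

    A larger sample gives a tighter error bar at the cost of touching more rows, which is the accuracy versus response time trade-off described above.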

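    Why Spark is fast:

    • In-memory computing: intermediate datasets can be cached in memory and reused across operations instead of being written to and reread from disk.
    • Directed Acyclic Graph (DAG) execution engine: the scheduler sees the whole computation graph before anything runs, so it can optimize the plan as a whole (for example, by pipelining consecutive narrow transformations into a single stage).
    • Delay scheduling: a task may wait briefly for a free slot on a node that already holds its input data, which improves data locality.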
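
    Delay scheduling can be sketched as a small decision function. This is a simplified toy, not Spark's scheduler: the names DelayScheduling, Task and pick are hypothetical, and the single wait threshold is an assumption made to keep the example short.

```scala
object DelayScheduling {
  // A pending task and the nodes that already hold its input data.
  final case class Task(id: Int, preferredNodes: Set[String])

  // Decide which task, if any, to launch on a node that has just offered a free slot.
  def pick(pending: Seq[Task], freeNode: String, waitedMs: Long, maxWaitMs: Long): Option[Task] =
    pending.find(_.preferredNodes.contains(freeNode)) match {
      // A data-local task exists: launch it right away.
      case local @ Some(_) => local
      // Nothing is local, but we have waited long enough: give up on locality.
      case None if waitedMs >= maxWaitMs => pending.headOption
      // Nothing is local yet: skip this offer and wait for a better one.
      case None => None
    }
}
```

    Skipping an offer costs a short delay, but reading input from the local node is usually much cheaper than fetching it over the network.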

    Resilient Distributed Dataset (RDD)

    Internally, each RDD is characterized by five main properties:

    • A list of partitions
    • A function for computing each split
    • A list of dependencies on other RDDs
    • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
    • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
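
    The five properties above map naturally onto an interface. The sketch below is a toy model in plain Scala with no Spark dependency. MiniRDD, ParallelCollection and Mapped are illustrative names, not Spark's classes, and the round-robin split is an assumption.

```scala
object RddModel {
  final case class Partition(index: Int)

  trait Partitioner {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Hash partitioning: non-negative hash code modulo the partition count.
  final class HashPartitioner(val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = key match {
      case null => 0
      case _ =>
        val mod = key.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod
    }
  }

  trait MiniRDD[T] {
    def partitions: Seq[Partition]                               // 1. list of partitions
    def compute(split: Partition): Iterator[T]                   // 2. function computing each split
    def dependencies: Seq[MiniRDD[_]]                            // 3. dependencies on other RDDs
    def partitioner: Option[Partitioner] = None                  // 4. optional partitioner
    def preferredLocations(split: Partition): Seq[String] = Nil  // 5. optional preferred locations

    // Toy action: compute every partition in turn and concatenate the results.
    def collect(): Seq[T] = partitions.flatMap(p => compute(p))
  }

  // A source dataset: elements are dealt round-robin over numSlices partitions.
  final class ParallelCollection[T](data: Seq[T], numSlices: Int) extends MiniRDD[T] {
    val partitions: Seq[Partition] = (0 until numSlices).map(Partition(_))
    def compute(split: Partition): Iterator[T] =
      data.iterator.zipWithIndex.collect { case (x, i) if i % numSlices == split.index => x }
    val dependencies: Seq[MiniRDD[_]] = Nil
  }

  // A derived dataset: same partitions as its parent, computed by mapping over them.
  final class Mapped[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
    def partitions: Seq[Partition] = parent.partitions
    def compute(split: Partition): Iterator[U] = parent.compute(split).map(f)
    def dependencies: Seq[MiniRDD[_]] = Seq(parent)
  }
}
```

    Because a derived dataset only stores its parent and a function, a lost partition can be rebuilt by calling compute again, which is where the "resilient" in the name comes from.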

    Storage Strategy

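    // Flags describing where and how an RDD's partitions are stored.
    class StorageLevel private(
        private var useDisk_ : Boolean,      // partitions may be written to disk
        private var useMemory_ : Boolean,    // partitions may be kept in memory
        private var deserialized_ : Boolean, // kept as deserialized objects rather than serialized bytes
        private var replication_ : Int = 1)  // number of replicas of each partition

    object StorageLevel {
        // The constructor is private, so the predefined levels live in the companion object.
        val MEMORY_ONLY = new StorageLevel(false, true, true)
    }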
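
    To make the flag combinations concrete, here is a small standalone sketch. Level and describe are hypothetical stand-ins written for this note. The flag values follow the predefined levels in the early Spark releases this snippet comes from, and later versions add further fields.

```scala
object StorageLevels {
  // Toy stand-in for StorageLevel with the same four fields as the class above.
  final case class Level(useDisk: Boolean, useMemory: Boolean, deserialized: Boolean, replication: Int = 1)

  val DISK_ONLY       = Level(true, false, false)
  val MEMORY_ONLY     = Level(false, true, true)
  val MEMORY_ONLY_SER = Level(false, true, false)
  val MEMORY_AND_DISK = Level(true, true, true)
  val MEMORY_ONLY_2   = Level(false, true, true, 2)

  // Human-readable summary of what a combination of flags means.
  def describe(l: Level): String = {
    val where = (l.useMemory, l.useDisk) match {
      case (true, true)   => "memory, spilling to disk"
      case (true, false)  => "memory only"
      case (false, true)  => "disk only"
      case (false, false) => "not stored"
    }
    val form = if (l.deserialized) "deserialized objects" else "serialized bytes"
    s"$where, $form, ${l.replication}x replicated"
  }
}
```

    Serialized levels trade CPU time for a smaller memory footprint, and a replication factor of 2 lets a lost partition be read from another node instead of being recomputed.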
    

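    RDDs, transformations & actions

    Transformations (for example map and filter) are evaluated lazily: they only record how to derive a new RDD from its parents.
    Actions (for example count and collect) trigger the actual computation and return a result to the driver program.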
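
    Lazy evaluation is easy to see in a toy model. The sketch below is plain Scala, not the Spark API: LazyDataset and fromSeq are made-up names, and the only point is that transformations build up a recipe while actions run it.

```scala
object LazyDemo {
  // A dataset described by a recipe for producing its elements, not by the elements themselves.
  final class LazyDataset[T](source: () => Iterator[T]) {
    // Transformations: wrap the recipe in a new one and return immediately.
    def map[U](f: T => U): LazyDataset[U] = new LazyDataset(() => source().map(f))
    def filter(p: T => Boolean): LazyDataset[T] = new LazyDataset(() => source().filter(p))

    // Actions: run the recipe and materialize a result.
    def collect(): List[T] = source().toList
    def count(): Long = source().size.toLong
  }

  object LazyDataset {
    def fromSeq[T](xs: Seq[T]): LazyDataset[T] = new LazyDataset(() => xs.iterator)
  }
}
```

    For example, calling map and filter on such a dataset evaluates neither function; the work happens once, in a single pass, when collect() is called.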

  • Original article: https://www.cnblogs.com/rainbow203/p/Spark-da-shu-ju-ping-tai.html