  • RDD Basics



    A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.





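    2. Users can specify the number of partitions when creating an RDD. If no number is specified, the default is used, which is the number of CPU cores allocated to the application. For HDFS files, each block is assigned one partition. A child RDD derived from a parent RDD has the same number of partitions as its parent, unless a different number is specified explicitly in the transformation.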
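    The partitioning rules above can be sketched as follows. This is a minimal illustration that assumes a local master with two threads; the object name `PartitionCountSketch` and the helper `partitionCounts` are made up for this example and are not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  // Hypothetical helper: returns the partition counts of
  // (default RDD, explicit RDD, child RDD, repartitioned RDD).
  def partitionCounts(sc: SparkContext): (Int, Int, Int, Int) = {
    // No partition count given: parallelize falls back to sc.defaultParallelism.
    val byDefault = sc.parallelize(1 to 100)
    // Partition count given explicitly at creation time.
    val explicit = sc.parallelize(1 to 100, 4)
    // map is a narrow transformation, so the child keeps the parent's partition count.
    val child = explicit.map(_ * 2)
    // A transformation can also set the partition count explicitly.
    val repartitioned = explicit.repartition(8)
    (byDefault.partitions.length, explicit.partitions.length,
      child.partitions.length, repartitioned.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, purely for illustration.
    val conf = new SparkConf().setAppName("partition-count-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (byDefault, explicit, child, repartitioned) = partitionCounts(sc)
      println(s"default=$byDefault explicit=$explicit child=$child repartitioned=$repartitioned")
    } finally {
      sc.stop()
    }
  }
}
```

    In local mode the default count follows the number of threads given to the master, so the first value will vary with the environment.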




    val rdd = sc.parallelize(List(1,2,3,4))



    For example, local files, HDFS, HBase, and so on; the textFile method is commonly used:

    val rdd = sc.textFile("hdfs:///tmp/myfile.txt")
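    A self-contained variant of the textFile call can be tried without an HDFS cluster. The sketch below assumes local mode and uses a temporary local file as a stand-in for the HDFS path; the object name `TextFileSketch` and the helper `readLines` are illustrative names only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  // Hypothetical helper: loads a text file and returns its lines to the driver.
  def readLines(sc: SparkContext, uri: String): Array[String] = {
    // textFile yields an RDD[String] with one element per line of the file.
    val lines = sc.textFile(uri)
    // collect is only safe here because the file is tiny.
    lines.collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, purely for illustration.
    val conf = new SparkConf().setAppName("textfile-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // A temp file stands in for a real path such as hdfs:///tmp/myfile.txt.
      val path = Files.createTempFile("myfile", ".txt")
      Files.write(path, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8))
      // toUri gives a file:// URI, which textFile accepts like an hdfs:// one.
      readLines(sc, path.toUri.toString).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```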


    An RDD supports two kinds of operations: transformations and actions.
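    The difference between the two can be illustrated with a short sketch. Transformations such as filter and map return a new RDD and are evaluated lazily, while an action such as reduce triggers the computation and returns a value to the driver. The object name `TransformationActionSketch` and the helper `sumOfDoubledEvens` are hypothetical, and a local master is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationActionSketch {
  // Hypothetical helper: keeps the even numbers, doubles them, and sums the result.
  def sumOfDoubledEvens(sc: SparkContext): Int = {
    val rdd = sc.parallelize(List(1, 2, 3, 4))
    // Transformations: each returns a new RDD; nothing is computed yet.
    val evens = rdd.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)
    // Action: runs the job and returns the result to the driver.
    doubled.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, purely for illustration.
    val conf = new SparkConf().setAppName("transformation-action-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(sumOfDoubledEvens(sc)) // prints 12 (2*2 + 4*2)
    } finally {
      sc.stop()
    }
  }
}
```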






    MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.
    MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on disk, and read them from there when they’re needed.
    MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
    MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing them on the fly each time they’re needed.
    DISK_ONLY: Store the RDD partitions only on disk.
    MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes.
    OFF_HEAP: Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory. If you plan to use Tachyon as the off heap store, Spark is compatible with Tachyon out-of-the-box. Please refer to this page for the suggested version pairings.
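    A storage level from the list above is chosen by passing it to persist. The following is a minimal sketch assuming local mode; the object name `PersistSketch` and the helper `persistAndCount` are made up for illustration. Note that persist only marks the RDD for caching; the partitions are stored when the first action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Hypothetical helper: persists an RDD, runs two actions on it, and
  // returns (element count, sum, storage level that was in effect).
  def persistAndCount(sc: SparkContext): (Long, Long, StorageLevel) = {
    val rdd = sc.parallelize(1 to 1000).map(_.toLong * 2)
    // Request MEMORY_AND_DISK; calling cache() instead would use MEMORY_ONLY.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    // The first action computes the partitions and stores them.
    val count = rdd.count()
    // A later action can read the stored partitions instead of recomputing them.
    val sum = rdd.reduce(_ + _)
    val level = rdd.getStorageLevel
    // Release the cached partitions once they are no longer needed.
    rdd.unpersist()
    (count, sum, level)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, purely for illustration.
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (count, sum, level) = persistAndCount(sc)
      println(s"count=$count sum=$sum level=${level.description}")
    } finally {
      sc.stop()
    }
  }
}
```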




  • Original article: https://www.cnblogs.com/itboys/p/6673290.html