zoukankan      html  css  js  c++  java
  • Spark开发的完整基础_欢乐的马小纪

     map是对每个元素操作, mapPartitions是对其中的每个partition操作

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    mapPartitionsWithIndex : 把每个partition中的分区号和对应的值拿出来, 看源码

    val func = (index: Int, iter: Iterator[(Int)]) => {

      iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator

    }

    val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 2)

    rdd1.mapPartitionsWithIndex(func).collect

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    aggregate

    def func1(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {

      iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator

    }

    val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 2)

    rdd1.mapPartitionsWithIndex(func1).collect

    ###是action操作,柯理化 第一个参数是初始值, 二:是2个函数[每个函数都是2个参数(第一个参数:先对个个分区进行合并, 第二个:对个个分区合并后的结果再进行合并), 输出一个参数]

    ###0 + (0+1+2+3+4   +   0+5+6+7+8+9)

    rdd1.aggregate(0)(_+_, _+_)

    rdd1.aggregate(0)(math.max(_, _), _ + _)

    ###5和1比, 得5再和234比得5 --> 5和6789比,得9 --> 5 + (5+9)

    rdd1.aggregate(5)(math.max(_, _), _ + _)

    val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),2)

    def func2(index: Int, iter: Iterator[(String)]) : Iterator[String] = {

      iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator

    }

    rdd2.aggregate("")(_ + _, _ + _)

    rdd2.aggregate("=")(_ + _, _ + _)

    val rdd3 = sc.parallelize(List("12","23","345","4567"),2)

    rdd3.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)

    val rdd4 = sc.parallelize(List("12","23","345",""),2)

    rdd4.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)

    两个分区

    1.("","12","23")->("0","23")->("1")

    2. ("","345","")  ->("0","")  ->("0")

    val rdd5 = sc.parallelize(List("12","23","","345"),2)

    rdd5.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)

    两个分区

    1.("","12","23")->("0","23")->("1")

    2. ("","","345")  ->("1","")  ->("1")

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    aggregateByKey  和 reduceByKey基本一样,区别是它同于combiner

    val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

    def func2(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {

      iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator

    }

    pairRDD.mapPartitionsWithIndex(func2).collect

    pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect

    pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    checkpoint

    sc.setCheckpointDir("hdfs://node-1.itcast.cn:9000/ck")

    val rdd = sc.textFile("hdfs://node-1.itcast.cn:9000/wc").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)

    rdd.checkpoint

    rdd.isCheckpointed

    rdd.count

    rdd.isCheckpointed

    rdd.getCheckpointFile

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    coalesce, repartition

    val rdd1 = sc.parallelize(1 to 10, 10)

    val rdd2 = rdd1.coalesce(2, false)

    rdd2.partitions.length

    coalesce等同于repartition,第二个参数指的是否进行shuffle,

    repartition方法就是调用coalesce方法,-----repartition(a)等同于coalesce(a,true)

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    collectAsMap : Map(b -> 2, a -> 1)

    val rdd = sc.parallelize(List(("a", 1), ("b", 2)))

    rdd.collectAsMap

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    combineByKey : 和reduceByKey是相同的效果

    ###第一个参数x:原封不动取出来, 第二个参数:是函数, 局部运算, 第三个:是函数, 对局部运算后的结果再做运算

    ###每个分区中每个key中value中的第一个值, (hello,1)(hello,1)(good,1)-->(hello(1,1),good(1))-->x就相当于hello的第一个1, good中的1

    val rdd1 = sc.textFile("hdfs://master:9000/wordcount/input/").flatMap(_.split(" ")).map((_, 1))

    val rdd2 = rdd1.combineByKey(x => x, (a: Int, b: Int) => a + b, (m: Int, n: Int) => m + n)

    rdd1.collect

    rdd2.collect

    ###当input下有3个文件时(有3个block块, 不是有3个文件就有3个block, ), 每个会多加3个10

    val rdd3 = rdd1.combineByKey(x => x + 10, (a: Int, b: Int) => a + b, (m: Int, n: Int) => m + n)

    rdd3.collect

    val rdd4 = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)

    val rdd5 = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)

    val rdd6 = rdd5.zip(rdd4)

    val rdd7 = rdd6.combineByKey(List(_), (x: List[String], y: String) => x :+ y, (m: List[String], n: List[String]) => m ++ n)

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    countByKey 

    val rdd1 = sc.parallelize(List(("a", 1), ("b", 2), ("b", 2), ("c", 2), ("c", 1)))

    rdd1.countByKey--------------Map(a->1,b->2,c->2)

    rdd1.countByValue------------Map(("a", 1)->1,("b", 2)->2,("c", 2)->1,("c", 1)->1)

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    filterByRange

    val rdd1 = sc.parallelize(List(("e", 5), ("c", 3), ("d", 4), ("c", 2), ("a", 1)))

    val rdd2 = rdd1.filterByRange("b", "d")

    rdd2.collect

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    flatMapValues  :  Array((a,1), (a,2), (b,3), (b,4))

    val rdd3 = sc.parallelize(List(("a", "1 2"), ("b", "3 4")))

    val rdd4 = rdd3.flatMapValues(_.split(" "))--------------------------Array((a,1), (a,2), (b,3), (b,4))

    rdd4.collect

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    foldByKey 

    val rdd1 = sc.parallelize(List("dog", "wolf", "cat", "bear"), 2)

    val rdd2 = rdd1.map(x => (x.length, x))

    val rdd3 = rdd2.foldByKey("")(_+_)---------------((3,dogcat),(4,wolf,bear))

    val rdd = sc.textFile("hdfs://node-1.itcast.cn:9000/wc").flatMap(_.split(" ")).map((_, 1))

    rdd.foldByKey(0)(_+_)

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    foreachPartition  action操作,虽然不能返回RDD,但是可以在里面对分区进行操作

    val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)

    rdd1.foreachPartition(x => println(x.reduce(_ + _)))  

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    keyBy : 以传入的参数做key

    val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)

    val rdd2 = rdd1.keyBy(_.length)

    val rdd2 = rdd1.keyBy(_(0))

    rdd2.collect

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

    keys values

    val rdd1 = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)

    val rdd2 = rdd1.map(x => (x.length, x))

    rdd2.keys.collect

    rdd2.values.collect

    -------------------------------------------------------------------------------------------

    -------------------------------------------------------------------------------------------

  • 相关阅读:
    [转] Vue + Webpack 组件式开发(练习环境)
    [转] 从零构建 vue2 + vue-router + vuex 开发环境到入门,实现基本的登录退出功能
    [转] Redux入门教程(快速上手)
    [转] 前端数据驱动的价值
    [转] React风格的企业前端技术
    [转] 对Array.prototype.slice.call()方法的理解
    [转] webpack之plugin内部运行机制
    [转] 静态资源的分布对网站加载速度的影响/浏览器对同一域名下并发加载资源数量
    Mysql 版本号、存储引擎、索引查询
    linux 查看CPU、内存、磁盘信息命令
  • 原文地址:https://www.cnblogs.com/makailong/p/9934228.html
Copyright © 2011-2022 走看看