  • SparkStreaming--reduceByKeyAndWindow

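    1. reduceByKeyAndWindow(_+_, Seconds(3), Seconds(2))
        The window length is Seconds(3) and the slide interval is Seconds(2): every 2 s the window slides, and each time it aggregates all the data from the preceding 3 s.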
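
        As an illustration of the call above, here is a minimal word-count sketch. It is not taken from the original post: the object name, the socket source on localhost:9999, the local[2] master and the 1 s batch interval are all assumptions made for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  // Reduce function applied to all values of one key inside a window.
  val add: (Int, Int) => Int = _ + _

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    // Assumed batch interval of 1 s; window (3 s) and slide (2 s) are multiples of it.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed input: lines of text from a local socket (e.g. started with `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Every 2 s, sum the counts per word over the last 3 s of data.
    val counts = pairs.reduceByKeyAndWindow(add, Seconds(3), Seconds(2))
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

        With this variant each window is recomputed from all of its batches, so no inverse function is needed.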

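    2. The overload reduceByKeyAndWindow(_+_, _-_, Seconds(3), Seconds(2))
         The idea: when the slide interval Seconds(2) is smaller than the window length Seconds(3), two consecutive windows overlap. In that case
         there is no need to fetch or recompute the overlapping data; the new result is derived from the previous window's result, which saves both space and repeated work and is more efficient.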
        
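         Here each window covers three time slices and slides by two, so the only part counted by both windows is the data in time slice time3. window1 and window2 are computed as follows: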
         win1 = time1 + time2 + time3
         win2 = time3 + time4 + time5
         
         This can be rewritten as
         win1 = time1 + time2 + time3
         win2 = win1 + time4 + time5 - time1 - time2
         
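         In other words, _+_ aggregates the newly arrived time slices (the RDDs in time4 and time5), while _-_ removes the contribution of the expired time slices
         (time1 and time2) of the previous window.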
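
         The window arithmetic can be checked without Spark. The sketch below uses plain Scala maps as stand-ins for the per-slice counts; the counts themselves and all names are invented for illustration. It computes the second window both directly and incrementally from the first one.

```scala
object IncrementalWindowDemo {
  type Counts = Map[String, Int]

  // Hypothetical per-slice word counts for time1..time5.
  val slices: Vector[Counts] = Vector(
    Map("a" -> 1, "b" -> 2), // time1
    Map("a" -> 3),           // time2
    Map("b" -> 1, "c" -> 4), // time3
    Map("a" -> 2),           // time4
    Map("c" -> 1, "b" -> 5)  // time5
  )

  // Merge two count maps key by key; a missing key counts as 0.
  def combine(x: Counts, y: Counts)(f: (Int, Int) => Int): Counts =
    (x.keySet ++ y.keySet).map(k => k -> f(x.getOrElse(k, 0), y.getOrElse(k, 0))).toMap

  def add(x: Counts, y: Counts): Counts = combine(x, y)(_ + _) // plays the role of _+_
  def sub(x: Counts, y: Counts): Counts = combine(x, y)(_ - _) // plays the role of _-_

  // win1 = time1 + time2 + time3
  val win1: Counts = slices.slice(0, 3).reduce((x, y) => add(x, y))
  // win2 = time3 + time4 + time5, computed from scratch
  val win2Direct: Counts = slices.slice(2, 5).reduce((x, y) => add(x, y))
  // win2 = win1 + time4 + time5 - time1 - time2, computed from the old window
  val win2Incremental: Counts =
    sub(sub(add(add(win1, slices(3)), slices(4)), slices(0)), slices(1))

  def main(args: Array[String]): Unit = {
    println(s"direct:      $win2Direct")
    println(s"incremental: $win2Incremental")
  }
}
```

         Both values hold the same counts, which is why the reduce function must have an inverse: subtracting time1 and time2 only works if their contribution can be undone exactly.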

    3. Notes
    /**
    * Return a new DStream by applying incremental `reduceByKey` over a sliding window.
    * The reduced value of over a new window is calculated using the old window's reduced value :
    * 1. reduce the new values that entered the window (e.g., adding new counts)
    *
    * 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
    *
    * This is more efficient than reduceByKeyAndWindow without "inverse reduce" function.
    * However, it is applicable to only "invertible reduce functions".
    * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
    * @param reduceFunc associative reduce function
    * @param invReduceFunc inverse reduce function
    * @param windowDuration width of the window; must be a multiple of this DStream's
    * batching interval
    * @param slideDuration sliding interval of the window (i.e., the interval after which
    * the new DStream will generate RDDs); must be a multiple of this
    * DStream's batching interval
    * @param filterFunc Optional function to filter expired key-value pairs;
    * only pairs that satisfy the function are retained
    */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        invReduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration = self.slideDuration,
        numPartitions: Int = ssc.sc.defaultParallelism,
        filterFunc: ((K, V)) => Boolean = null
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(
        reduceFunc, invReduceFunc, windowDuration,
        slideDuration, defaultPartitioner(numPartitions), filterFunc
      )
    }
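
    A possible way to call this overload is sketched below. It is an illustration rather than code from the original post: the object name, the socket source, the batch interval and the checkpoint directory are assumptions. Two points are worth noting. First, the incremental variant keeps the previous window's result, so checkpointing has to be enabled on the StreamingContext. Second, without a filterFunc, a key whose count has dropped to 0 would stay in the result, so the sketch keeps only pairs with a positive count.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object IncrementalWindowedWordCount {
  val add: (Int, Int) => Int = _ + _      // applied to values entering the window
  val subtract: (Int, Int) => Int = _ - _ // applied to values leaving the window
  // Retain only pairs that still have a positive count.
  val keepPositive: ((String, Int)) => Boolean = _._2 > 0

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("IncrementalWindowedWordCount")
    val ssc = new StreamingContext(conf, Seconds(1)) // assumed 1 s batch interval
    // Hypothetical checkpoint directory; required by the inverse-reduce variant.
    ssc.checkpoint("/tmp/spark-streaming-checkpoint")

    // Assumed input: lines of text from a local socket.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    val counts = pairs.reduceByKeyAndWindow(
      add,
      subtract,
      Seconds(3),                           // windowDuration, a multiple of the batch interval
      Seconds(2),                           // slideDuration, a multiple of the batch interval
      ssc.sparkContext.defaultParallelism,  // numPartitions
      keepPositive                          // filterFunc
    )
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```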

                                                                                                                                                                                              


         




  • Original article: https://www.cnblogs.com/zDanica/p/5471592.html