2. The Window API and Aggregate Functions

1. The Window API

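In practice, Keyed Windows are used most often, so the Keyed Window operators are the ones to master. Each window operator is made up of a Window Assigner, a Window Trigger, an Evictor (removes elements), Lateness (the allowed-lateness setting), an Output Tag, and a Window Function. Of these, the Window Assigner and the Window Function must be specified for every window operator; the others are optional and chosen according to the use case.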
stream.keyBy(...)                  // a keyed data stream
  .window(...)                     // window assigner type
  [.trigger(...)]                  // trigger type (optional)
  [.evictor(...)]                  // evictor (optional)
  [.allowedLateness(...)]          // whether to handle late data (optional)
  [.sideOutputLateData(...)]       // Output Tag for late data (optional)
  .reduce/aggregate/fold/apply()   // window function
  [.getSideOutput(...)]            // emit data by tag (optional)
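- Window Assigner: specifies the window type and defines how elements of the stream are assigned to one or more windows;
- Window Trigger: specifies when a window fires, that is, which condition must hold before the computation is triggered;
- Evictor: removes elements from the window;
- allowedLateness: marks whether late data is processed, that is, whether an element that arrives late for a window still triggers a computation;
- Output Tag: tags the output so that the tagged data can later be emitted through getSideOutput;
- Window Function: defines the processing logic applied to the data in the window, for example a sum.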
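To make the assigner's role concrete, here is a minimal plain-Scala sketch (no Flink dependency) of how a timestamp can be mapped to window start times. The object and method names are hypothetical, the sketch assumes non-negative millisecond timestamps and no window offset, and it only illustrates the idea of "one or more windows"; it is not Flink's implementation.

```scala
// Hypothetical sketch: mapping an element's timestamp to window start times.
object WindowAssignSketch {
  // Tumbling windows: each timestamp falls into exactly one window.
  def tumblingStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Sliding windows: a timestamp may fall into several overlapping windows.
  def slidingStarts(ts: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = ts - (ts % slide)
    // Step back by `slide` while the window [start, start + size) still covers ts.
    Iterator.iterate(lastStart)(_ - slide).takeWhile(start => start > ts - size).toList
  }

  def main(args: Array[String]): Unit = {
    println(tumblingStart(7000L, 5000L))        // prints 5000
    println(slidingStarts(7000L, 5000L, 3000L)) // prints List(6000, 3000)
  }
}
```

With a 5 s window sliding every 3 s, the element at 7000 ms belongs to the windows starting at 6000 ms and 3000 ms, which is why a sliding assigner can place one element in more than one window.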
2. Window Aggregate Functions

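Once the Window Assigner is defined, the next step is to define the computation applied to the data inside each window, which is the Window Function. Flink provides four kinds of Window Function: ReduceFunction, AggregateFunction, ProcessWindowFunction, and the built-in shortcut aggregations such as sum and max.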

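The first three kinds fall into two categories according to how they compute:
- incremental aggregate functions: ReduceFunction and AggregateFunction;
- full-window functions: ProcessWindowFunction (and WindowFunction).
Incremental aggregate functions perform better and use less storage, because they work on intermediate state: the window keeps only the intermediate result and does not need to buffer the raw elements. Full-window functions are more expensive and slower, because the operator must buffer every element that belongs to the window and only computes over all of the raw data when the window fires.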
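The difference in stored state can be illustrated with a small plain-Scala sketch. Both classes below are hypothetical stand-ins for window state rather than Flink classes: one keeps a single running value, the other buffers every element until the window fires.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch comparing the state kept by the two styles of window function.
object IncrementalVsFullSketch {
  // Incremental style: only one running value is stored.
  final class IncrementalCount {
    private var count: Long = 0L
    def add(element: String): Unit = { count += 1 }
    def result: Long = count
    def storedValues: Int = 1
  }

  // Full-window style: every element is buffered until the window fires.
  final class BufferedCount {
    private val buffer = ArrayBuffer.empty[String]
    def add(element: String): Unit = { buffer += element }
    def result: Long = buffer.size.toLong // computed only when the window fires
    def storedValues: Int = buffer.size
  }

  def main(args: Array[String]): Unit = {
    val logs = Seq("station_1", "station_1", "station_1", "station_1")
    val incremental = new IncrementalCount
    val full = new BufferedCount
    logs.foreach { log => incremental.add(log); full.add(log) }
    println(s"incremental: result=${incremental.result}, stored=${incremental.storedValues}")
    println(s"full-window: result=${full.result}, stored=${full.storedValues}")
  }
}
```

Both variants return the same count, but the buffered variant's state grows with the number of elements in the window.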

1) ReduceFunction

A ReduceFunction defines how two input elements of the same type are combined by a given computation, producing a single result element of that same type.
// Count the logs of each base station every 5 seconds
data.map(stationLog => (stationLog.sid, 1))
  .keyBy(_._1)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  .reduce((v1, v2) => (v1._1, v1._2 + v2._2))
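To see what the reduce lambda does to the contents of one window, the following plain-Scala sketch applies the same function to an in-memory sequence. The object name, the grouping helper, and the sample data are hypothetical, and grouping with groupBy only approximates what keyBy plus a window would do in Flink.

```scala
// Hypothetical sketch: the reduce lambda applied to the elements of one window.
object ReduceSketch {
  // Same shape as the lambda above: two (sid, count) pairs in, one pair out.
  val reduceFn: ((String, Int), (String, Int)) => (String, Int) =
    (v1, v2) => (v1._1, v1._2 + v2._2)

  // Simulates one window: group by key, then combine each group pairwise.
  def reduceWindow(elements: Seq[(String, Int)]): Map[String, (String, Int)] =
    elements.groupBy(_._1).map { case (sid, values) => sid -> values.reduce(reduceFn) }

  def main(args: Array[String]): Unit = {
    val window = Seq(("station_1", 1), ("station_2", 1), ("station_1", 1))
    reduceWindow(window).values.foreach(println)
  }
}
```

Input and output have the same type, which is the constraint that distinguishes ReduceFunction from the more general AggregateFunction.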
2) AggregateFunction

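Like ReduceFunction, AggregateFunction is an incremental function that computes its result from intermediate state, but it is more general for window computations. The AggregateFunction interface is more flexible than ReduceFunction and correspondingly more complex to implement. It defines four methods to override: createAccumulator() creates the initial accumulator, add() defines how an element is added to the accumulator, getResult() computes the result from the accumulator, and merge() defines how two accumulators are combined.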

Example:
import com.it.flink.source.StationLog
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
 * Every 3 seconds, count the logs of each base station over the last 5 seconds
 */
object AggregateFunctionByWindow {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val stream = env.socketTextStream("node1", 8888)
      .map(line => {
        val arr: Array[String] = line.split(",")
        StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
      })

    stream.map(log => (log.sid, 1))
      .keyBy(_._1)
      // Open a sliding window: size 5 s, slide 3 s
      .window(SlidingProcessingTimeWindows.of(Time.seconds(5), Time.seconds(3)))
      // .timeWindow(Time.seconds(5), Time.seconds(3))
      .aggregate(new MyAggregateFunction, new MyWindowFunction)

    env.execute()
  }
}

/**
 * add() is called once for every incoming element
 */
class MyAggregateFunction extends AggregateFunction[(String, Int), Long, Long] {
  // Initialize the accumulator; it starts at 0
  override def createAccumulator(): Long = 0

  override def add(in: (String, Int), acc: Long): Long = {
    in._2 + acc
  }

  override def getResult(acc: Long): Long = acc

  override def merge(acc: Long, acc1: Long): Long = acc + acc1
}

/**
 * The input of this WindowFunction is the output of the AggregateFunction
 */
class MyWindowFunction extends WindowFunction[Long, (String, Long), String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[(String, Long)]): Unit = {
    // The iterator holds exactly one value, so next() returns the aggregated count
    out.collect((key, input.iterator.next()))
  }
}
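The counting example uses Long for both the accumulator and the result, so it does not show why AggregateFunction is more general than ReduceFunction. The sketch below computes an average, where the input, accumulator, and output types all differ. To keep it runnable without Flink, it uses a hypothetical local trait, SimpleAggregate, that mirrors createAccumulator, add, getResult, and merge; in a real job the class would extend Flink's AggregateFunction instead. The duration field in the sample data is also an assumption made for illustration.

```scala
// Hypothetical sketch: an average needs an accumulator type that differs from input and output.
object AverageAggregateSketch {
  // Local stand-in mirroring the four methods of the AggregateFunction interface.
  trait SimpleAggregate[IN, ACC, OUT] {
    def createAccumulator(): ACC
    def add(in: IN, acc: ACC): ACC
    def getResult(acc: ACC): OUT
    def merge(a: ACC, b: ACC): ACC
  }

  // Input: (sid, durationSeconds); accumulator: (sum, count); output: average duration.
  object AvgDuration extends SimpleAggregate[(String, Long), (Long, Long), Double] {
    def createAccumulator(): (Long, Long) = (0L, 0L)
    def add(in: (String, Long), acc: (Long, Long)): (Long, Long) = (acc._1 + in._2, acc._2 + 1)
    def getResult(acc: (Long, Long)): Double =
      if (acc._2 == 0) 0.0 else acc._1.toDouble / acc._2
    def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) = (a._1 + b._1, a._2 + b._2)
  }

  def main(args: Array[String]): Unit = {
    val part1 = Seq(("station_1", 10L), ("station_1", 20L))
    val part2 = Seq(("station_1", 60L))
    // Build two partial accumulators, then merge them before reading the result.
    val acc1 = part1.foldLeft(AvgDuration.createAccumulator())((acc, in) => AvgDuration.add(in, acc))
    val acc2 = part2.foldLeft(AvgDuration.createAccumulator())((acc, in) => AvgDuration.add(in, acc))
    println(AvgDuration.getResult(AvgDuration.merge(acc1, acc2))) // prints 30.0
  }
}
```

Only the (sum, count) pair is stored per key, so the computation stays incremental even though the final result is a Double.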
3) ProcessWindowFunction

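ReduceFunction and AggregateFunction, described above, are window functions that compute incrementally from intermediate state. They cover the vast majority of use cases, but some more complex metrics depend on all of the elements in the window, or need access to the window's state and metadata. In those cases ProcessWindowFunction is required: it supports computing a result over the full set of window elements more flexibly. Sorting the whole window to take the TopN, for example, has to be done with ProcessWindowFunction.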
// Count the logs of each base station every 5 seconds
data.map(stationLog => (stationLog.sid, 1))
  .keyBy(_._1)
  .timeWindow(Time.seconds(5))
  .process(new ProcessWindowFunction[(String, Int), (String, Int), String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
      println("-------")
      // All elements of the window are available here, so the count is simply their number
      out.collect((key, elements.size))
    }
  })
  .print()
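The TopN case mentioned above can be sketched in plain Scala. The function below stands in for the body of a full-window function: it receives every buffered element of one window and sorts them. The object name and the sample data are hypothetical, and no Flink classes are involved.

```scala
// Hypothetical sketch: TopN over all elements of one window, as a full-window function would see them.
object TopNSketch {
  // Sort the buffered (sid, count) pairs by count, descending, and keep the first n.
  def topN(elements: Iterable[(String, Long)], n: Int): List[(String, Long)] =
    elements.toList.sortBy { case (_, count) => -count }.take(n)

  def main(args: Array[String]): Unit = {
    val windowContents = List(
      ("station_1", 5L), ("station_2", 9L), ("station_3", 2L), ("station_4", 7L))
    println(topN(windowContents, 2)) // prints List((station_2,9), (station_4,7))
  }
}
```

Sorting needs all elements at once, which is why this kind of computation cannot be expressed as a pairwise incremental reduce without buffering.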

Original article: https://www.cnblogs.com/yj2434/p/14059227.html
