Spark Streaming in Practice

I have recently been studying Spark, focusing on Spark Streaming and Spark MLlib.

For Spark setup, see: http://www.powerxing.com/spark-quick-start-guide/

This blog post gives a thorough overview: http://www.liuhaihua.cn/archives/134765.html

Spark Streaming:

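Spark Streaming is the distributed stream processing framework within Spark. It can process a data stream at batch intervals as short as 500 ms, with a latency of roughly 1 s, which makes it a near-real-time stream processing framework.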

Spark Streaming can be combined with Spark SQL, MLlib, and GraphX to build complex systems on top of real-time processing.

How Spark Streaming works:

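As the figure above shows, Spark Streaming splits the input data into segments by time. Each segment corresponds to one Spark job, and the processed results are then returned in sequence, like flowing water.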
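
The time-slicing idea can be sketched in plain Scala without any Spark dependency. This is only an illustration of the concept, not Spark's internal implementation; `Event`, `toBatches`, and `batchIntervalMs` are hypothetical names introduced here.

```scala
// A plain-Scala sketch of micro-batching; it does not use Spark.
// Event, toBatches and batchIntervalMs are hypothetical names for illustration.
object MicroBatchSketch {
  final case class Event(timestampMs: Long, payload: String)

  // Assign each event to the batch whose time interval contains its timestamp,
  // then return the batches ordered by time.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batchIndex, es) =>
        (batchIndex, es.sortBy(_.timestampMs).map(_.payload))
      }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(900, "b"), Event(1200, "c"), Event(2500, "d"))
    // With a 1000 ms interval: batch 0 holds a and b, batch 1 holds c, batch 2 holds d.
    toBatches(events, 1000L).foreach { case (i, payloads) =>
      println(s"batch $i -> ${payloads.mkString(", ")}")
    }
  }
}
```

In Spark itself, the batch interval is the second argument to `StreamingContext`, for example `Seconds(1)` in the official example further down.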

DStream:

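A DStream is Spark Streaming's abstraction of a continuous real-time data stream: each real-time stream we process corresponds to one DStream instance in Spark Streaming. Put simply, a DStream is a sequence of RDDs.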

The Spark Streaming programming model:

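The DStream (Discretized Stream) is the basic abstraction of Spark Streaming and represents a continuous stream of data. A DStream can be obtained either from an external input source or by applying a transformation to an existing DStream. Internally, a DStream is represented by a series of RDDs that are consecutive in time, and each RDD holds the data of its own time interval, as shown in the figure below: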

Operations on a DStream are likewise mapped onto its underlying RDDs. As the figure below shows, a transformation produces a new DStream:
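
One way to picture this mapping is to model a stream as a time-ordered sequence of batches, where each batch stands in for one RDD. The sketch below is plain Scala under that assumption; `Batch`, `Stream`, and `flatMapStream` are hypothetical names, not Spark types.

```scala
// A plain-Scala model of "a DStream is a sequence of RDDs"; it does not use Spark.
object DStreamModelSketch {
  // Hypothetical stand-ins: a Batch plays the role of one RDD,
  // and a Stream is the time-ordered sequence of batches.
  type Batch[A] = Seq[A]
  type Stream[A] = Seq[Batch[A]]

  // A stream-level flatMap is the same flatMap applied to every batch,
  // which yields a new stream with one output batch per input batch.
  def flatMapStream[A, B](stream: Stream[A])(f: A => Seq[B]): Stream[B] =
    stream.map(batch => batch.flatMap(f))

  def main(args: Array[String]): Unit = {
    val lines: Stream[String] = Seq(Seq("hello spark", "hello"), Seq("streaming"))
    val words = flatMapStream(lines)(line => line.split(" ").toSeq)
    words.zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: ${batch.mkString(", ")}")
    }
  }
}
```

The corresponding call in the official example is `lines.flatMap(_.split(" "))`, which Spark applies to the RDD of every batch interval.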

Three usage scenarios for Spark Streaming:

1. Stateless operations

2. Stateful operations (updateStateByKey)

3. Window operations

Each is described below.

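Stateless operations: each computation covers only the contents of the current time slice. For example, each run processes only the RDD produced within the last 1 s.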
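
A minimal plain-Scala sketch of the stateless case, assuming each batch is simply a sequence of words: every batch is counted on its own and nothing carries over to the next one. `countBatch` is a hypothetical helper, not a Spark API.

```scala
// Stateless word count in plain Scala; it does not use Spark.
object StatelessSketch {
  // Count words within a single batch. Nothing is remembered between calls.
  def countBatch(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a", "b", "a"), Seq("a"))
    // Each batch yields its own result; the second batch knows nothing about the first.
    batches.map(countBatch).foreach(println)
  }
}
```

This mirrors what `words.map(x => (x, 1)).reduceByKey(_ + _)` does in the official example: the counts printed for one batch interval are independent of earlier intervals.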

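Stateful operations: the current computation is continually accumulated with the RDDs of earlier time slices. For example, to count how many times a word has appeared overall, the current state must be added to the historical state, and the amount of data grows over time.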
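
The accumulation can be sketched as follows. `updateCount` has the shape of the update function that `updateStateByKey` accepts, `(Seq[V], Option[S]) => Option[S]`. The `runningCounts` helper is a hypothetical plain-Scala stand-in for what Spark does across batches, included only so that the sketch runs without Spark.

```scala
// Stateful word count in plain Scala; it does not use Spark.
object StatefulSketch {
  // Same shape as the update function passed to updateStateByKey:
  // the new values for a key in this batch, plus the previous state if any.
  def updateCount(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(previous.getOrElse(0) + newValues.sum)

  // Hypothetical helper: fold the batches in time order, carrying the state along.
  def runningCounts(batches: Seq[Seq[String]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
      // Turn the batch into (word -> Seq(1, 1, ...)) pairs.
      val ones = batch.groupBy(identity).map { case (w, ws) => w -> ws.map(_ => 1) }
      // Update every key that has either old state or new values.
      val keys = state.keySet ++ ones.keySet
      keys.flatMap { k =>
        updateCount(ones.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
      }.toMap
    }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a", "b", "a"), Seq("a"))
    println(runningCounts(batches))
  }
}
```

With Spark, the same function would be used as `pairs.updateStateByKey(updateCount _)` on a DStream of `(String, Int)` pairs (here `pairs` is an assumed name), and a checkpoint directory has to be set with `ssc.checkpoint(...)` before stateful operations can run.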

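Window-based operations: sliding operations over a specific time span, performed at a specific interval. For example, every 10 seconds, compute statistics over the data that arrived in the past 30 seconds.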

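As the figure above shows, the red box represents one window, which contains 3 time units, and the window slides every 2 time units. A window-based operation therefore needs two parameters: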
- window length - The duration of the window (3 in the figure)
- slide interval - The interval at which the window-based operation is performed (2 in the figure).
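
The two parameters can be illustrated with a plain-Scala sketch in which both are counted in batches rather than seconds. `windows` is a hypothetical helper, and emitting only full windows is a simplification made here.

```scala
// Sliding windows over batches in plain Scala; it does not use Spark.
object WindowSketch {
  // Group batches into windows of windowLength batches, advancing by
  // slideInterval batches each time. Only full windows are kept.
  def windows[A](batches: Seq[A], windowLength: Int, slideInterval: Int): Seq[Seq[A]] =
    batches
      .sliding(windowLength, slideInterval)
      .filter(_.size == windowLength)
      .toSeq

  def main(args: Array[String]): Unit = {
    // Five batches, each represented by its record count.
    val recordsPerBatch = Seq(4, 1, 3, 2, 5)
    // Window length 3 and slide interval 2, as in the figure:
    // the windows cover batches 1 to 3 and batches 3 to 5.
    windows(recordsPerBatch, 3, 2).foreach { w =>
      println(s"window ${w.mkString("[", ", ", "]")} total = ${w.sum}")
    }
  }
}
```

In Spark, both durations must be multiples of the batch interval. The "every 10 seconds over the past 30 seconds" example would correspond to a call such as `pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))`, where `pairs` is an assumed DStream of `(String, Int)` pairs.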

Hands-on programming:

The official WordCount example:
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 *
 * Usage: NetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
 */
object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
First, run:
nc -lk 9999
Then open another terminal window and, from the Spark directory, run:
./bin/run-example streaming.NetworkWordCount localhost 9999

Original article: https://www.cnblogs.com/missmzt/p/5916731.html