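Learning Spark: Spark Streaming

Use cases

Stream computing

Stream computing vs. batch computing

Batch computing

The data already exists; all of it is read in one go and processed as a batch.

Stream computing

Data arrives continuously and is persisted after it has been processed.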

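Features

Feature	Description
`Spark Streaming` is an extension of the `Spark Core API`	`Spark Streaming` offers an `RDD`-like `API`, so it is easy to use and can share similar code with existing systems. An important consequence is that `Spark`-based machine learning and stream computing can both be applied to a stream, which makes `Spark` a one-stop platform.
`Spark Streaming` integrates well with other systems	`Spark Streaming` can read data from streams and queues such as `Kafka`, `Flume`, and `TCP` sockets. It can write the processed data to file systems and common databases.
`Spark Streaming` uses a micro-batch processing model	Micro-batching has no long-running `Operator`, which makes fault tolerance easier to design. The micro-batch model can also sidestep slow tasks through speculative execution.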

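Getting-started example

Goal

Use a Spark Streaming program to talk to a socket server: receive strings from the server in real time, split them into words, count the words, and print the word counts of each small batch.

Implementation

Step 1: Create the project

Create an IDEA Maven project (steps omitted; see how the project was set up on day one of the Spark course)
Import the Maven dependencies (omitted here; see Step 2)
Create the src/main/scala and src/test/scala folders, matching the source directories declared in the pom

Step 2: Maven dependencies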

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>sparkStreaming</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.0</spark.version>
        <slf4j.version>1.7.16</slf4j.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>


    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>

            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Step 3: Code

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @author noor9
 * @date 2021-02-01-19:55
 */
object StreamingWordCount {

  def main(args: Array[String]): Unit = {
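    // local[6]: run locally with 6 threads (a receiver-based stream needs at least 2)
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[6]")
    // Entry point of the streaming program; a new batch is generated every 5 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    ssc.sparkContext.setLogLevel("WARN")

    // Start a socket receiver that connects to a TCP port as a client.
    // Replace the placeholder hostname with the host that runs Netcat.
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream(
      hostname = "xxx.xxx.xxx.xxxx",
      port = 9999,
      // Keep received data in memory and on disk, in serialized form
      storageLevel = StorageLevel.MEMORY_AND_DISK_SER
    )

    // Split each line into words, then count each word within the batch
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    // Output operation: print the first ten elements of every batch
    wordCounts.print()

    // Start receiving and processing data
    ssc.start()
    // Block the main thread until the computation is stopped
    ssc.awaitTermination()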
  }

}

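Step 4: Start Netcat

Start a Netcat server on the host and port that the program connects to, for example with `nc -lk 9999`, then type a few lines of words into it.

Result

Every 5 seconds the console prints the word counts for the lines that arrived during that batch.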

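Notes on the code in Step 3:

- new StreamingContext(...): in Spark an XXContext is usually the entry point, and Streaming is no exception, so creating a StreamingContext creates the entry point
- ssc.socketTextStream(...): starts a socket Receiver that connects to a TCP port as a socket client and fetches data
- storageLevel: chooses how the Receiver stores the data it has fetched; here it is kept both in memory and on disk, in serialized form
- wordCounts.print(): similar to an Action on an RDD; it performs the final output and collection of the data
- ssc.start(): starts the stream and the JobGenerator, so stream processing begins
- ssc.awaitTermination(): blocks the main thread while background threads keep fetching and processing data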

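Points to note

Spark Streaming does not truly process records one at a time

Spark Streaming's processing model is called mini-batch: it collects the data that arrives during a fixed interval, turns it into an RDD, and then applies transformations to that RDD. This shows up in two places:
- The console prints results batch by batch, and the word counts are likewise computed per batch
- How often is an RDD generated and counted? The second argument of new StreamingContext(sparkConf, Seconds(5)) sets the batch interval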
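
To make the mini-batch idea concrete, here is a plain-Scala sketch that involves no Spark at all. It assumes a made-up list of timestamped lines and a 5-second batch interval; the names `Event`, `toBatches`, and `countWords` are hypothetical and exist only for this illustration.

```scala
// Plain-Scala illustration of micro-batching; no Spark required.
// Event, toBatches and countWords are hypothetical names for this sketch.
object MiniBatchSketch {

  // A line of text together with its arrival time in milliseconds.
  final case class Event(arrivalMs: Long, line: String)

  // Group events into consecutive batches of `intervalMs` milliseconds,
  // keyed by batch index and returned in batch order.
  def toBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.arrivalMs / intervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batch, es) => (batch, es.sortBy(_.arrivalMs).map(_.line)) }

  // Word count over a single batch, mirroring flatMap -> map -> reduceByKey.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(1000L, "hello spark"),
      Event(4000L, "hello streaming"),
      Event(6000L, "hello again")
    )
    // With a 5-second interval the first two lines share a batch.
    toBatches(events, intervalMs = 5000L).foreach { case (batch, lines) =>
      println(s"batch $batch: ${countWords(lines)}")
    }
  }
}
```

In this sketch the word hello is counted twice in the first batch and once in the second; the two counts are never merged, which is the same behaviour the console output of the streaming program shows.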
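
Spark Streaming needs at least two threads

When starting the program with spark-submit, do not give it a single thread (for example a master of local or local[1]):
- The main thread is blocked, waiting for the program to run
- A background thread has to be started to receive data, so at least one more thread is needed to process it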
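
The thread requirement can be guarded in code. The following is only a sketch: `localMaster` is a hypothetical helper, not a Spark API, and the application name is reused from the example above. With spark-submit the master is usually passed on the command line instead, for example `--master local[2]`.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MasterThreadsSketch {

  // Build a local master URL with at least two threads: one for the
  // receiver and at least one for processing the batches.
  // `localMaster` is a hypothetical helper, not part of Spark.
  def localMaster(threads: Int): String = {
    require(threads >= 2, "a receiver-based stream needs at least 2 local threads")
    s"local[$threads]"
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("NetworkWordCount")
      .setMaster(localMaster(2)) // "local" or "local[1]" would leave no thread for processing
    val ssc = new StreamingContext(conf, Seconds(5))
    // Input streams and output operations would be declared here, before ssc.start().
    ssc.stop()
  }
}
```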

Creating a StreamingContext

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))

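StreamingContext is the entry point of a Spark Streaming program
Two arguments must be given when creating a StreamingContext: a SparkConf, and the interval at which the stream generates RDDs
StreamingContext provides the following functionality
- Creating DStreams: a stream can be created by reading from Kafka, from socket messages, from local files, and so on, and it serves as the InputDStream of the whole DAG
- An RDD executes when it meets an Action, but a DStream does not; a DStream only starts receiving and processing data after StreamingContext.start() is called
- StreamingContext.awaitTermination() waits for the processing to be terminated
- StreamingContext.stop() stops the processing manually
Keep the following in mind when using it
- Only one StreamingContext can be active in a Streaming program (one JVM) at a time
- Once a context has been started (start), no new data sources can be added to it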
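
The lifecycle described above could look roughly like this. It is a sketch under assumptions: the host localhost, the port 9999, and the 30-second timeout are placeholder values, and `run` is a hypothetical wrapper. If nothing is listening on the port, the receiver logs connection errors and keeps retrying while the context stays alive.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ContextLifecycleSketch {

  // Starts a stream that counts lines per batch, waits up to `timeoutMs`,
  // then stops it. Returns true if the context had already stopped
  // before the timeout elapsed.
  def run(host: String, port: Int, timeoutMs: Long): Boolean = {
    val conf = new SparkConf().setAppName("LifecycleSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("WARN")

    // Every source and operation must be declared before start().
    val lines = ssc.socketTextStream(host, port)
    lines.count().print()

    // Receiving and processing begin here, not when print() was declared.
    ssc.start()

    // Like awaitTermination(), but gives up after timeoutMs milliseconds.
    val stopped = ssc.awaitTerminationOrTimeout(timeoutMs)
    if (!stopped) {
      // Stop the SparkContext as well, letting received data finish processing.
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
    stopped
  }

  def main(args: Array[String]): Unit = {
    run("localhost", 9999, 30000L) // placeholder host, port and timeout
  }
}
```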

Operators

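These operators resemble their RDD counterparts and likewise produce a new DStream
Every operator is ultimately applied to each RDD that the DStream generates

Operator	Explanation
`flatMap`	`lines.flatMap(_.split(" "))`: transforms each element into zero or more elements, using the function passed in
`map`	`words.map(x => (x, 1))`: transforms the data one-to-one
`reduceByKey`	`words.map(x => (x, 1)).reduceByKey(_ + _)`: take special care with this operator, because the aggregation covers only the data of one batch, not the whole stream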
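
The per-batch behaviour of these operators can be observed without a socket by feeding the stream from a queue of RDDs. This is a sketch rather than the author's setup: the two queued RDDs are invented sample data standing in for two batches, and `run` is a hypothetical wrapper that collects each batch's result on the driver.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PerBatchOperatorsSketch {

  // Runs two sample batches through flatMap -> map -> reduceByKey and
  // returns the word counts of every non-empty batch, in batch order.
  def run(): Seq[Map[String, Int]] = {
    val conf = new SparkConf().setAppName("PerBatchOperators").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("WARN")

    // Two pre-built RDDs stand in for two batches of socket input.
    val batches = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("hello spark", "hello streaming")),
      ssc.sparkContext.parallelize(Seq("hello again"))
    )
    // oneAtATime = true: each batch interval consumes exactly one queued RDD.
    val lines = ssc.queueStream(batches, oneAtATime = true)

    val words = lines.flatMap(_.split(" "))   // one line -> many words
    val pairs = words.map(word => (word, 1))  // one word -> one pair
    val counts = pairs.reduceByKey(_ + _)     // aggregated within one batch only

    // foreachRDD exposes the RDD that each batch produced; it runs on the driver.
    val results = mutable.ArrayBuffer.empty[Map[String, Int]]
    counts.foreachRDD { rdd =>
      val batch = rdd.collect().toMap
      if (batch.nonEmpty) results.synchronized { results += batch }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000L)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    results.synchronized { results.toList }
  }

  def main(args: Array[String]): Unit =
    run().zipWithIndex.foreach { case (batch, i) => println(s"batch $i: $batch") }
}
```

The first batch reports hello with a count of 2 and the second batch reports it with a count of 1, because `reduceByKey` never sees data from another batch.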

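How Spark Streaming works

Static DAG
Dynamic splitting
Data ingestion
Fault tolerance

About receivers

Receivers are partitioned (sharded)

A receiver can run in any executor

A receiver is a component dedicated to receiving data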
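
To show what such a component looks like, here is a minimal custom receiver built on Spark's `Receiver` base class. It is a sketch under assumptions: `ConstantLineReceiver` is a hypothetical class that simply emits the same line once per second, and the storage level is copied from the word-count example.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A hypothetical receiver that emits the same line once per second.
class ConstantLineReceiver(line: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  // Called on the executor when the receiver starts. It must not block,
  // so the receiving loop runs on its own thread.
  override def onStart(): Unit = {
    val worker = new Thread("constant-line-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(line) // hand one record over to Spark
          Thread.sleep(1000L)
        }
      }
    }
    worker.setDaemon(true)
    worker.start()
  }

  // Nothing to clean up: the loop exits once isStopped() returns true.
  override def onStop(): Unit = ()
}
```

A stream backed by this receiver could then be created with `ssc.receiverStream(new ConstantLineReceiver("hello spark"))` and processed with the same operators as the socket stream.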

Original article: https://www.cnblogs.com/xp-thebest/p/14592567.html