  • spark-streaming first insight

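    I. Overview

    Spark Streaming is a stream-processing module built on top of the Spark core API. It is scalable, high-throughput, and fault-tolerant.

    1) It supports many data sources, such as Kafka, Flume, sockets, and files: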

    • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
    • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies.

    2) Processed data can be written to many destinations, such as Kafka, HDFS, and local files.

    DStream:

    Spark Streaming provides a high-level abstraction for data that flows in continuously, the DStream (discretized stream):

    It represents a continuous stream of data.

    A DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.

    Any operation applied on a DStream translates to operations on the underlying RDDs.
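
    The relationship between a DStream and its RDDs can be pictured with a small plain-Python model. This is only a sketch: it does not use the Spark API, and the names `Batch` and `map_dstream` are made up for illustration. Each inner list stands in for one RDD (the records of one batch interval), and a stream-level `map` is carried out batch by batch.

```python
from typing import Callable, List

# Each inner list stands in for one RDD: the records received in one batch interval.
Batch = List[str]


def map_dstream(batches: List[Batch], fn: Callable[[str], str]) -> List[Batch]:
    """Apply fn to every record, one batch (RDD) at a time."""
    return [[fn(record) for record in batch] for batch in batches]


# Three consecutive batch intervals; the last one received no data.
stream = [["a", "b"], ["c"], []]
upper = map_dstream(stream, str.upper)
print(upper)
```

    The result has the same number of batches as the input, which mirrors the idea that a DStream operation yields a new DStream whose RDDs line up with the source intervals.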

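    What is an RDD?

    RDD stands for Resilient Distributed Dataset. It is the most important concept in Spark.

    An RDD is a read-only, partitioned, fault-tolerant collection of data.

    Why "resilient"?

    An RDD can move freely between memory and disk.

    An RDD can be transformed into other RDDs and can be generated from other RDDs.

    An RDD can store data of any type.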
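
    The "read-only, partitioned, transformable" properties can be illustrated with a toy class. This is a plain-Python sketch under stated assumptions, not Spark code: `ToyRDD` is a hypothetical name, and it models neither distribution across machines nor fault tolerance.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Read-only collection split into partitions (tuples of tuples)."""

    partitions: Tuple[Tuple[int, ...], ...]

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never changes this object; it builds a new ToyRDD.
        return ToyRDD(tuple(tuple(fn(x) for x in p) for p in self.partitions))

    def collect(self) -> List[int]:
        # Flatten all partitions into one list, in partition order.
        return [x for p in self.partitions for x in p]


base = ToyRDD(((1, 2), (3,)))          # two partitions
doubled = base.map(lambda x: x * 2)    # a new dataset derived from base

try:
    base.partitions = ()               # attempting to modify in place
    mutated = True
except FrozenInstanceError:
    mutated = False

print(base.collect(), doubled.collect(), mutated)
```

    The original dataset is untouched after the transformation, and the partition layout carries over to the derived dataset.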

    II. Basic concepts

    1) Add the dependency

    <dependency>

    <groupId>org.apache.spark</groupId>

    <artifactId>spark-streaming_2.11</artifactId>

    <version>2.3.1</version>

    </dependency>

    To look up other related dependencies:

    https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:2.2.0

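    2) How is a file source monitored when it feeds a DStream?

    1) All files must have the same format.

    2) A file enters the stream according to its modification time, not its creation time.

    3) Once a file has been processed, changes to it within the current window are not picked up; the file is not reread.

    4) FileSystem.setTimes() can be called to change a file's timestamp so that it is processed in a later window, even if its contents have not changed.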
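
    The modification-time rule can be modelled on a local file system. The sketch below is plain Python rather than Spark or Hadoop code: `files_in_window` is a hypothetical helper, and `os.utime` is used here only as a local analogue of `FileSystem.setTimes()`.

```python
import os
import tempfile
import time
from typing import List


def files_in_window(directory: str, window_start: float, window_end: float) -> List[str]:
    """Return names of files whose modification time lies in [window_start, window_end)."""
    selected = []
    for name in sorted(os.listdir(directory)):
        mtime = os.path.getmtime(os.path.join(directory, name))
        if window_start <= mtime < window_end:
            selected.append(name)
    return selected


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "events.log")
    with open(path, "w") as f:
        f.write("hello\n")

    now = time.time()
    # Pretend the file was last modified 100 seconds ago, inside an old window.
    os.utime(path, (now - 100, now - 100))
    old_window = files_in_window(d, now - 120, now - 60)
    next_window_before = files_in_window(d, now, now + 60)

    # Move the timestamp forward without touching the content.
    os.utime(path, (now + 10, now + 10))
    next_window_after = files_in_window(d, now, now + 60)

print(old_window, next_window_before, next_window_after)
```

    Only the timestamp decides which window a file belongs to in this model, which is why resetting it makes an unchanged file eligible again.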

    III. Transform operation

    window operation

    Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data.

    Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.

    In other words, the RDDs that fall within one time window are merged into a single RDD for processing.

    Any window operation needs to specify two parameters:

    window length: The duration of the window

    sliding interval: The interval at which the window operation is performed
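
    How the two parameters interact can be shown with a plain-Python model. This is a sketch, not the Spark API: `windowed` is a hypothetical function, and both parameters are counted in whole batch intervals (Spark requires them to be multiples of the batch interval).

```python
from typing import List


def windowed(batches: List[List[int]], window_length: int, slide_interval: int) -> List[List[int]]:
    """Merge batches into windows; both parameters are counted in batch intervals."""
    windows = []
    # A window is produced every slide_interval batches and covers
    # the most recent window_length batches available at that point.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        merged = [record for batch in batches[start:end] for record in batch]
        windows.append(merged)
    return windows


batches = [[1], [2], [3], [4], [5], [6]]      # one record per batch interval
result = windowed(batches, window_length=3, slide_interval=2)
print(result)  # [[1, 2], [2, 3, 4], [4, 5, 6]]
```

    With a window length of 3 and a sliding interval of 2, consecutive windows overlap by one batch, and the first window is shorter because only two batches exist when it is produced.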

    IV. Output operation

    Using foreachRDD

    dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. 
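
    One commonly cited efficiency point is to open one connection per partition rather than one per record. The sketch below models that pattern in plain Python. It is not Spark code: `FakeConnection`, `send_partition`, and `foreach_rdd` are hypothetical stand-ins, and a real job would use the actual client library of the external system.

```python
from typing import Callable, List

Partition = List[str]
RDD = List[Partition]          # one batch, split into partitions


class FakeConnection:
    """Hypothetical stand-in for a database or message-queue client."""

    opened = 0                 # total number of connections created

    def __init__(self) -> None:
        FakeConnection.opened += 1
        self.closed = False

    def send(self, record: str, sink: List[str]) -> None:
        sink.append(record)

    def close(self) -> None:
        self.closed = True


def send_partition(partition: Partition, sink: List[str]) -> None:
    """Open one connection, push every record of the partition, then close it."""
    conn = FakeConnection()
    try:
        for record in partition:
            conn.send(record, sink)
    finally:
        conn.close()


def foreach_rdd(batches: List[RDD], handler: Callable[[RDD], None]) -> None:
    """Model of dstream.foreachRDD: call handler once per batch."""
    for rdd in batches:
        handler(rdd)


delivered: List[str] = []
batches = [[["a", "b"], ["c"]], [["d"], ["e", "f"]]]   # 2 batches, 2 partitions each
foreach_rdd(batches, lambda rdd: [send_partition(p, delivered) for p in rdd])
print(FakeConnection.opened, delivered)
```

    Six records are delivered through four connections here; opening a connection per record would have cost six, and the gap grows with partition size.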

    Checkpoint concepts

    Performance Tuning

    Fault-tolerance Semantics

  • Original article: https://www.cnblogs.com/gm-201705/p/9533271.html