Spark Streaming

    1 Overview
    Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.
    Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
    [Figure] input data stream → Spark Streaming → batches of input data → Spark engine → batches of processed data
    Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
     
    2 A quick example
    // First, start a simple data server using Netcat
    $ nc -lk 9999
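    With the server running, the classic network word count can be expressed in a few lines. The sketch below assumes a Spark 1.x-era Scala API and a local master with two threads (one for the receiver, one for processing):

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        // Count words in text received from the Netcat server above, in 1-second batches
        val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
        val ssc  = new StreamingContext(conf, Seconds(1))

        val lines      = ssc.socketTextStream("localhost", 9999)
        val words      = lines.flatMap(_.split(" "))
        val pairs      = words.map(word => (word, 1))
        val wordCounts = pairs.reduceByKey(_ + _)

        wordCounts.print()        // print a sample of each batch's counts to the console
        ssc.start()               // start the computation
        ssc.awaitTermination()    // wait for it to terminate

    Typing lines into the nc terminal should make per-batch word counts appear in the Spark console; `pairs` and `wordCounts` are also the DStreams reused in the sketches later in this section.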
       
    3 Basic concepts
     
    3.1 Linking
     
    3.2 Initializing StreamingContext
    Step 1: Define the input sources by creating input DStreams.
    Step 2: Define the streaming computations by applying transformation and output operations to DStreams.
    Step 3: Start receiving data and processing it using streamingContext.start().
    Step 4: Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
    Step 5: The processing can be manually stopped using streamingContext.stop().
     
    Points to remember:
    • Once a context has been started, no new streaming computations can be set up or added to it.
    • Once a context has been stopped, it cannot be restarted.
    • Only one StreamingContext can be active in a JVM at the same time.
    • stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
    • A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created (see the sketch after this list).
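    A minimal sketch of the last two points, assuming the same local configuration as the quick example: one SparkContext is reused for a second StreamingContext after the first is stopped without stopping the SparkContext.

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("ReuseSparkContext"))

        val ssc1 = new StreamingContext(sc, Seconds(1))
        // ... define DStreams, call ssc1.start(), let the first computation run ...
        ssc1.stop(stopSparkContext = false)               // stop only the StreamingContext

        val ssc2 = new StreamingContext(sc, Seconds(5))   // re-use the same SparkContext
        // ... a new streaming computation can be defined here ...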
     
    3.3 Discretized Streams (DStreams)
    Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs.
     
    3.4 Input DStreams and Receivers
    Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.
    Points to remember
    • When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).

    • Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it (illustrated briefly below).
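    A brief illustration of this rule, assuming a single socket receiver as in the quick example:

        import org.apache.spark.SparkConf

        // One receiver needs at least "local[2]": one thread runs the receiver,
        // the remaining thread(s) process the received batches; "local[1]" would
        // starve processing. On a cluster, request more cores than receivers.
        val conf = new SparkConf().setMaster("local[2]").setAppName("OneReceiverApp")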

    Basic Sources (two of these are sketched just below):
    ssc.fileStream()
    ssc.queueStream()
    ssc.socketTextStream()
    ssc.actorStream()
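    A hedged sketch of two basic sources, reusing the ssc from the quick example; textFileStream is the common text-file variant of fileStream, and queueStream is mainly useful for testing. The directory path and test data below are placeholders.

        import scala.collection.mutable.Queue
        import org.apache.spark.rdd.RDD

        // Picks up new files written atomically into the monitored directory
        val fileLines = ssc.textFileStream("hdfs://namenode:8020/logs/incoming")

        // A queue-backed stream: each queued RDD becomes one batch (handy in unit tests)
        val rddQueue   = new Queue[RDD[Int]]()
        val queueInput = ssc.queueStream(rddQueue)
        rddQueue += ssc.sparkContext.makeRDD(1 to 100, numSlices = 2)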
    Advanced Sources:
    Kafka
    Flume
    Kinesis
    Twitter
    Custom Sources:
    Input DStreams can also be created out of custom data sources. All you have to do is implement a user-defined receiver (see next section to understand what that is) that can receive data from the custom sources and push it into Spark. See the Custom Receiver Guide for details.
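    A minimal sketch of such a user-defined receiver, following the pattern in the Custom Receiver Guide; the host, port, and error handling are illustrative only.

        import java.net.Socket
        import scala.io.Source
        import org.apache.spark.storage.StorageLevel
        import org.apache.spark.streaming.receiver.Receiver

        class SocketLineReceiver(host: String, port: Int)
            extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

          def onStart(): Unit = {
            // Start a background thread so onStart() returns immediately
            new Thread("Socket Line Receiver") {
              override def run(): Unit = receive()
            }.start()
          }

          def onStop(): Unit = { /* nothing to do: the thread stops when the socket closes */ }

          private def receive(): Unit = {
            try {
              val socket = new Socket(host, port)
              for (line <- Source.fromInputStream(socket.getInputStream, "UTF-8").getLines())
                store(line)                       // hand received data to Spark
              socket.close()
              restart("Trying to reconnect")      // ask Spark Streaming to restart the receiver
            } catch {
              case t: Throwable => restart("Error receiving data", t)
            }
          }
        }

        // Plug it in like any other input DStream:
        val customLines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))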
     
    3.5 Transformations on DStreams
    Transformations worth discussing in more detail:
    UpdateStateByKey Operation (sketched after the join operations below)
    Transform Operation
    Window Operation
         Any window operation needs to specify two parameters:
             window length - The duration of the window (3 in the figure).
             sliding interval - The interval at which the window operation is performed (2 in the figure).
         For example, to extend the earlier word count example by generating word counts over the last 30 seconds of data, every 10 seconds:
             // Reduce last 30 seconds of data, every 10 seconds
             val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
    Join Operations: join, leftOuterJoin, rightOuterJoin, fullOuterJoin (also included in the sketch below)
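    Hedged sketches of the updateStateByKey, transform, and join operations, building on the pairs and wordCounts DStreams from the quick example; the checkpoint directory and the blacklist contents are placeholders.

        // updateStateByKey: maintain arbitrary per-key state across batches
        ssc.checkpoint("hdfs://namenode:8020/checkpoints")   // stateful operations require checkpointing

        def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
          Some(newValues.sum + runningCount.getOrElse(0))

        val runningCounts = pairs.updateStateByKey[Int](updateCount _)

        // transform: apply arbitrary RDD-to-RDD code, e.g. join each batch with a static RDD
        val blacklist = ssc.sparkContext.parallelize(Seq(("spam", true)))
        val cleaned = wordCounts.transform { rdd =>
          rdd.leftOuterJoin(blacklist).collect { case (word, (count, None)) => (word, count) }
        }

        // join: joins two (K, V) DStreams batch-by-batch on the key
        val joined = runningCounts.join(wordCounts)    // DStream[(String, (Int, Int))]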
     
    3.6 Output Operations on DStreams
    Output operations allow a DStream’s data to be pushed out to external systems like a database or a file system. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
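    For example, print() and saveAsTextFiles() are simple output operations, while foreachRDD is the most general one. A hedged sketch, reusing wordCounts from the quick example; the output path prefix is a placeholder and real connection handling is only indicated in comments.

        // Write each batch of wordCounts as text files with the given prefix/suffix
        wordCounts.saveAsTextFiles("hdfs://namenode:8020/out/wordcounts", "txt")

        // foreachRDD: run arbitrary code against each batch's RDD
        wordCounts.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // In real code, create or reuse one external connection per partition here
            records.foreach(record => println(record))
          }
        }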
     
    3.7 DataFrame and SQL Operations
     
    3.8 MLlib Operations
     
    3.9 Caching / Persistence
     
    3.10 Checkpointing
    Checkpointing saves streaming metadata and the RDDs of stateful DStreams to fault-tolerant storage (e.g. HDFS) so the application can recover from failures. The default data checkpoint interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
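    A hedged sketch of how checkpointing is typically wired up, following the getOrCreate pattern; the checkpoint directory is a placeholder, conf is the SparkConf from the quick example, and someStatefulStream is a hypothetical DStream.

        // Build (or rebuild) the context; all DStream setup must happen inside this function
        def createContext(): StreamingContext = {
          val ssc = new StreamingContext(conf, Seconds(1))
          ssc.checkpoint("hdfs://namenode:8020/checkpoints")   // enable metadata/data checkpointing
          // ... define input DStreams and transformations here ...
          ssc
        }

        // On restart, recover from checkpoint data if present; otherwise create a fresh context
        val context = StreamingContext.getOrCreate("hdfs://namenode:8020/checkpoints", createContext _)
        context.start()
        context.awaitTermination()

        // The data checkpoint interval of an individual DStream can also be set explicitly:
        // someStatefulStream.checkpoint(Seconds(50))   // e.g. 5 - 10 sliding intervals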
    3.11 Deploying Applications
     
    3.12 Monitoring Applications
     
    4 Performance Tuning

    4.1 Reducing the Batch Processing Times
     
    4.2 Setting the Right Batch Interval
     
    4.3 Memory Tuning

     

  • Original post: https://www.cnblogs.com/sunflower627/p/4997652.html