Distributed Computing Study Notes
1. Overview
1.1 Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit log service.
About Kafka:
1. A distributed message publish/subscribe system;
2. Designed to handle real-time streaming data;
3. Originally developed at LinkedIn, now part of Apache;
4. Does not follow the JMS standard and does not use the JMS API;
5. Stores the updates of each partition in topics (Kafka maintains feeds of messages in topics).
Kafka is a piece of messaging middleware with the following characteristics:
A. Focuses on high throughput rather than on other features;
B. Targets real-time scenarios;
C. The state of message consumption is maintained on the consumer side, not by the Kafka server;
D. Distributed: producers, brokers, and consumers all run across multiple machines.
1.1.1 Overview
Apache Kafka™ is a distributed streaming platform. What exactly does that mean?
We think of a streaming platform as having three key capabilities:
- It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
- It lets you store streams of records in a fault-tolerant way.
- It lets you process streams of records as they occur.
What is Kafka good for?
It gets used for two broad classes of application:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
First a few concepts:
- Kafka is run as a cluster on one or more servers.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
Kafka has four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions.
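To make the Producer API above concrete, here is a minimal sketch of publishing records with the Kafka Java client; the broker address, topic name, key, and value are placeholders, not taken from these notes. With the default partitioner, records sharing a key land in the same partition.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") determines the partition via the default partitioner,
            // so records with the same key stay in one partition and keep their order.
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /index.html"));
        }
    }
}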
1.1.2 Topics and Logs
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
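As a sketch of consumer-controlled offsets: the snippet below (hypothetical topic and group names, using the newer Java client's Duration-based poll) assigns itself one partition, rewinds to offset 0, and re-reads records from the past.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "replay-demo");                // consumer group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("page-views", 0);
            // Assign a specific partition so we can control its offset directly.
            consumer.assign(Collections.singletonList(partition));
            // Rewind: the offset is held by the consumer, so we can jump anywhere in the log.
            consumer.seek(partition, 0L);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}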
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
1.1.3 Distribution
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
1.1.4 Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
1.1.5 Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
1.2 Storm
Storm is a real-time processing system; its typical usage pattern is consumers pulling data for computation. Storm ships with storm-kafka, so it can read data directly from Kafka via Kafka's low-level API.
Storm provides the following:
1. At-least-once message processing;
2. Fault tolerance;
3. Horizontal scalability;
4. No intermediate queues;
5. Lower operational overhead;
6. Doing exactly the right job ("just works");
7. Coverage of a wide range of use cases, such as stream processing, continuous computation, and distributed RPC.
Storm architecture diagram
A Storm topology of work tasks:
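As a concrete illustration of a Storm topology, here is a minimal sketch that wires a spout to a bolt and runs it in-process. It assumes a Storm 1.x/2.x classpath (package org.apache.storm; older releases used backtype.storm) and uses the TestWordSpout that ships with Storm as a stand-in data source.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class WordTopology {

    // A trivial bolt that just prints every tuple it receives.
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("received: " + tuple.getString(0));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt emits nothing downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // TestWordSpout ships with Storm and emits random words; it stands in for a real source.
        builder.setSpout("words", new TestWordSpout(), 1);
        builder.setBolt("printer", new PrinterBolt(), 2).shuffleGrouping("words");

        // Run inside the JVM for demonstration; a real deployment would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}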
1.2.1 Task Scheduling and Load Balancing
- Nimbus calls a worker that is available for work a worker-slot.
- Nimbus is the control core of the whole cluster. It is responsible for topology submission, runtime status monitoring, load balancing, task re-assignment, and so on. The assignments Nimbus produces include the path of the topology code (local to Nimbus) plus the task, executor, and worker information. A worker is uniquely identified by node + port.
- The supervisor is responsible for actually synchronizing workers; one supervisor is referred to as one node. "Synchronizing workers" means responding to Nimbus's scheduling and assignments by creating, scheduling, and destroying workers. The supervisor downloads the topology code from Nimbus to its local machine in order to schedule tasks.
- The assignment information contains the mapping from tasks to workers (task -> node + port), so a worker node can use it to determine which remote machines to communicate with.
1.2.2 Worker
A worker process executes a subset of a topology; it may run one or more executors for one or more components.
1.2.3 Executor
A thread spawned by a worker process.
1.2.4 Task
The unit that does the actual data processing.
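The relationship between workers, executors, and tasks shows up directly in how a topology's parallelism is configured. In the sketch below (made-up numbers, reusing the PrinterBolt from the previous sketch), the topology requests 2 worker processes; the bolt gets 4 executors (threads) running 8 tasks, i.e. 2 task instances per executor thread.

import org.apache.storm.Config;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // 2 executors (threads) for the spout, one task each by default.
        builder.setSpout("words", new TestWordSpout(), 2);
        // 4 executors for the bolt, but 8 tasks: each executor thread runs 2 task instances.
        builder.setBolt("printer", new WordTopology.PrinterBolt(), 4)   // bolt from the previous sketch
               .setNumTasks(8)
               .shuffleGrouping("words");

        Config conf = new Config();
        // 2 worker processes (JVMs); the 6 executors above are spread across them.
        conf.setNumWorkers(2);
        // builder.createTopology() plus conf would then be handed to StormSubmitter.
    }
}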
1.3 Flume
Flume is a distributed, reliable, and highly available system from Cloudera for collecting, aggregating, and transporting large volumes of log data. It supports customizable data senders in the logging system for collecting data, and it can also perform simple processing on the data and write it to a variety of data sinks.
1.4 Hadoop
Hadoop is a distributed system infrastructure developed by the Apache Software Foundation.
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and suits applications with very large data sets. HDFS relaxes some POSIX requirements and allows streaming access to the data in the file system.
The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive data sets, while MapReduce provides computation over them.
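As a sketch of the MapReduce model just mentioned, the classic word count: the map phase emits (word, 1) pairs and the reduce phase sums them. The class names are illustrative; a Job driver (not shown) would point them at input and output paths on HDFS.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
    // A Job driver would register these classes and set the HDFS input and output paths.
}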
1.5 Spark
Apache Spark™ is a fast and general engine for large-scale data processing.
Spark Streaming is part of Spark's core API. It supports high-throughput, fault-tolerant processing of real-time data streams. It can ingest data from sources such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, apply simple API operations such as map, reduce, join, and window, and also feed the data directly into the built-in machine learning and graph algorithm libraries.
Its workflow is as illustrated in the figure: after receiving real-time data, it splits the data into batches and hands them to the Spark engine, which produces the result for each batch. The stream abstraction it supports is called a DStream, with direct support for Kafka and Flume sources; a DStream is a continuous sequence of RDDs.
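A minimal sketch of that flow with the Spark 2.x-style Java API: a DStream is created from a TCP socket source (host and port are placeholders), cut into 1-second batches, and reduced per batch. The same pattern applies to Kafka or Flume sources through their connector classes.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");
        // Each 1-second slice of input becomes one RDD inside the DStream.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words =
                lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
                words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey((a, b) -> a + b);

        counts.print();          // print the word counts of each batch
        jssc.start();
        jssc.awaitTermination();
    }
}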
1.5.1 Shark
Shark (Hive on Spark): Shark essentially provides the same HiveQL command interface as Hive, built on top of the Spark framework. To maximize compatibility with Hive, Shark reuses Hive's APIs for query parsing and logical plan generation, and only the physical plan execution stage replaces Hadoop MapReduce with Spark. Through configuration, Shark can automatically cache specific RDDs in memory for data reuse, speeding up retrieval of particular data sets. Shark also supports user-defined functions (UDFs) for implementing specific analysis and learning algorithms, so SQL querying and analytical computation can be combined and RDD reuse maximized.
1.5.2 Spark Streaming
Spark Streaming: a framework built on Spark for processing stream data. The basic idea is to split the stream into small time slices (a few seconds) and process these small chunks in a batch-like manner. Spark Streaming is built on Spark for two reasons: Spark's low-latency execution engine (latencies on the order of 100 ms and up) can be used for near-real-time computation, and compared with record-at-a-time frameworks such as Storm, RDD data sets make efficient fault tolerance easier to implement. In addition, the micro-batch approach is compatible with both batch and real-time processing logic and algorithms, which is convenient for applications that need to combine historical and real-time data analysis.
1.5.3 Bagel
Bagel: Pregel on Spark. Bagel enables graph computation on Spark and is a very useful small project. It ships with an example implementing Google's PageRank algorithm.
1.6 Solr
Solr is a standalone enterprise-grade search server that exposes a web-service-like API. Users can submit XML files in a prescribed format to the search server over HTTP to build indexes, and issue search requests via HTTP GET, receiving results in XML.
Solr is a high-performance full-text search server written in Java 5 and built on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and provides a complete administration interface. It is an excellent full-text search engine.
1.7 MongoDB
MongoDB is a database based on distributed file storage, written in C++. It aims to provide scalable, high-performance data storage for web applications.
MongoDB sits between relational and non-relational databases; among non-relational databases it is the most feature-rich and the closest to a relational database. The data structures it supports are very loose, using a JSON-like BSON format, so it can store fairly complex data types. Mongo's biggest strength is its very powerful query language, whose syntax somewhat resembles an object-oriented query language; it can express most of what a single-table query in a relational database can, and it also supports indexes on the data.
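A small sketch of the document model and query style described above, using the MongoDB Java driver (3.7+ client API); the database, collection, and field names are invented for illustration.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class MongoExample {
    public static void main(String[] args) {
        // Placeholder connection string.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                    client.getDatabase("demo").getCollection("users");

            // Documents are schemaless, JSON-like (stored as BSON), and can nest structures.
            users.insertOne(new Document("name", "alice")
                    .append("age", 30)
                    .append("tags", java.util.Arrays.asList("admin", "beta")));

            // Queries resemble single-table relational queries; an index on "age"
            // (if created) would be used automatically here.
            for (Document doc : users.find(Filters.gt("age", 25))) {
                System.out.println(doc.toJson());
            }
        }
    }
}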
1.8 Mesos
Mesos is an open source distributed resource management framework under Apache, often described as the kernel of a distributed system. Mesos was originally developed at UC Berkeley's AMPLab and later came into wide use at Twitter.
Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with API’s for resource management and scheduling across entire datacenter and cloud environments.
1.9 HBase
HBase is a distributed, column-oriented open source database. The technology comes from Fay Chang's Google paper "Bigtable: A Distributed Storage System for Structured Data". Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of Apache's Hadoop project. Unlike typical relational databases, HBase is suited to storing unstructured data, and it uses a column-based rather than row-based model.
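To illustrate the column-oriented model, a sketch with the HBase Java client: a value is addressed by row key, column family, and column qualifier. The table and column names are made up, and hbase-site.xml is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("pages"))) {

            // Write one cell: row "row1", column family "info", qualifier "title".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("title"), Bytes.toBytes("Hello HBase"));
            table.put(put);

            // Read the same cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("title"));
            System.out.println(Bytes.toString(value));
        }
    }
}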
1.10 Cassandra
Cassandra's data model is a four- or five-dimensional model based on column families. It borrows the data structures and features of Amazon's Dynamo and Google's Bigtable, and stores data using Memtables and SSTables. Before data is written in Cassandra, it is first recorded in a commit log (CommitLog); the data is then written to the Memtable of the corresponding column family. A Memtable is an in-memory structure that keeps data sorted by key; when certain conditions are met, the Memtable's contents are flushed to disk in bulk and stored as SSTables.
1.11 ZooKeeper
ZooKeeper is a distributed, open source coordination service for distributed applications. It is an open source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services to distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.
ZooKeeper's goal is to encapsulate complex, error-prone critical services and offer users a simple, easy-to-use interface backed by an efficient, stable system.
ZooKeeper is based on the Fast Paxos algorithm. Plain Paxos can livelock: when multiple proposers submit proposals in an interleaved fashion, they may block one another so that no proposer ever succeeds. Fast Paxos optimizes this by electing a leader, and only the leader may submit proposals.
ZooKeeper's basic operating flow:
1. Elect a Leader;
2. Synchronize data;
3. The Leader must hold the highest execution ID, analogous to root privileges;
4. A majority of the machines in the cluster respond to and follow the elected Leader.
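A minimal sketch of using ZooKeeper as the kind of configuration/coordination store described above: create a persistent znode and read it back while registering a watch. The connection string, path, and data are placeholders.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder connection string; session timeout 5 s.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a piece of configuration in a persistent znode.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "threads=8".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back; passing "true" registers the default watcher for change notifications.
        byte[] data = zk.getData("/app-config", true, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}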
1.12 YARN
YARN is a completely rewritten Hadoop cluster architecture. It is the next-generation Hadoop resource manager: with YARN, users can run and manage multiple kinds of jobs on the same physical cluster, for example MapReduce batch jobs and graph processing jobs. This not only consolidates the number of systems an organization has to manage, but also allows different kinds of analysis to be performed on the same data.
Compared with the classic MapReduce engine of the first version of Hadoop, YARN offers clear advantages in scalability, efficiency, and flexibility. Both small and large Hadoop clusters benefit greatly from YARN. For end users (developers rather than administrators) the changes are almost invisible, because unmodified MapReduce jobs can still be run with the same MapReduce APIs and CLI.
2. Architecture
2.1 Spark
Spark is a general-purpose parallel computing framework in the style of Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Spark implements distributed computation based on the map-reduce paradigm and shares the advantages of Hadoop MapReduce; unlike MapReduce, however, intermediate job output and results can be kept in memory, so reading and writing HDFS is no longer necessary. Spark is therefore better suited to iterative map-reduce algorithms such as those used in data mining and machine learning.
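A sketch of why keeping data in memory helps iterative jobs: the RDD below is cached once and reused on every iteration instead of being re-read from HDFS. The input path is a placeholder, and the gradient step is only a toy stand-in for a real machine learning algorithm.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("IterativeExample").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Parse the input once and keep it in memory across iterations.
            JavaRDD<Double> values = sc.textFile("hdfs:///data/values.txt")   // placeholder path
                    .map(Double::parseDouble)
                    .cache();

            double estimate = 0.0;
            for (int i = 0; i < 10; i++) {
                // Each iteration reuses the cached RDD instead of re-reading HDFS,
                // which is where Spark's advantage over classic MapReduce comes from.
                final double current = estimate;
                double gradient = values.map(v -> current - v).reduce((a, b) -> a + b) / values.count();
                estimate = current - 0.1 * gradient;   // toy gradient step toward the mean of the data
            }
            System.out.println("final estimate: " + estimate);
        }
    }
}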
2.2 Spark Streaming
Spark Streaming decomposes stream computation into a series of short batch jobs. The batch engine is Spark itself: Spark Streaming's input data is split into segments according to the batch size (for example, 1 second), forming a Discretized Stream; each segment is turned into an RDD (Resilient Distributed Dataset) in Spark, and transformations on the DStream in Spark Streaming become transformations on RDDs in Spark, with the intermediate results kept in memory. Depending on business needs, the streaming computation can accumulate the intermediate results or write them to external storage.
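A sketch of the batching just described, in the Spark 2.x-style Java API: each batch (here, each sliding window) of the DStream is exposed as an ordinary RDD via foreachRDD, so normal Spark operations, or writes to external storage, can be applied per batch. Host, port, and durations are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WindowedStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("WindowedStream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        // Aggregate over a sliding window: 30 s of data, recomputed every 10 s.
        JavaDStream<String> windowed = lines.window(Durations.seconds(30), Durations.seconds(10));

        // Each evaluation of the window hands us a plain RDD, so ordinary Spark
        // operations (or writes to external storage) can be applied per batch.
        windowed.foreachRDD(rdd -> System.out.println("events in window: " + rdd.count()));

        jssc.start();
        jssc.awaitTermination();
    }
}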
2.3 Flume + Kafka + Storm
Messages enter the Kafka message broker in various ways. For example, Flume can be used to collect log data, which is then routed and buffered in Kafka and finally analyzed in real time by a Storm program. In that case we need a Storm spout that reads the messages from Kafka and hands them to the concrete bolt components for analysis and processing.
In the Flume + Kafka + Storm pipeline, Flume acts as the message producer: the data it produces (log data, business request data, and so on) is published to Kafka; Storm topologies then act as the message consumers, subscribing to the data and processing it in the Storm cluster. A sketch of the wiring follows the notes below.
There are two processing approaches:
1. Use a Storm topology to analyze and process the data in real time;
2. Integrate Storm with HDFS and write the processed messages to HDFS for offline analysis.
The Kafka cluster must ensure that each broker's ID is unique across the whole cluster.
The Storm cluster also depends on a ZooKeeper cluster, so the ZooKeeper cluster must be kept running properly.
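A sketch of wiring this pipeline together with the storm-kafka module: a KafkaSpout reads the topic that Flume publishes to, and a bolt processes the messages. The ZooKeeper addresses, topic, and component names are placeholders, and the classes shown are from the older storm-kafka module under the Storm 1.x packages; the newer storm-kafka-client module uses a different API.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class LogPipelineTopology {

    // Stand-in for the real analysis bolt (or an HDFS-writing bolt for the offline path).
    public static class LogBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("log line from Kafka: " + tuple.getString(0));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        // The spout locates Kafka brokers through the same ZooKeeper ensemble the cluster uses.
        ZkHosts zkHosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");             // placeholder
        SpoutConfig spoutConfig =
                new SpoutConfig(zkHosts, "flume-logs", "/kafka-offsets", "log-pipeline");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());        // messages as UTF-8 strings

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 3);
        builder.setBolt("log-bolt", new LogBolt(), 6).shuffleGrouping("kafka-spout");

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("log-pipeline", conf, builder.createTopology());
    }
}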
2.4 Kafka + Storm Use Cases
1. Scaling now and in the future
Kafka + Storm makes it easy to scale out a topology; scaling is limited only by hardware.
2. Rapid application development
Open source software evolves quickly and has community support. With a microservice approach, you only do what must be done.
3. Risks
The topologies are not yet mature, but there is no real alternative.
3. Comparison
3.1 Storm vs. Spark Streaming
Storm and Spark Streaming are both open source frameworks for distributed stream processing.
3.1.1 Processing Model and Latency
Storm processes incoming events one at a time, whereas Spark Streaming processes the stream of events that falls within a time window. Storm can therefore achieve sub-second latency per event, while Spark Streaming has a latency of several seconds.
3.1.2 Fault Tolerance and Data Guarantees
The trade-off in fault tolerance and data guarantees is that Spark Streaming provides better support for fault-tolerant stateful computation, whereas in Storm each individual record must be tracked as it moves through the system. Storm can thus guarantee that every record is processed at least once, but duplicate records may appear when recovering from failures, which means mutable state may be incorrectly updated twice.
In short, if you need sub-second latency without data loss, Storm is a good choice. If you need stateful computation with a guarantee that every event is processed exactly once, Spark Streaming is the better fit. Spark Streaming programs may also be easier to write, because the model resembles Hadoop-style batch programming.
3.1.3 Implementation and Programming APIs
Storm was initially implemented in Clojure, while Spark Streaming is written in Scala. This matters if you want to read the code or customize it yourself to see how each system works. Storm was developed at BackType and Twitter; Spark Streaming was developed at UC Berkeley.
Spark is built on the idea that, when data volumes are huge, it is more efficient to bring the computation to the data than to bring the data to the computation: each node stores (or caches) its data set, and tasks are submitted to the nodes.
Storm's architecture is the opposite of Spark's. Storm is a distributed stream computation engine: each node implements a basic processing step, and data items flow in and out across a network of interconnected nodes. Unlike Spark, Storm moves the data to the processing. Both frameworks are used for parallel computation over large amounts of data, but Storm is better at dynamically processing small chunks of data as they are generated. It is not clear which approach wins on raw throughput, but Storm has lower computation latency.
3.2 Storm vs. Hadoop
Hadoop does disk-level computation: during a job the data lives on disk and must be read from and written to disk. Storm does memory-level computation: data is fed directly into memory over the network.
Hadoop | Storm
Batch processing | Real-time processing
Jobs run to completion | Topologies run forever
Nodes are stateful | Nodes are stateless
Scalable | Scalable
Guarantees no data loss | Guarantees no data loss
Big batch processing | Fast, reactive, real-time processing
3.3 Spark vs. Hadoop
Spark keeps intermediate data in memory, which makes iterative computation more efficient.
Spark is better suited to machine learning and data mining workloads with many iterations, because Spark provides the RDD abstraction.
3.3.1 Spark Is More General than Hadoop
Spark offers many kinds of data set operations, whereas Hadoop provides only Map and Reduce. Spark's transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy, and more; it also provides actions such as count, collect, reduce, lookup, and save.
These varied data set operations make life easier for users building higher-level applications. The communication model between processing nodes is no longer restricted to Hadoop's single data-shuffle pattern: users can name and materialize intermediate results and control how they are stored and partitioned. The programming model is therefore more flexible than Hadoop's.
However, because of the nature of RDDs, Spark is not well suited to applications that make asynchronous, fine-grained state updates, such as the storage layer of a web service or an incremental web crawler and indexer; in other words, application models based on incremental modification are not a good fit.
3.3.2 Fault Tolerance
For distributed data set computation, Spark achieves fault tolerance through checkpointing, which comes in two forms: checkpointing the data and logging the updates. Users can control which approach is used.
3.3.3 Usability
Spark improves usability by providing rich Scala, Java, and Python APIs and an interactive shell.
3.4 Spark vs. Hadoop vs. Storm
3.4.1 Similarities
1. All are open source data processing frameworks;
2. All can be used for real-time business intelligence and big data analytics;
3. All are widely used for big data processing because they are relatively simple to adopt;
4. All run on the JVM, implemented in languages such as Java, Scala, and Clojure.
3.4.2 Differences
Data processing models
Hadoop MapReduce is best suited to batch processing. Applications that need real-time processing of smaller data must turn to other open source platforms such as Impala or Storm. Apache Spark is designed to do more than plain data processing: it can make use of existing machine learning libraries and process graphs. Because of its high performance, Spark can be used for both batch processing and real-time processing, handling everything on a single platform instead of splitting tasks across platforms.
Micro-batching is a special kind of batch processing in which the batch size is much smaller. Micro-batching provides stateful computation, which makes windowing easier.
Apache Storm is a stream processing architecture that can also do micro-batching via Trident.
Spark is a batch processing architecture that can do micro-batching via Spark Streaming.
The main difference between Spark and Storm is that Spark performs data-parallel computation, while Storm performs task-parallel computation.
Feature | Apache Storm/Trident | Spark Streaming
Programming languages | Java, Clojure, Scala | Java, Scala
Reliability | Supports "exactly once" processing mode; can also be used in "at least once" and "at most once" modes. | Supports only "exactly once" processing mode.
Stream source | Spout | HDFS
Stream primitives | Tuple, Partition | DStream
Persistence | MapState | Per RDD
State management | Supported | Supported
Resource management | YARN, Mesos | YARN, Mesos
Provisioning | Apache Ambari | Basic monitoring using Ganglia
Messaging | ZeroMQ, Netty | Netty, Akka
In Spark Streaming, if a worker node fails, the system can recompute from the retained copy of the input data. However, if the node running the network receiver fails, the data not yet replicated to other nodes may be lost. In short, only data backed by HDFS is safe.
In Storm/Trident, if a worker fails, Nimbus assigns the failed worker's tasks to other nodes in the system. All tuples sent to the failed node time out and are therefore replayed automatically. In Storm, too, the delivery guarantee depends on having a safe (replayable) data source.
Situation | Choice of framework
Strictly low latency | Storm can provide better latency with fewer restrictions than Spark Streaming.
Low development cost | With Spark, the same code base can be used for batch processing and stream processing; with Storm, that is not possible.
Message delivery guarantee | Both Apache Storm (Trident) and Spark Streaming offer "exactly once" processing mode.
Fault tolerance | Both frameworks are fault tolerant to roughly the same extent. In Apache Storm/Trident, if a process fails, the supervisor process restarts it automatically, since state management is handled through ZooKeeper. Spark restarts workers via the resource manager, which can be YARN, Mesos, or its standalone manager.
4. Summary
Spark is the Swiss Army knife of big data analytics tools.
Storm excels at reliably processing unbounded streams of data in real time, doing for real-time processing what Hadoop did for batch processing.
Hadoop, Spark, and Storm are all open source data processing platforms. Although their functionality overlaps, each has a different focus.
Hadoop is an open source distributed data framework, used to store large data sets and run distributed analysis and processing across clusters. Hadoop's MapReduce is limited to batch processing, which is why Hadoop is used these days more as a data warehousing tool than as a data analysis tool.
Spark does not have its own distributed storage system, so it relies on Hadoop's HDFS to store data.
Storm is a task-parallel, open source distributed computing system. Storm has its own independent workflows in the form of topologies, i.e. directed acyclic graphs. A topology runs continuously until it is interrupted or the system shuts down. Storm does not run on a Hadoop cluster; instead it uses ZooKeeper and its own worker processes to manage its processing. Storm can read and write files to HDFS.
Hadoop is hot in the big data market, but its cousins Spark and Storm are hotter.