zoukankan      html  css  js  c++  java
  • Lambda Architecture: Achieving Velocity and Volume with Big Data

    http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/

    Big data architecture paradigms are commonly separated into two (supposedly) diametrical models, the more traditional batch and the (near) real-time processing. The most popular technologies representing the two are Hadoop with MapReduce and Storm. However, a hybrid solution, the Lambda Architecture, challenges the idea that these approaches have to exclude each other. The Lambda Architecture combines a slow and fast lane of data processing to achieve the best of both worlds. Fast results and deep, large scale processing.

    Usually one or the other architecture has been implemented due to a business requirement. Commonly, business users or customers eventually arrive at the point where they either would like to get a more historic view or more real time insight either of which can not be provided by the deployed architecture. At this point a hybrid solution becomes the only realistic solution. One which brings some surprising benefits with it.

    Lambda Architecture explained

    The Lambda Architecture centrally receives data and does as little as possible processing before copying and splitting the data stream to the real time and batch layer. The batch layer collects the data in a data sink like HDFS or S3 in its raw form. Hadoop jobs regularly process the data and write the result to a data store.

     Lambda architecture duplicates incoming data and processes them in parallel at different speeds


    Lambda architecture duplicates incoming data and processes them in parallel at different speeds

    Since this process is fully batched the data store can have some significant simplification. It should support random reads, i.e. needs some kind of index, however, it can do away with random writing, locking, and consistency issues. This simplifies the store significantly. An example of such a system is ElephantDB.

    The problem with batch processing is the time it takes. For example, the above process may take hours or days. In the meantime data has been arriving and subsequent processes or services continue to work with hours or days old information. The real time layer solves this by taking its copy of the data and processing it in seconds or minutes and stores it in a fast random read and write store. This store is more complex since it has to be constantly updated.

    The complexity of the real time layer and it’s store is manageable since it only has to store and serve a sliding window of data, which needs to be roughly as long as the batch process takes. Both layers’ results are merged and real time information is replaced in favour of batch layer data. In many cases this enables for the real time process to work with good approximations since its results are replaced by highly precise data within a short period.

    Lambda Architecture benefits

    The addition of another layer to an architecture has major advantages. Firstly, data can (historically) be processed with high precision and involved algorithms without losing short-term information, alerts, and insights provided by the real time layer. Secondly, the addition of a layer is offset by dramatically reducing the random write storage requirements. The batch write storage provides also the option to switch data at predefined times and version data.

    Lastly and importantly, the addition of the data sink of raw data offers the option to recover from human mistakes, i.e. deploying bugs which write erroneous aggregated data from which other architectures can not recover. Another option is to retrospectively enhance data extraction or learning algorithms and apply them on the whole of the historic dataset. This is extremely helpful in agile and startup environments where MVPs push what can be done down the track.

  • 相关阅读:
    ZJOI2006 物流运输
    codevs 1403 新三国争霸
    (一) MySQL学习笔记:MySQL安装图解
    多线程同步
    SendMessage和PostMessage区别
    VS2008 MFC 配置 Gdiplus
    IE7常用的几个快捷键 你常用的是哪个
    匆匆的六年 收获了什么
    python 代码题07 sorted函数
    python 代码题06 回数是指从左向右读和从右向左读都是一样的数,例如12321,909。请利用filter()筛选出回数
  • 原文地址:https://www.cnblogs.com/dadadechengzi/p/12639176.html
Copyright © 2011-2022 走看看