  • Big Data Ingestion and streaming product introduction

    Flume

    Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data streams in situ. Flume and Chukwa share similar goals and features, but there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper. In contrast, Chukwa distributes this information more broadly among its services. Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machine are responsible for deciding what data to send.
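
    One way to picture the “hop-by-hop” model is as a chain of agents, each of which buffers events and drops them only after the next hop has accepted them. The sketch below is a toy model in plain Python, not Flume's API: the Hop class, its channel buffer, and run_chain are hypothetical names, and the last hop's delivered list stands in for HDFS.

```python
from collections import deque


class Hop:
    """One agent in the chain: buffers events, then hands them to the next hop."""

    def __init__(self, name, next_hop=None):
        self.name = name
        self.next_hop = next_hop
        self.channel = deque()   # events this hop is still responsible for
        self.delivered = []      # used only by the last hop (stand-in for HDFS)

    def accept(self, event):
        """Take responsibility for an event; True acts as the acknowledgement."""
        self.channel.append(event)
        return True

    def drain(self):
        """Move buffered events one hop on; drop each only once it is acknowledged."""
        while self.channel:
            event = self.channel[0]
            if self.next_hop is None:
                self.delivered.append(event)      # terminal hop writes the event out
            elif not self.next_hop.accept(event):
                break                             # keep the event and retry later
            self.channel.popleft()


def run_chain(lines):
    """Push log lines through a two-hop chain: agent -> collector -> store."""
    collector = Hop("collector")
    agent = Hop("agent", next_hop=collector)
    for line in lines:
        agent.accept(line)
    agent.drain()       # hop 1: agent hands events to the collector
    collector.drain()   # hop 2: collector writes events to the final store
    return collector.delivered
```

    Calling run_chain(["GET /a", "GET /b"]) returns the two lines in order. The point of the sketch is that responsibility for an event moves one hop at a time, so no hop needs to know the whole route.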

    Chukwa

    Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS still does not support appending to existing files. Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Its durability features include agent-side replaying of data to recover from errors. See also Flume.
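
    Agent-side replay can be sketched as an agent that remembers how far the collector has acknowledged its data and resends from that point after a failure. This is a toy model under stated assumptions, not Chukwa's implementation: ReplayingAgent, FlakyCollector, and the integer checkpoint are hypothetical names chosen for illustration.

```python
class ReplayingAgent:
    """Holds log records and a checkpoint of what has been acknowledged."""

    def __init__(self, records):
        self.records = list(records)
        self.checkpoint = 0   # index of the first record not yet acknowledged

    def send_all(self, collector):
        """Send from the checkpoint onward; return False if the collector failed."""
        for i in range(self.checkpoint, len(self.records)):
            try:
                collector(self.records[i])
            except OSError:
                return False          # checkpoint is unchanged, so record i is replayed
            self.checkpoint = i + 1   # advance only after a successful send
        return True


class FlakyCollector:
    """Hypothetical collector that fails exactly once, on its Nth call."""

    def __init__(self, fail_on_call):
        self.fail_on_call = fail_on_call
        self.calls = 0
        self.received = []

    def __call__(self, record):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise OSError("simulated collector failure")
        self.received.append(record)


def demo():
    agent = ReplayingAgent(["line 1", "line 2", "line 3"])
    collector = FlakyCollector(fail_on_call=2)
    while not agent.send_all(collector):
        pass   # retry: the agent resumes from its checkpoint
    return collector.received
```

    In demo(), the second send fails, the agent retries from its checkpoint, and the collector ends up with all three lines exactly once.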

    Sqoop

    Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates.
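
    The difference between a snapshot and an incremental update can be illustrated with an in-memory SQLite table: a snapshot copies every row, while an incremental pass copies only rows added since the last one seen. This sketch does not call Sqoop; the users table, the id check column, and the function names are assumptions made for the example.

```python
import sqlite3


def snapshot(conn):
    """Full import: copy every row of the source table."""
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


def incremental(conn, last_id):
    """Incremental import: copy only rows whose id is above the last one seen."""
    return conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

    imported = snapshot(conn)          # first run: take everything
    last_id = imported[-1][0]          # remember the highest id imported so far

    conn.execute("INSERT INTO users VALUES (?, ?)", (3, "cy"))
    imported += incremental(conn, last_id)   # later run: only the new row
    conn.close()
    return imported
```

    Tracking a monotonically increasing column is only one way to detect new rows; a table whose existing rows are updated in place would need a different check, such as a last-modified timestamp.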

    Kafka

    Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high-throughput persistent messaging that is scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize I/O performance, and mirroring to improve availability and scalability and to optimize performance in multiple-cluster scenarios.
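
    Persistent publish-subscribe messaging can be modelled as an append-only log per partition, where reading does not remove messages and each subscriber keeps its own offset. The sketch below is an in-memory toy, not the Kafka client API: the Topic class, its publish and read methods, and the use of zlib for compression are all assumptions for illustration.

```python
import zlib


class Topic:
    """A topic split into partitions; each partition is an append-only log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Append a compressed message to the partition chosen by the key."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(zlib.compress(value.encode("utf-8")))
        return index

    def read(self, index, offset):
        """Return messages from `offset` onward, plus the next offset to read."""
        log = self.partitions[index]
        messages = [zlib.decompress(m).decode("utf-8") for m in log[offset:]]
        return messages, len(log)


def demo():
    topic = Topic(num_partitions=2)
    p = topic.publish("host-1", "first")
    topic.publish("host-1", "second")        # same key, so same partition

    a, a_offset = topic.read(p, 0)           # subscriber A reads from the start
    b, _ = topic.read(p, 0)                  # subscriber B sees the same messages

    topic.publish("host-1", "third")
    a_more, _ = topic.read(p, a_offset)      # A resumes from its own offset
    return a, b, a_more
```

    Because partitions are independent logs, separate readers could load different partitions in parallel, which is the property the paragraph above describes for loading data into Hadoop.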

    Storm

    Hadoop is ideal for batch-mode processing over massive data sets, but it doesn’t support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology, you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts, and each processing node is called a bolt. Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.
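
    The spout-and-bolt structure can be sketched with Python generators: a spout yields events, and each bolt consumes one stream and emits another. This is a single-process illustration, not Storm's API, and every function name here is hypothetical. It also runs over a finite list so that it terminates, whereas a real topology would keep running.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the data source, emitting one sentence at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: split each incoming sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream):
    """Bolt: keep a running count and emit (word, count) after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences):
    """Wire spout -> split -> count and collect the latest count per word."""
    totals = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        totals[word] = count   # each event updates the result immediately
    return totals
```

    For example, run_topology(["the cow", "The moon"]) yields a count of 2 for "the" and 1 each for "cow" and "moon". Unlike a batch job, the counts are updated as each word arrives rather than once at the end.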

  • Original article: https://www.cnblogs.com/jiangu66/p/3196624.html