zoukankan      html  css  js  c++  java
  • Big Data Ingestion and streaming product introduction

    Flume

    Flume isdistributed system for collecting log data from many sources, aggregating it,and writing it to HDFS. It is designed to be reliable and highly available, whileproviding a simple, flexible, and intuitive programming model based onstreaming data flows. Flume provides extensibility for online analyticapplications that process data stream in situ. Flume and Chukwa share similar goalsand features. However, there are some notable differences. Flume maintains acentral list of ongoing data flows, stored redundantly in Zookeeper. Incontrast, Chukwa distributes this information more broadly among its services.Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machineare responsible for deciding what data to send.

    Chukwa

    Log processing wasone of the original purposes of MapReduce. Unfortunately, Hadoop is hard to usefor this purpose. Writing MapReduce jobs to process logs is somewhat tediousand the batch nature of MapReduce makes it difficult to use with logs that aregenerated incrementally across many machines. Furthermore, HDFS stil does notsupport appending to existing files. Chukwa is a Hadoop subproject that bridgesthat gap between log handling and MapReduce. It provides a scalable distributedsystem for monitoring and analysis of log-based data. Some of the durabilityfeatures include agent-side replying of data to recover from errors. See alsoFlume.

    Sqoop

    Apache Sqoop is atool designed for efficiently transferring bulk data between Apache Hadoop andstructured datastores such as relational databases. It offers two-wayreplication with both snapshots and incremental updates.

    Kafka

    Apache Kafka is adistributed publishes-subscribe messaging system. It is designed to providehigh throughput persistent messaging that’s scalable and allows for paralleldata loads into Hadoop. Its features include the use of compression to optimizeIO performance and mirroring to improve availability, scalability and tooptimize performance in multiple-cluster scenarios.

    Storm

    Hadoop is ideal forbatch-mode processing over massive data sets, but it doesn’t supportevent-stream (a.k.a. message-stream) processing, i.e., responding to individualevents within a reasonable time frame. (For limited scenarios, you could use aNoSQL database like HBase to capture incoming data in the form of appendupdates.) Storm is a general-purpose, event-processing system that is growingin popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses acluster of services for scalability and reliability. In Storm terminology youcreate a topology that runs continuously over a stream of incoming data, whichis analogous to a Hadoop job that runs as a batch process over a fixed data setand then terminates. An apt analogy is a continuous stream of water flowingthrough plumbing. The data sources for the topology are called spouts and eachprocessing node is called a bolt. Bolts can perform arbitrarily sophisticatedcomputations on the data, including output to data stores and other services.It is common for organizations to run a combination of Hadoop and Stormservices to gain the best features of both platforms.

  • 相关阅读:
    androidstudio配置http proxy以及配置gradle
    发布jar到docker的方法
    idea直接发布项目到docker中
    安卓启动相机报错android.os.FileUriExposedException: file:///storage/emulated/0/
    centos 7.4搭建Harbor、push docker镜像以及常见错误
    dock的卸载与安装
    centos部署Kubernetes(k8s)集群
    Logback将日志输出到Kafka配置示例
    JavaScript设计模式 Item 5 --链式调用
    你不知道的JavaScript--Item27 异步编程异常解决方案
  • 原文地址:https://www.cnblogs.com/jiangu66/p/3196624.html
Copyright © 2011-2022 走看看