  • Hadoop 1 ecosystem

    ref:http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php

     

    Hadoop consists of two main pieces: HDFS and MapReduce.

    • HDFS is the data part of Hadoop; the HDFS server running on a typical machine is called a DataNode.
    • MapReduce is the processing part of Hadoop; the MapReduce server running on a typical machine is called a TaskTracker.

    MapReduce needs a coordinator which is called a JobTracker.

    JobTracker

      1. responsible for accepting a user's job, dividing it into tasks, and assigning the tasks to individual TaskTrackers. Each TaskTracker runs its tasks and reports their status as they run and complete.

      2. responsible for noticing when a TaskTracker disappears because of a software or hardware failure, and reassigning its tasks to another TaskTracker.

    The NameNode plays the same role for HDFS as the JobTracker does for MapReduce.
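
    To make the MapReduce side concrete, here is a minimal word-count sketch using the standard Hadoop Java MapReduce API (the input and output paths are taken from the command line and are hypothetical). On a Hadoop 1 cluster the JobTracker splits this job into tasks and hands them to TaskTrackers:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: for every word in a line, emit (word, 1)
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sum the 1s emitted for each word
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");  // Hadoop 1.x style constructor
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }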

     

    Since the Hadoop project was first started, a lot of other software has been built around it, much of it designed to make Hadoop easier to use, so that people who are not programmers, such as business people, can use Hadoop.

    1. Several open source projects have been created to make it easier for people to query their data without having to write mappers and reducers.

    Hive: in Hive we just write a statement that looks like a standard SQL query:

    select * from ...

    The Hive interpreter turns the SQL into MapReduce code, which then runs on the cluster. Facebook uses it intensively.
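
    As a rough sketch (not from the original article), the same kind of query can also be submitted from Java through Hive's JDBC driver; the HiveServer2 address, the user, and the weblogs table below are assumptions made only for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Assumes a HiveServer2 instance on localhost:10000 and a hypothetical "weblogs" table.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 // Hive turns this SQL-like statement into one or more MapReduce jobs on the cluster.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }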

    Pig: allows us to write code to analyse our data in a fairly simple scripting language, rather than in MapReduce.

    It is a high-level language for routing data. The code is simply turned into MapReduce and run on the cluster. It works like a compiler that translates our program into assembly: Pig does the same kind of translation, but into MapReduce jobs.
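
    As a hedged illustration, a small Pig Latin script can be run from Java via PigServer; the input path and field names here are invented:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // Run Pig Latin statements on the cluster; Pig compiles them into MapReduce jobs.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);

            // Hypothetical tab-separated web log data in HDFS.
            pig.registerQuery("logs = LOAD '/data/weblogs' AS (user:chararray, url:chararray);");
            pig.registerQuery("grouped = GROUP logs BY url;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group AS url, COUNT(logs) AS hits;");

            // Storing the result is what triggers the actual MapReduce execution.
            pig.store("counts", "/data/weblog_counts");
            pig.shutdown();
        }
    }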

    ** Though Hive and Pig are great, they still run MapReduce jobs, which can take a considerable time to run, especially over large amounts of data.

     That is why another open source project called Impala was created. It was developed as a way to query data with SQL directly from HDFS; it does not run MapReduce programs. Impala is optimized for low-latency queries.

    Therefore, Impala queries run much faster than Hive queries, while Hive is optimized for running long batch-processing jobs.

     Sqoop takes data from traditional relational databases, such as Microsoft SQL Server, and puts it into HDFS.

     Flume is for streaming data into Hadoop. It ingests data as it is generated by external systems and puts it into the cluster. So if we have servers generating data continuously, we can use Flume, for example to read Facebook or Twitter data into HDFS.

     HBase is a real-time database built on top of HDFS. It is a column-family store based on Google's BigTable design.

     When we need to read/write data in real time, HBase is a top-level Apache project that meets that need. It provides a simple interface to our distributed data that allows incremental processing. HBase can be accessed by Hive and Pig and by MapReduce jobs, stores its information in HDFS, and is guaranteed to be reliable and durable. HBase is used for applications such as Facebook Messages.
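
    A minimal sketch of the HBase Java client, assuming a hypothetical "messages" table with a column family "m", looks roughly like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads cluster settings (e.g. the ZooKeeper quorum) from hbase-site.xml on the classpath.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "messages");   // hypothetical table with family "m"

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("user42"));
            put.add(Bytes.toBytes("m"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
            table.put(put);

            // Read it back in real time; no MapReduce job is involved.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] body = result.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"));
            System.out.println(Bytes.toString(body));

            table.close();
        }
    }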

     KijiSchema provides a simple Java API and command-line interface for importing, managing, and retrieving data from HBase, setting up HBase layouts using user-friendly tools including a DDL.

    Hive: Hive is a data warehouse system layer built on Hadoop. It allows us to define a structure for our unstructured Big Data. With HiveQL, an SQL-like scripting language, we can simplify analysis and queries.

    Hive is not a database but uses a database to store metadata. The data that Hive processes is stored in HDFS. Hive runs on Hadoop and is NOT designed for on-line transaction processing, because the latency of Hive queries is generally high. Therefore, Hive is NOT suited for real-time queries; it is best suited for batch jobs over large sets of immutable data such as web logs.

    Hue: is a graphical front end to the cluster.

    Oozie: is a workflow scheduler tool; it provides a workflow/coordination service to manage Hadoop jobs. We define when we want our MapReduce jobs to run, and Oozie will fire them up automatically. It can also trigger jobs when data becomes available.

    Mahout: is a library for scalable machine learning and data mining.

    Avro: is a Serialization and RPC framework
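
    As a small illustrative sketch (the "User" schema here is made up), serializing and reading a record with Avro's generic Java API looks roughly like this; the schema travels inside the container file:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
        public static void main(String[] args) throws Exception {
            // A hypothetical record schema, defined inline as JSON.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 30);

            // Serialize to an Avro container file.
            File file = new File("users.avro");
            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
            writer.create(schema, file);
            writer.append(user);
            writer.close();

            // Deserialize: the reader picks the schema up from the file itself.
            DataFileReader<GenericRecord> reader =
                new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
            while (reader.hasNext()) {
                System.out.println(reader.next());
            }
            reader.close();
        }
    }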

    ** In fact, there are so many ecosystem projects that making them all talk to one another and work well together can be tricky.

    To make installing and maintaining a cluster like this easier, a company called Cloudera has put together a distribution of Hadoop called CDH (Cloudera's Distribution including Apache Hadoop). It takes all the key ecosystem projects, along with Hadoop itself, and packages them together so that installation is a really easy process. It is free and open source, just like Hadoop itself. While we could install everything from scratch, it is far easier to use CDH.

    ZooKeeper: allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers.
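
    A minimal sketch of the ZooKeeper Java client, assuming an ensemble on localhost:2181 and an invented "/demo-config" znode:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical ensemble address, 3000 ms session timeout, no-op watcher.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

            // Create a znode in the shared hierarchical namespace if it does not exist yet.
            if (zk.exists("/demo-config", false) == null) {
                zk.create("/demo-config", "v1".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Any process connected to the same ensemble sees the same data register.
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }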

    Kafka: Kafka is a distributed, partitioned, replicated commit-log service. It provides the functionality of a messaging system, but with a unique design.
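
    As a hedged sketch using the Kafka Java producer client (the broker address and the "weblogs" topic are assumptions for illustration), publishing a message looks roughly like this:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaProducerExample {
        public static void main(String[] args) {
            // Hypothetical broker address and topic name.
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);

            // Each message is appended to a partition of the topic's commit log
            // and replicated to other brokers according to the topic configuration.
            producer.send(new ProducerRecord<String, String>("weblogs", "user42", "clicked /home"));

            producer.close();
        }
    }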

    Data center orchestration - Mesos

    Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g. Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
