ref:http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php
Hadoop consist of two main pieces, HDFS and MapReduce
- The HDFS is the data part of Hadoop and the HDFS server on a typical machine is called a DataNode
- The MapReduce is the processing part of Hadoop and the MapReduce server on a typical machine is called TaskTracker
MapReduce needs a coordinator which is called a JobTracker.
JobTracker
1. responsible for accepting user's job, dividing it into tasks and assigning it to individual TaskTracker. TaskTracker will run the task and
reports the status as it runs and completes.
2. responsible for noticing if the TaskTracker disappears because of software failure or hardware failure. It needs to reassign the task to
another TaskTracker
The NameNode takes the similar role to HDFS as the JobTracker does to the MapReduce
Since the hadoop project was first started, lots of other software has been built around it. Lots of them designed to make Hadoop easier to use, people who are not programmers or are businessman can use Hadoop
1. several open source project have been created to make it easier for people to query their data without having to write macros and reducers.
Hive: in Hive we just write a statement which looks like standard SQL query:
select * from ...
The Hive interpreter truns the SQL into map reduce code, which then runs on the cluster. Facebook use it intensely.
Pig: allows us to write code to analyse our data in a fairlry simple scripting lanaguage, rather than map reduce.
It is a high-level language for routing data. The code is just turned into mao reduce and run on cluster. It wirds like a compiler which translates our program into an assembly. So, the Pig does the same thing for MapReduce jobs.
** Though Hive and Pig are great, they are still running map reduce jobs, and can take a resonable time to run, especially over large amounts of data.
That is why another open source project called Impala created. It was developed
as a way to query data with SQL directly from HDFS, it does not run map reduce program. Impala is optimized for low latency queries.
Therefore, impala queries run very faster than Hive, while Hive is optimized for running long batch processing jobs.
Sqoop takes data from traditional relational databases such as Mirosoft SQL server and puts it in HDFS
Flume is for streaming data into Hadoop. It injects data as it's generated by external systems
and put it into the cluster. So if we have severs generating data continuously, we can use Flume. Like reading facebook, twitter data into HDFS
Hbase is a real time databse, built on top of HDFS, It's colum-family stored on the Google's BigTable design.
we need to read/write data in real time and HBase is a top-level Apache project meets that need. It provides a simple interface to our distributed data that allows incremental processing. HBase can be accessed by Hive and Pig by MapReduce and stores that information in its HDFS and it's guranteed to be reliable and durable. HBase is used for application such as Facebook messages.
KijiSchema provides a simple Java API and comman line interface for importing, managing, and retrieving data from HBase by setting up HBase layouts using user-friendly tools including a DDL
Hive: Hive is a data warehouse system layer built on Hadoop. It allows us to define a structure for our unstructureed
Big Data. With a HiveQL which is an SQL-like scripting language, we can simplify analysis and queries.
Hive is not a databse but uses a databse to store metadata. The data that Hive processes isstored in HDFS, Hive runs on Hadoop and
is NOT designed for on-line transaction processing because the latency for Hive queries is generally high. Therefore, Hive is NOT suited for real-tme queries. Hive is best suited for batch jobs over large sets of immutable data such as web logs
Hue: is a graphical front end to quester
Oozie: is a workflow scheduler tool it provides workflow/coordination service to manage Hadoop jobs. So, we define when we want our MapReduce jobs to run and Oozie will fire them up automatically. It also will trigger when data becomes avaliablel
Mahout: is a library for scalable machine learing and data mining
Avro: is a Serialization and RPC framework
** In fact, there are so many ecosystem projects that making them all talk to one another, and work well can be tricky
To make installing and maintaining a cluster like this easier, a company called Cloudera, has put together a distribution of HADOOP called CDH(Cloudera distribution including a patchy HADOOP) takes all the key echosystem projects, along with Hadoop itself, and packages them together so that installation is a really easy process. It is free and open source, just like Hadoop itself, While we coild install everything from scratch, it is far easier to use CDH
Zookeeper: allows distributed process to coordinate with each other through a shared hierarchical name sapce of data registers
Kafka: kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a message system, but with a unique design.
Data center orchestration - Mesos
Mesos is built using the same principle as the linux kernal, only at different level of abstraction. The Mesos kernel runs on every machine and provides application(e.g. Hadoop, Spark, Kafka, Elastic Search) with API's for resource management and scheduling across entire datacenter and cloud environments.