zoukankan      html  css  js  c++  java
  • SQL for HBase

    本文转自:http://www.orzota.com/sql-for-hbase/

    HBase provides random, read-write access to Big Data stored in very large tables as a distributed columnar store. HBase thus gained immediately popularity as a Big Data technology that unlike Hadoop which was primarily used in the backend data warehousing infrastructure could be deployed to service online transactions as well. Just as Hive brought SQL to Hadoop (and now to HBase as described in my earlier blog post), there are many alternate projects providing SQL for HBase which is the subject of this blog article.

    Phoenix

    Phoenix is a technology developed by Salesforce.com to put a SQL skin over HBase. This is not adding a new layer, rather it exposes HBase functionality through SQL using an embedded JDBC Driver that allows clients to run at native HBase speed. The JDBC driver compiles SQL into native HBase calls.

    The claim is that performance of Phoenix is much better than that of Hive over HBase – upto 10x at 100 million rows. Performance is also more constant as the number of rows increases. Optimizations are transparent to the user.

    Impala

    Cloudera has been promoting Impala heavily since its announcement in Strata last year. Impala is a SQL engine that can run on HDFS or HBase or both. They too claim that Impala performance is much better than that of Hive – over 45 times! However this comparison seems to be based on native Hive, not Hive over HBase. The performance gains are achieved by not writing intermediate results to disk (like MapReduce), no spin up/down times, optimized code as Impala is written in C++ rather than Java. This means that it needs to use JNI to make calls into Java for HBase integrations.

    Impala also includes authentication via Kerberos support, JDBC/ODBC drivers (similar to Phoenix), an interactive shell and support for popular hadoop file formats such as parquet, avro, thrift and sequence files. Parquet was specifically engineered for Impala to get the best performance.

    The use cases for HBase and Impala include the following:

    • Data is being streamed in
    • Need to run queries on data as it is being created
    • Need real time edits of data in-place
    • Need random access to computed results (results can be placed in HBase)

    Drill

    Apache Drill (the only in this series which is an Apache project) was inspired by Google Dremel. It is the most interesting in terms of its emphasis on interactive analysis of large scale datasets. It is similar to Impala in many ways but is community driven. Although Hive is also a community project, the need for Drill rose because of the tight coupling of Hive to MapReduce which is based on a pessimistic execution principle i.e. all jobs will be long running and process lots of data. This causes a lot of overhead for short-running jobs which Drill addresses.

    Drill does great when execution times of 100ms to 10 seconds are required. Phoenix provides even better response times – below 10ms.

    Drill is unique in its focus on analytical queries, so a lot of emphasis is placed on bulk reads and aggregations. Phoenix is more focused on write optimizations as in the RDBMS world.

    Drill also provides extensions to SQL for nested data types similar to Big Query. Many systems nest data structures inside of HBase cells (for example a cell can be a whole log record which in turn has many nested fields). It is much better to treat these nested cells and process them in the query language for optimal performance.

    Drill leverages recent research approaches (late record materialization, vectorized operators, etc.) As such it definitely seems the most promising for analytical work.

    Summary

    With all of these different options and rich technologies that provide SQL for HBase, HBase’s popularity is bound to grow. Of course with choice comes decisions. Which technology is right for you? Orzota can help analyze your use case and recommend the right set of technologies based not just on technical merits alone, but also taking into consideration its maturity, cost and vendor support. Please contact usfor more information.

  • 相关阅读:
    Apple Swift编程语言入门中文教程
    WWDC 2014 Session 208/231 CloudKit 读书笔记
    微信开源项目解说使用公开课
    winform 实现彩票功能
    c#+windows api SetWindowsHookEx 全局钩子 demo 下载
    GPS-Graph Processing System Graph Coloring算法分析 (三)
    00105_UDP和TCP协议
    雷林鹏分享:jQuery EasyUI 表单
    雷林鹏分享:jQuery EasyUI 表单
    雷林鹏分享:jQuery EasyUI 表单
  • 原文地址:https://www.cnblogs.com/kivi/p/3214097.html
Copyright © 2011-2022 走看看