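The biggest differences between HiveQL (Hive SQL) and ordinary SQL
I have been using Pig all along, and now I need to work with Hive as well. I found some material online that looked useful, so I have collected it here together with my own rough reading of it. That reading is probably not entirely accurate; I will write a proper summary once I know Hive better.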
* No true date/time data types, no interval types, and many missing UDFs for manipulating dates (e.g. ADD_MONTH)
* Strict type matching without support for automatic coercion or typed literals (e.g. CASE <bigint expr> WHEN 1 THEN ... END)
* All queries must reference a table (no 'dual' or table-less queries)
* No session-scoped temp tables
* No 'IN' predicate
* No 'FIND' string search function for producing the offset to a match
* No find/replace string functions for plain strings (i.e. not regex)
* XPATH UDFs cannot return a string representing an entire subtree in the DOM, which prevents composition.
* Few mechanisms for collapsing arrays to scalar types (e.g. 'join' complement of string 'split'; aggregations other than 'size' for numeric arrays; etc.)
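My rough reading of these points:
1. HiveQL has no true date/time data types and no interval types, and it lacks many UDFs for manipulating dates (e.g. ADD_MONTH).
2. HiveQL enforces strict type matching, with no automatic coercion and no typed literals (e.g. CASE <bigint expr> WHEN 1 THEN ... END is not accepted). My understanding is that a bigint value will not be converted to int for you automatically.
3. Every HiveQL query must reference a table; there is no 'dual' table and no table-less query.
4. HiveQL has no session-scoped temporary tables.
5. HiveQL has no IN predicate.
6. HiveQL has no FIND function that returns the offset of a match, and no find/replace functions for plain (non-regex) strings.
7. The XPATH UDFs in HiveQL cannot return a string representing an entire DOM subtree, which prevents composing them.
8. HiveQL has few mechanisms for collapsing arrays to scalar types (e.g. no 'join' counterpart to the string 'split', and no aggregations other than 'size' for numeric arrays).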
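For the point that every query must reference a table, a common workaround is a one-row helper table. The sketch below is only an illustration: it is written in Python with the standard-library sqlite3 module standing in for Hive (an assumption made so the example runs anywhere), and the table name dual and its column dummy are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in engine here; it is not Hive.
conn = sqlite3.connect(":memory:")

# Hypothetical one-row helper table playing the role of Oracle's DUAL.
conn.execute("CREATE TABLE dual (dummy TEXT)")
conn.execute("INSERT INTO dual VALUES ('X')")

# A "table-less" expression is evaluated by selecting it from the one-row table,
# so the query still references a table as the list above requires.
row = conn.execute(
    "SELECT 1 + 1 AS two, upper('hive') AS name FROM dual"
).fetchone()
print(row)
```

In Hive itself the single row would probably have to be loaded from a one-line file rather than inserted directly; the SELECT would then take the same shape.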
===========================================================================================================================================================
1. No windowing functions, e.g. SUM(sales) OVER (PARTITION BY date). It's difficult to do a lot of things common in warehousing, like a running sum, without writing custom mappers/reducers or a UDF.
2. No regular UNION, INTERSECT, or MINUS operators.
3. Null values are treated differently from empty strings and are exported differently, i.e. empty strings are exported as ' ' and nulls are exported as nulls. I know this isn't unique to Hive, but it is still annoying when exporting data from Hive into another system.
4. No hierarchical/self-referencing queries. I know most distributed computing solutions can't do this, but it can be very handy.
5. No UPDATE or DELETE statements.
6. I haven't been able to find any kind of cost-based explain plan. Running EXPLAIN generally just shows the path of accessing data. That is useful to some degree, but it would be great if it were more advanced, so that it could help the user understand which steps cause the biggest slowdowns.
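The running sum mentioned in point 1 can be approximated without window functions by joining a table to itself and summing every row up to the current one. The sketch below is a minimal illustration under assumptions: sqlite3 stands in for Hive, and the daily_sales table, its columns, and the sample rows are all hypothetical. The range condition sits in WHERE rather than ON because older Hive versions are generally described as accepting only equality conditions in ON; whether Hive accepts this exact text depends on its version and settings.

```python
import sqlite3

# SQLite is only a stand-in engine here; it is not Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, sales INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2012-01-01", 10), ("2012-01-02", 20), ("2012-01-03", 5)],
)

# Pair each day (a) with every day up to and including it (b), then sum b.sales.
# Dates are ISO-formatted strings, so string comparison matches date order.
running_sum_sql = """
SELECT a.day, SUM(b.sales) AS running_sales
FROM daily_sales a
JOIN daily_sales b
WHERE b.day <= a.day
GROUP BY a.day
ORDER BY a.day
"""
running = conn.execute(running_sum_sql).fetchall()
for day, total in running:
    print(day, total)
```

The self-join compares every row with every earlier row, so its cost grows quadratically with the number of rows; that cost is part of why the missing window functions hurt.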
=======================================================================================================================================================================
1. For the row format, the only supported line-termination delimiter is '\n'.
2. Hive does not support queries that select from tables in more than one database.
3. Hive does not support sub-queries such as those connected by IN/EXISTS in the WHERE clause.
4. Hive does not support the truncation of data from a table.
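An uncorrelated IN sub-query in the WHERE clause, as in point 3, can usually be rewritten as a join against the distinct keys of the inner query. The sketch below is a minimal illustration under assumptions: sqlite3 stands in for Hive, and the orders and vip_customers tables and their contents are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in engine here; it is not Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE vip_customers (customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100), (2, 200), (3, 100), (4, 300)],
)
# Customer 300 appears twice on purpose, to show why DISTINCT is needed.
conn.executemany("INSERT INTO vip_customers VALUES (?)", [(100,), (300,), (300,)])

# Intended meaning:
#   SELECT ... FROM orders
#   WHERE customer_id IN (SELECT customer_id FROM vip_customers)
# Rewritten as a join on the de-duplicated keys so no order row is repeated.
semi_join_sql = """
SELECT o.order_id, o.customer_id
FROM orders o
JOIN (SELECT DISTINCT customer_id FROM vip_customers) v
  ON o.customer_id = v.customer_id
ORDER BY o.order_id
"""
matched = conn.execute(semi_join_sql).fetchall()
print(matched)
```

Hive's join documentation also describes LEFT SEMI JOIN for this purpose. It is not used here because SQLite does not support it.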
===========================================================================================================================================================