zoukankan html css js c++ java

hive 学习笔记

下文中的部分例子来源于《hive学习指南》和易百教程，但是总结大部分是自己写的。

hive　官方文档：

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTableProperties

易百教程：https://www.yiibai.com/hive

hive 学习笔记：

一、基础

1. 数据类型：

类型	解释
tinyint	1 byte 整数
smallint	2
int	4
bigint	8 byte
boolean	boolean
float	单精度浮点数
double	双进度浮点数
timestamp	整数、浮点数或字符串
binary	字节数组
string	字符串
decimal

集合：

数据类型	描述	语法示例
struct	C里的结构体，类似与class，可以用. 访问元素	struct('a', 'b', 'c')/ DDL: struct<street:string, city:string, zip:int>
map	k-v集合，可以用[key]访问元素	map('firt':'join', 'last':'kobe') / DDL : map<string, float>
array	数组[a, b] ,可以i用 d[0]访问a元素，d[1]访问b元素	array('john', 'kobe') /DDL：array<string>

二、HQL

2 . database 创建、删除 :

创建：

1 create database db_name with DBproperties('creator'='', 'date'='')

(示例 create SCHEMA learn_hive3 with DBPROPERTIES('creator'='hanks', 'date' = '2019-10-16 16:30:00'))

显示数据库db_name所在文件路径：

describe database db_name;

返回了一个db_name, comment, location (hdfs://localhost:9000/usr/hive/ware/learn_hive3.db), owner_name, owener_name, parater

使用当前db：

use db_name;

（设置属性cli中显示当前数据库：set hive.cli.print.current.db = true)

删除数据库：

drop database IF EXIST db_name CASCADE;

(示例： DROP database if EXISTS learn_hive2 CASCADE;)

当表是内部表时，drop会删除表的内的数据和元数据; 但是当表是外部表，数据依然在文件系统里，但是元数据会被删除

（注：默认是RESTRICT, 　当hive中含有表时，删除失败；　CASCADE：先自动删除表，再删除数据库）

修改数据库：修改数据库属性，通过 alter databa 设置 DBPROPERTIES（key=value)

alter database db_name set DBPROPERTIES('editor'="hanks")

3. 创建表、修改表、删除表：

创建管理表/内部表 ：

CREATE TABLE IF NOT EXIST db.table_name(
col_name   data_type COMMENT 'the comment',
col_name2 data_type COMMENT 'the comment',
)
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]


注意：加 if not exist , hive忽略表 大小写

创建外部表

create external table if not exist db.table_name/table_name();

注意：

管理表存储于：hive.metastore.warehouse.dir，内部表/管理表不适合数据共享

外部表：hive不任务完全拥有数据，删除表时不会删除数据（除非，删除元数据）

示例：表创建、load data 、 select map

create external TABLE if not EXISTS cus_user(
cid int comment 'the user id ',
use_name string comment 'name of customer',
age string comment ' the age of customer',
info map<string, string>,
hobby array<string>
)
comment 'customer table '                                         
ROW format delimited   
fields terminated by '	'  
COLLECTION ITEMS TERMINATED BY '-'   
map keys terminated by ','
lines terminated by '
'
stored as textfile
comment 'customer table '                                         # 表注释

ROW format delimited   # 行格式字段

fields terminated by '
'  # 字段终止符号（列分隔符）

COLLECTION ITEMS TERMINATED BY '-'               # 集合分隔副， 所有集合元素之间分隔（例如数据array为 a:b:c, 识别后:[a,b,c]
map keys terminated by ':' #map的 k-v 的分隔符

lines terminated by '
'# 行终止符， 行终止符必须在其他分隔符后

stored as textfile# 保存的类型文件，   文件为纯文本时 textfile，序列：sequence ; 大多数为textfile

注意：

MAP KEYS TERMINATED BY ':' # map 集合中，key-value 分隔符为 :

COLLECTION ITEMS TERMINATED BY '-' # 集合中， map里的item之间分隔， array各元素之间，struct各元素之间

重点：row format delimited, 必须在其它子句之前（除stored as） 行终止符，必须在其他终止符后面！！！！！！！！！！！！！

数据:（因为列分割符为',' 为了区分各列， map的key-value之间以 ':' 分割，map的item之间用'-'分割 array元素之间用'-'分割，

1,tom,22,company:aws-partner:it, baseball-pingpang-basketball
2,kobe,35,company:apple-partner:it, swim-jump
3,tim,25,company:google-partner:it, football-soccer
4,tony,33,company:sk telecom-partner:it, pingpang-soccer
5,lebron,18,company:softban-partner:it, soccer-baseball
6,bronny,15,company:braun-partner:hr, basketball-swim-football

describe extended table_name                             # 查看是否是 管理表/外部表

拷贝前一张表：show columns from use_name1;

create table if not exist db2  like db1 # 复制表db1，命名位db2
create external table if not exist db2  like db1

load 数据：

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename   
[PARTITION (partcol1=val1, partcol2=val2 ...)]


# overwrite， 是否覆盖插入前的数据
load data local inpath 'file_path' into table learn_hive3.cus_user;  # file_path 为文件路径

select info['company'] from learn_hive3.cus_user;          # 这样就可以查看 k

修改表：

# 表重命名
alter table table_name rename to new_name;                         
    exp: alter table  user_name rename to customer

# 添加列
alter table table_name add columns (col_1  , col_2,, )             
    exp: alter table learn_hive3.customer add COLUMNS (address string, company string)
# 修改列
alter table table_name change col_name  new_name new_type     
    exp：alter table table_name change c_id  c_ids  string    ( 改变名称，且 int ->string   )

# 替换列/删除列, 注意：　删除列只能用replace, 而不能用drop 但是网上大多数例子都是drop
alter table table_name  replace columns (col1_name  data_type new_name data_type, col2_name)

    alter table learn_hive3.customer replace columns(u_id int, user_name string, u_age string, address string, company string)
    # 删除前customer(u_id, user_name, u_age, address string, company string, c_id), 将c_id删除

 # 修改表注释
ALTER TABLE table_name SET TBLPROPERTIES table_properties;  # table_properties: (property_name = property_value, property_name = property_value, ... )
exp：
    ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);

注意：以下情况数据类型转换是失败的（数据从高阶->低阶不允许）：

int -> float, double 成功，反之失败；

int、float、double -> string 成功，反之失败；

mysql中的char(10) -> char(20)成功，反之失败；

不论表或分区的保护模式如何，ALTER TABLE ADD或REPLACE COLUMNS CASCADE都会覆盖表分区的列元数据。谨慎使用

删除表：

drop table  [if  exists]  table_name

4.分区表（Partition)

数据分区使用分区来水平分散压力，将数据从物理上转移到和使用最频繁的用户最近的地方。

可以显著提高查询性能

创建分区/外部分区表加external ：

create[external] table  tb_name (col1 type, col2 type) 
partitioned by (col1 type , col2 type)

用一个示例来解释原理：

exp:

create table employee(
    name string,
   salary float,
   subordinates  array<string>,
   address   struct<stree:string ,  city: string,  state: string,  zip: int>        
)
partitioned by (country string,  state string);

上述的例子是将 country、 state作为分区依据， employee表下就会有分区结构的子目录：

例如：

/employee/country=us/state=AA

/employee/country=ca/state=BB /employee/country=cn/state=CC

相当于使用country值不同生成不同的目录，再根据state不同依次产生不同的目录；

当查询时：

select * from employee where country='US' and state = 'IL'

hive会在表employee ---------------> 查询country='US'的目--------->再在US目录下查找 state = 'IL'目录------------>查到后会获取IL目录下的所有内容

此时where country='US' and state = 'IL', 含有分区字段的where过滤，叫 分区过滤器

BUT：当分区字段过多时，不加where区分查询时，会触发巨大的map-reduce任务；hive建议提高安全措施是将hive设置未 “严格（strict）模式”（where中没有分区过滤就不会提交）

how to set？ set hive.marpre.mode = strict ; (非严格模式， nonstrict)

查看分区：

show partitions *tb_name*
describe extend *tb_name*

表添加分区：

alter table tb_name  add partition( col =  key1, col2 = key2,....., coln = keyn)

exp:

alter table log_message add partition(year = 2012, month =  1, day =20 )  set  location '/logs/2012/1/20'   # 指定位置

删除分区：

alter table tb_name drop  [if exists] partition( col =  key1, col2 = key2,....., coln = keyn)

exp:

alter table log_message drop if exists  partition(year = 2012, month =  1, day =20 )

修改分区表：

修改存储属性：

分区存储 ---------------------> Sequence File

alter table tb_name  PARTITION( key1 = val1, key2 = val2) set FILEFORMAT SEQUENCEFILE

修改SerDe属性：

alter table  tb_name [partition partition_specs] SET SERDEPROPERTIES (property_name=property_value,...)]

注意: SerDe属性存在时，新的会覆盖旧的值

5 数据操作

数据装载：

load data [local] inpath  'file_path ' [overwrite] into table  tb_name [partition  partition_specs]

带有partition时，没有分区会先创建分区目录，再将数据拷贝到该目录下

那么load data其实就是将外部分文件 copy/移动到目标文件目录中，是个copy的操作; local 时本地文件系统的操作。

特别注意： load data local : 是 copy； load data：将数据从一个集群的HDFS中转移到另一个集群的HDFS中

查询插入（insert ..... select )

exp:

insert overwrite table employee partition(country = 'US', state = 'OR') select * from stage_employee se where se.cnty = 'US' and se.st = 'OR'

数据导出

hadoop fs -cp  source_path  target_path

或者 insert .... dictory ...

exp:

insert overwrite local dictory  '/tmp/ca_employees'
select name, salary, address  from empolyees where se.state='CA'

# selet 查询出你要导出的数据， insert ...  dictory....导出到目标路径
# 也可以用多重  insert... dictory ... select 导出多个输出文件目录

example:
FROM stated_employees se
insert overwrite local dictory '/tmp/ca_employees'
    select *  where se.cty = 'US' and se.st = 'OR'
insert overwrite local dictory '/tmp/ca_employee'
    select * where  se.cty = 'US' and se.st = 'CA'

6 HQL查询

select ... where

select where 和MySQL中的基本一致：

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

区别的是：

cluster by col_list: 兼distribute by和sorted by 功能

distribute by col_list：和 sorted by 连用，是控制map的输出在reducer中如何划分的。比如distribute by country , 将country 相同的放入同一个reducer中排序。

sorted by col_list ： sorted by 和 order by功能相似，只不过sorted by是局部排序，在每个reducer中对数据排序.

下面是需要注意的点：

这里需要主义的就是 hive中的复杂结构， array、map和struct:

array: 查询字段为array，返回的是数组/list（值在[...]内，逗号分隔），其中元素符号是加引号的。a[0]访问数组第一个元素（索引从0开始）

map：json表示map（类似Python Dict）；访问某个value使用 col_name[key]

struct: 也是json格式， struct 与C中的结构体相似，使用.运算符访问 (exp: struct名为info，其中有个key为"address", 访问address对应的值： info.address）

where 语句中不能用列别名，但是可以用嵌套的select语句

select e.* from 
	（select name , salary deductions['Federal Texes'] as ded, salary * (1 - dedution['Federal Texes']) as salary_minus_fed_texes from  employees) e
where round(e.salary_minus_fed_texes) >  7000

浮点数比较

当where过滤条件中含有浮点数比较时，可能会出现错误。

例如: 薪水的减免税款超过 20 %. where deductions['Federal Texes'] > 0.2 ，其实是会返回含有 0.2 的数据， why？

因为hive将 0.2保存为double, 而 deductions为 float，而double ----> float 转换后的浮点数 0.20000001 实际上是比 0.2大的。怎么解决/避免这类问题？
1. textfile存储时， hive将字符串 "0.2"转为double，表模式定义为double，两个double就能比较；但是会增加查询内存消耗
2. 将 0.2 显式定义为float , 数值 as float . cast(0.2 as float) : where deductions['Federal Texes'] > cast(0.2 as float )
RLIKE 正则表达时的like

join

hive的join 和 SQL形式也是基本一样.

inner join :

select ... from  a  join b on a.col = b.col  join c on a.col2 = c.col2 where .......

hive对每一对join对象启动一个 MapReduce任务，比如上面这个，先启动一个MapReduce任务来join 表a 和表b，再启动一个MapReduce任务来前一个 MapReduce的输出和表c join。hive 总是按照从左到右顺序执行

lef/right out join：返回符合左表/右表所有符合where条件的内容，另一个表没有的数据为NULL

select ... from a left/right outer join b on a.col = b.col where ....

left semi-join

select  a.col, a. col2  from a  left semi-join b on a.col = b.col where a.col  =  ''  and a.col2 = ''

left semi-join是左半开连接，其select 和where不能含有右表字段；其原理：左表中一条指定的记录在右表中找到匹配的，hive会停止扫描。

笛卡尔集join

select * from a join b

结果是 a数据纬度 × b数据纬度

map-side join

　　针对于一个维度很大的表和维度很小的表，　小表可以加载到内存中，hive就可以在小表中匹配，减少了在reduce的join时间。（后面有必要再好好研究）

分桶和抽样查询

分桶：分桶是表和分区下更细粒度的划分，指定分桶表的列和分桶个数，按照哈希方式（对分桶数模）来提升查询效率.

clustered by(col col_type) into N buckets # 根据某列分桶，分桶数为N。

示例：

create table weblog (
    user_id int , 
    url string,
    source_ip string
)
partitioned by(dt string)
clustered by (user_id string) into  96 buckets

抽样查询：使用rand函数进行抽样。示例：

select * from numbers TABLESAMPLE(BUCKET 3 out of 10  ON rand()  ) ;

7 视图（View， hive 3.0 以后去除了视图，并建议用HBase或者impala）：

视图是逻辑结构，是种虚表，解决复杂查询问题；当查询过于长或者复杂时，使用视图将当前查询分隔为多个可控制的片段。

创建：

create  view  view_name(col1_name, col2_name)  as  select  col1, col2 from .... where ....

示例：

创建一个IT部门员工的视图，下次针对IT部门员工就可以直接在视图基础上操作。employee_info_it_partner: 视图名；

create view employee_info_it_partner(c_id, c_name, age, company, hobby) as 
select cid, use_name, age, info['company'],hobby 
from learn_hive3.cus_user where info['partner'] = 'it';

select c_name from learn_hive3.employee_info_it_partner;

删除：

drop view [IF EXISTS] view_name

修改视图属性：

alter view  view_name set TBPROPERTIES('creator' = 'michael jordan')

查看视图：

show views;

注意： 1、视图不能插入数据，只能修改视图属性; 2、创建视图的select中含有limit数大于创建视图里的limit数，只显示创建里的limit数

查看全文

相关阅读:
CSS3的[att$=val]选择器
 CSS3的[att^=val]选择器
 CSS3的[att*=val]选择器
 CSS3的[att=val]选择器
 web报表工具finereport常用函数的用法总结（数组函数）
web报表工具finereport常用函数的用法总结（数组函数）
人性多面性的终极教材——北漂18年（4）
第23章 MYSQL结构
 11g OS文件移动到ASM
Oracle 10g TAF配置

原文地址：https://www.cnblogs.com/HankCui/p/11806324.html