1.
2. Tables
2.1 Common commands
View table description # desc formatted xxx
Create a table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment] -- table description
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] -- partition columns
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...) ON ((col_value, col_value, ...), ...) [STORED AS DIRECTORIES]]
  [
    [ROW FORMAT row_format]
    [STORED AS file_format]
    | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path] -- HDFS location of the data files; if omitted, defaults to /user/hive/warehouse/<database_name>.db/<table_name>
  [AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type -- (Note: Available in Hive 0.7.0 and later)
primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP -- (Note: Available in Hive 0.8.0 and later)
array_type
  : ARRAY < data_type >
map_type
  : MAP < primitive_type, data_type >
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
  : UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)
row_format
  : DELIMITED
      [FIELDS TERMINATED BY char [ESCAPED BY char]] -- field (column) delimiter
      [COLLECTION ITEMS TERMINATED BY char] -- delimiter between items of a collection
      [MAP KEYS TERMINATED BY char] -- delimiter between a map key and its value
      [LINES TERMINATED BY char] -- line (row) delimiter; defaults to '\n'
      [NULL DEFINED AS char] -- string that represents NULL values
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format
  : SEQUENCEFILE
  | TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE -- (Note: Available in Hive 0.6.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
constraint_specification
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]
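To make the grammar above more concrete, here is a minimal sketch of a delimited text table. The database demo_db, the table emp, and all column names are hypothetical and chosen only for illustration; the database is assumed to exist already.

```sql
-- Hypothetical names throughout; demo_db is assumed to exist.
CREATE TABLE IF NOT EXISTS demo_db.emp (
  id      INT              COMMENT 'employee id',
  name    STRING           COMMENT 'employee name',
  skills  ARRAY<STRING>    COMMENT 'list of skills',
  scores  MAP<STRING, INT> COMMENT 'score per subject'
)
COMMENT 'employee demo table'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'           -- columns separated by tabs
  COLLECTION ITEMS TERMINATED BY ','  -- array items / map entries separated by commas
  MAP KEYS TERMINATED BY ':'          -- map key and value separated by a colon
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```

With this layout, a data line would look like `1<TAB>tom<TAB>java,sql<TAB>math:90,english:80`.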
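Loading data
# LOAD DATA [LOCAL] INPATH 'file_path' [OVERWRITE] INTO TABLE target_table
With [LOCAL], the file is copied from the local filesystem. Without it, the file is taken from HDFS and moved, so it disappears from its original location. Keep this in mind.
With [OVERWRITE], the existing data in the table is replaced. Without it, the new data is appended.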
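The two switches can be combined independently. The sketch below shows one local append and one HDFS overwrite; the paths and the table demo_db.emp are hypothetical and assumed to exist.

```sql
-- Copy a file from the local filesystem and append it to the table.
-- The local source file is left in place.
LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt'
INTO TABLE demo_db.emp;

-- Move a file that is already on HDFS and replace the table's current data.
-- After this statement the file no longer exists at /tmp/staging/emp.txt.
LOAD DATA INPATH '/tmp/staging/emp.txt'
OVERWRITE INTO TABLE demo_db.emp;
```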
# insert into table table_name [partition spec]
select columns... from source_table;
Syntax requirement: the target table must already exist.
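A minimal sketch of this form, assuming a hypothetical source table demo_db.emp with id and name columns; the target table emp_names is created first because INSERT INTO requires it to exist.

```sql
-- Hypothetical target table; it must exist before the insert.
CREATE TABLE IF NOT EXISTS demo_db.emp_names (
  id   INT,
  name STRING
);

-- Append the query result to the existing target table.
INSERT INTO TABLE demo_db.emp_names
SELECT id, name
FROM demo_db.emp;
```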
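# create table xxx as select .....
Creates a table directly from the result of a select query.
Syntax requirements:
The target table must not already exist.
The target table cannot be a partitioned table, an external table, or a list-bucketing table.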
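As a sketch of create-table-as-select, again assuming the hypothetical source table demo_db.emp; the new table's name and the filter are illustrative only.

```sql
-- The target table must not exist yet; its schema is derived from the select list.
CREATE TABLE demo_db.emp_subset AS
SELECT id, name
FROM demo_db.emp
WHERE id > 100;
```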
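2.2 External tables and managed (internal) tables
Managed table (MANAGED_TABLE)
Hive manages both the metadata and the underlying data of a managed table. Dropping the table deletes both.
External table (EXTERNAL)
Hive manages only the metadata of an external table. Dropping the table deletes only the metadata and leaves the underlying data untouched.
When data is loaded through an import statement, the data files are stored by default under /user/hive/warehouse/<database_name>.db/<table_name> (the table's own directory).
Files placed directly into the table's directory on HDFS are read correctly by both managed and external tables. For a managed table, however, dropping the table also deletes those manually placed files, because the whole table directory is removed.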
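The difference in drop behaviour can be seen with a small external table. This is a sketch only: the table name, columns, and the HDFS path /data/external/emp are hypothetical.

```sql
-- External table pointing at an existing HDFS directory (hypothetical path).
CREATE EXTERNAL TABLE IF NOT EXISTS demo_db.emp_ext (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/external/emp';

-- The "Table Type" field in the output reads EXTERNAL_TABLE
-- (MANAGED_TABLE for a managed table).
DESCRIBE FORMATTED demo_db.emp_ext;

-- Removes only the metadata; the files under /data/external/emp stay on HDFS.
DROP TABLE demo_db.emp_ext;
```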
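2.3 Partitioned tables
2.3.1 Overview
On HDFS, each partition appears as a subdirectory of the table directory.
The main benefit of partitioning is reduced I/O: when a query filters on a partition column, Hive can skip reading and processing entire partitions that do not match. On large datasets, a well-chosen partitioning scheme can significantly improve query performance.
2.3.2 Creating a partitioned table
create table order_partition ..... (regular table definition)
partitioned by (partition_column partition_column_type, ...)
Syntax requirement: a partition column must not have the same name as any non-partition column.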
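A possible full definition for order_partition is sketched below. The partition column event_month matches the load statements in these notes; the regular columns order_number and event_time are assumptions made for illustration.

```sql
-- order_number and event_time are assumed columns; event_month is the partition column.
CREATE TABLE IF NOT EXISTS order_partition (
  order_number STRING,
  event_time   STRING
)
PARTITIONED BY (event_month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Filtering on the partition column lets Hive read only the
-- event_month=2014-05 subdirectory instead of the whole table.
SELECT order_number, event_time
FROM order_partition
WHERE event_month = '2014-05';
```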
2.3.3 Static partitioning
With static partitioning, the target partition is specified by hand and the data is placed directly into it.
LOAD DATA LOCAL INPATH '/home/hadoop/data/order.txt'
OVERWRITE INTO TABLE order_partition
PARTITION(event_month='2014-05');
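Static partitions can also be filled from a query, and the result checked with SHOW PARTITIONS. In this sketch the staging table order_staging and its columns are hypothetical.

```sql
-- The partition value is a literal, so this is still a static-partition insert.
INSERT OVERWRITE TABLE order_partition PARTITION (event_month = '2014-06')
SELECT order_number, event_time
FROM order_staging;

-- Lists the partitions that now exist, one line per partition (event_month=...).
SHOW PARTITIONS order_partition;
```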
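2.3.4 Dynamic partitioning
With dynamic partitioning, the partition value is not written as a literal in the statement. Hive derives it from the trailing column(s) of the query result in an INSERT ... SELECT. A LOAD DATA statement with a literal PARTITION value, such as the one in 2.3.3, is always static.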
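As a rough sketch of a dynamic-partition load into order_partition: the staging table order_staging, its columns, and the assumption that event_time is a string starting with yyyy-MM are all hypothetical.

```sql
-- Enable dynamic partitioning; nonstrict mode allows every partition column to be dynamic.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- No literal value is given for event_month. Hive takes it from the last
-- column of the select list and creates one partition per distinct value.
INSERT OVERWRITE TABLE order_partition PARTITION (event_month)
SELECT order_number,
       event_time,
       substr(event_time, 1, 7) AS event_month  -- assumes values like '2014-05-01 06:01:12'
FROM order_staging;
```

The dynamic partition column must come last in the select list, because Hive matches it by position rather than by name.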