Hive的数据模型之分区表
准备数据表:
create table sampledata (sid int, sname string, gender string, language int, math int, english int) row format delimited fields terminated by ',' stored as textfile;
准备文本数据:
sampledata.txt 1,Tom,M,60,80,96 2,Mary,F,11,22,33 3,Jerry,M,90,11,23 4,Rose,M,78,77,76 5,Mike,F,99,98,98
将文本数据插入到数据表:
hive> load data local inpath '/root/pl62716/hive/sampledata.txt' into table sampledata;
-partition对应于数据库中的Partition 列的密集索引
-在Hive中,表中的一个Partition对应于表下的一个目录,所有的Partition的数据都存储在对应的目录中。
创建分区表:
create table partition_table (sid int, sname string) partitioned by (gender string) row format delimited fields terminated by ',';
向分区表中插入数据:
hive> insert into table partition_table partition(gender='M') select sid, sname from sampledata where gender='M'; hive> insert into table partition_table partition(gender='F') select sid, sname from sampledata where gender='F';
从内部表解析比从分区表解析效率低:
(1)内部表:
hive> explain select * from sampledata where gender='M'; OK STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: TableScan alias: sampledata Statistics: Num rows: 1 Data size: 90 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (gender = 'M') (type: boolean) Statistics: Num rows: 1 Data size: 90 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: sid (type: int), sname (type: string), 'M' (type: string), language (type: int), math (type: int), english (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 1 Data size: 90 Basic stats: COMPLETE Column stats: NONE ListSink Time taken: 0.046 seconds, Fetched: 20 row(s)
(2)分区表:
hive> explain select * from partition_table where gender='M'; OK STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: TableScan alias: partition_table Statistics: Num rows: 2 Data size: 13 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: sid (type: int), sname (type: string), 'M' (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 2 Data size: 13 Basic stats: COMPLETE Column stats: NONE ListSink Time taken: 0.187 seconds, Fetched: 17 row(s)