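Hive Bucketed Tables
A Hive Table or Partition can be further organized into buckets; a Bucket is a finer-grained division of the data. Hive hashes the specified column and takes the hash value modulo the number of buckets; the remainder determines which bucket a record is stored in.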
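The hash-and-remainder rule can be sketched in a few lines of Python. This is a minimal illustration, not Hive's own code: the function names are hypothetical, and it assumes the classic bucketing scheme in which an INT column hashes to its own value and the sign bit is masked off before the remainder is taken. Newer Hive versions may use a different hash function.

```python
INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE


def int_hash(value):
    """Assumed hash of an INT column: the value itself (classic scheme)."""
    return value


def bucket_for(hash_value, num_buckets):
    """Return the 0-based bucket index for a hash value."""
    # Masking clears the sign bit, so the remainder is never negative.
    return (hash_value & INT_MAX) % num_buckets


# Assign ids 1..8 to 4 buckets.
assignments = {i: bucket_for(int_hash(i), 4) for i in range(1, 9)}
print(assignments)
# {1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3, 8: 0}
```

Because the bucket depends only on the column value and the bucket count, two tables bucketed on the same column with the same count place equal keys in buckets with the same index.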

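Advantage 1: more efficient query processing. Buckets impose extra structure on a table, and Hive can exploit that structure for certain queries. In particular, a join between two tables that are bucketed on the same column can be executed as an efficient Map-side Join.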
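Why matching buckets help a join can be shown with a small Python sketch. It is only an illustration of the idea, not how Hive implements it: `bucketize` and `bucket_join` are hypothetical helpers, the `scores` table is made-up sample data, and keys are assumed to be non-negative integers that hash to themselves.

```python
from collections import defaultdict


def bucketize(rows, num_buckets):
    """Split rows into buckets by (join key in column 0) mod num_buckets."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[0] % num_buckets].append(row)
    return buckets


def bucket_join(left, right, num_buckets):
    """Join on column 0, pairing bucket b of one table with bucket b of the other."""
    left_b = bucketize(left, num_buckets)
    right_b = bucketize(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Only one bucket of the right-hand table is held in memory at a time.
        lookup = defaultdict(list)
        for row in right_b[b]:
            lookup[row[0]].append(row)
        for row in left_b[b]:
            for match in lookup.get(row[0], []):
                joined.append(row + match[1:])
    return joined


students = [(1, "zxm"), (2, "ljz"), (3, "cds"), (4, "mac")]
scores = [(2, 88), (3, 75), (4, 91), (5, 60)]  # hypothetical second table

print(sorted(bucket_join(students, scores, 2)))
# [(2, 'ljz', 88), (3, 'cds', 75), (4, 'mac', 91)]
```

Rows with equal keys always fall into buckets with the same index, so no bucket ever needs to be compared against a bucket with a different index.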

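Advantage 2: more efficient sampling. Sampling can always be done over the whole data set, but that is inefficient because every row still has to be read. If the table is bucketed on some column, a sample can instead be taken by reading only the buckets with the requested index, which reduces the amount of data accessed.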

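Drawback: queries that filter on ordinary business fields (columns other than the bucketing column) gain little or nothing from bucketing.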

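1). Set the configuration property

This lets Hive automatically choose the number of reducers to match the number of buckets. (In Hive 2.0 and later the property was removed and bucketing is always enforced.)
set hive.enforce.bucketing = true;
2). Create the bucketed table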

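Use the CLUSTERED BY clause to specify the column(s) to bucket on and the number of buckets. Rows within a bucket can additionally be sorted on one or more columns with SORTED BY (ascending by default; specify DESC for descending order). With sorted buckets, joining each pair of buckets becomes an efficient merge join (as in merge sort), which further speeds up map-side joins. Note that CLUSTERED BY and SORTED BY in the DDL only describe the table layout; it is the insert step that must actually produce bucketed, sorted data.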
hive> create table student0(id INT, age INT, name STRING)
    > partitioned by(stat_date STRING)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.292 seconds
hive> create table student1(id INT, age INT, name STRING)
    > partitioned by(stat_date STRING)
    > clustered by(id) sorted by(age) into 2 buckets
    > row format delimited fields terminated by ',';
OK
Time taken: 0.215 seconds
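3). Load the data

The bucketed table student1 is populated with From ... Insert ... Select, which runs a MapReduce job; the plain table student0 is populated with Load, which does not start a MapReduce job.

In fact, the data files of a bucketed table are the output files of the reducers: bucket n corresponds to output file 00000n_0 (bucket 0 is 000000_0, bucket 1 is 000001_0).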
[root@hadoop01 hive]# more bucket.txt
1,20,zxm
2,21,ljz
3,19,cds
4,18,mac
5,22,android
6,23,symbian
7,25,wp
hive> LOAD data local INPATH '/root/hive/bucket.txt'
    > OVERWRITE INTO TABLE student0
    > partition(stat_date="20120802");
hive> from student0
    > insert overwrite table student1 partition(stat_date="20120802")
    > select id,age,name where stat_date="20120802"
    > sort by age;
4). Inspect the file directory
hive> dfs -ls /user/hive/warehouse/student1/stat_date=20120802;
Found 2 items
-rw-r--r-- 1 root supergroup 31 2015-08-17 21:23 /user/hive/warehouse/student1/stat_date=20120802/000000_0
-rw-r--r-- 1 root supergroup 39 2015-08-17 21:23 /user/hive/warehouse/student1/stat_date=20120802/000001_0
hive> dfs -text /user/hive/warehouse/student1/stat_date=20120802/000000_0;
6,23,symbian
2,21,ljz
4,18,mac
hive> dfs -text /user/hive/warehouse/student1/stat_date=20120802/000001_0;
7,25,wp
5,22,android
1,20,zxm
3,19,cds
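The listing above can be cross-checked with a short Python sketch that replays the bucket assignment for the seven rows of bucket.txt. It assumes that an INT id hashes to itself (so the bucket is id mod 2), that every line ends with a single newline, and that the file name is the zero-padded bucket index followed by _0; `bucket_file` is a hypothetical helper, not a Hive function.

```python
rows = [
    (1, 20, "zxm"), (2, 21, "ljz"), (3, 19, "cds"), (4, 18, "mac"),
    (5, 22, "android"), (6, 23, "symbian"), (7, 25, "wp"),
]
NUM_BUCKETS = 2


def bucket_file(bucket_index):
    """Assumed naming: six-digit bucket index plus the suffix _0."""
    return f"{bucket_index:06d}_0"


ids_per_file = {}
bytes_per_file = {}
for id_, age, name in rows:
    fname = bucket_file(id_ % NUM_BUCKETS)
    ids_per_file.setdefault(fname, []).append(id_)
    line = f"{id_},{age},{name}\n"  # one comma-delimited line per row
    bytes_per_file[fname] = bytes_per_file.get(fname, 0) + len(line)

for fname in sorted(ids_per_file):
    print(fname, sorted(ids_per_file[fname]), bytes_per_file[fname])
# 000000_0 [2, 4, 6] 31
# 000001_0 [1, 3, 5, 7] 39
```

The even ids land in 000000_0 and the odd ids in 000001_0, and the computed sizes of 31 and 39 bytes agree with the dfs -ls output. The sketch says nothing about the order of rows inside each file.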
5). Query a sample with tablesample
hive> select * from student1
    > TableSample(bucket 1 out of 2 on id);
OK
6    23    symbian    20120802
2    21    ljz        20120802
4    18    mac        20120802
Time taken: 10.871 seconds, Fetched: 3 row(s)
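Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y)

y must be a multiple or a factor of the table's bucket count. Hive uses y to decide what fraction of the data to sample. For example, with 64 buckets:
- y=32 samples 64/32 = 2 buckets of data
- y=64 samples 64/64 = 1 bucket of data (the student1 example above also samples exactly 1 bucket)
- y=128 samples 64/128 = 1/2 of a bucket of data
x is the bucket at which sampling starts. For example, with 64 buckets, tablesample(bucket 3 out of 32) means:
- 64/32 = 2 buckets are sampled in total: bucket 3 and bucket 3+32 = 35.
- In the student1 example (2 buckets, bucket 1 out of 2), 2/2 = 1 bucket is sampled, and it is the first bucket.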
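The arithmetic above can be written out as a small Python sketch. Both function names are hypothetical; `sampled_buckets` only covers the case where y is a factor of the bucket count, since only then does the sample consist of whole buckets.

```python
from fractions import Fraction


def sampled_fraction(num_buckets, y):
    """Number of buckets' worth of data read by BUCKET x OUT OF y."""
    return Fraction(num_buckets, y)


def sampled_buckets(num_buckets, x, y):
    """1-based bucket numbers read when y is a factor of num_buckets."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y != 0:
        raise ValueError("this sketch only handles y that divides the bucket count")
    # Start at bucket x and step by y until the buckets run out.
    return list(range(x, num_buckets + 1, y))


print(sampled_fraction(64, 32), sampled_fraction(64, 64), sampled_fraction(64, 128))
# 2 1 1/2
print(sampled_buckets(64, 3, 32))
# [3, 35]
print(sampled_buckets(2, 1, 2))
# [1]
```

The last call corresponds to the student1 query: with 2 buckets, bucket 1 out of 2 reads only the first bucket, which is why the query returns the three rows with even ids.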
Original article: https://www.cnblogs.com/skyl/p/4737847.html