Hive: Using join


The joins commonly used in Hive are: inner join, left join, right join, full join, left semi join, cross join, and multiple (multi-table) joins.

Create two tables in Hive for testing:
hive> select * from rdb_a;
OK
1    lucy
2    jack
3    tony
hive> select * from rdb_b;
OK
1    12
2    22
4    32
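For readers who want to reproduce the results, the two tables could be set up roughly as follows. This is a sketch, not the original author's DDL: the column names id, name and age come from the queries in this post, while the column types (INT, STRING) are assumptions, and INSERT ... VALUES needs Hive 0.14 or later.

```sql
-- Assumed schema: the types are a guess; only the column names appear in the post.
CREATE TABLE rdb_a (id INT, name STRING);
CREATE TABLE rdb_b (id INT, age INT);

-- Same rows as in the select * output of the two test tables.
INSERT INTO TABLE rdb_a VALUES (1, 'lucy'), (2, 'jack'), (3, 'tony');
INSERT INTO TABLE rdb_b VALUES (1, 12), (2, 22), (4, 32);
```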
I. Basic join usage

1. Inner join ([inner] join): returns only the rows that match on the join key.
select a.id,a.name,b.age from rdb_a a inner join rdb_b b on a.id=b.id;
Total MapReduce CPU Time Spent: 2 seconds 560 msec
OK
1    lucy    12
2    jack    22
Time taken: 47.419 seconds, Fetched: 2 row(s)
2. Left join (left [outer] join): the left table is the base. Every left-table row is returned, and right-table columns with no match are NULL.
select a.id,a.name,b.age from rdb_a a left join rdb_b b on a.id=b.id;
Total MapReduce CPU Time Spent: 1 seconds 240 msec
OK
1    lucy    12
2    jack    22
3    tony    NULL
Time taken: 33.42 seconds, Fetched: 3 row(s)
3. Right join (right [outer] join): the right table is the base. Every right-table row is returned, and left-table columns with no match are NULL.
select a.id,a.name,b.age from rdb_a a right join rdb_b b on a.id=b.id;
Total MapReduce CPU Time Spent: 2 seconds 130 msec
OK
1       lucy    12
2       jack    22
NULL    NULL    32
Time taken: 32.7 seconds, Fetched: 3 row(s)
4. Full join (full [outer] join): both tables serve as the base. Matched rows appear once, unmatched rows from either table are also returned, and the columns that could not be matched are NULL.
select a.id,a.name,b.age from rdb_a a full join rdb_b b on a.id=b.id;
Total MapReduce CPU Time Spent: 5 seconds 540 msec
OK
1       lucy    12
2       jack    22
3       tony    NULL
NULL    NULL    32
Time taken: 42.938 seconds, Fetched: 4 row(s)
5. left semi join: the table before the LEFT SEMI JOIN keyword is the main table. The query returns the main-table rows whose key also exists in the secondary table.
select a.id,a.name from rdb_a a left semi join rdb_b b on a.id=b.id;
Total MapReduce CPU Time Spent: 3 seconds 300 msec
OK
1    lucy
2    jack
Time taken: 31.105 seconds, Fetched: 2 row(s)
This is in effect equivalent to:
select a.id,a.name from rdb_a a where a.id in(select b.id from rdb_b b );
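The same result can also be written with a correlated EXISTS subquery, which Hive supports in the WHERE clause from version 0.13 on. A small sketch against the same test tables follows. Note that with LEFT SEMI JOIN the right-hand table may only be referenced in the ON condition, which is why the semi join query selects only columns of rdb_a.

```sql
-- Keep the rows of rdb_a whose id also exists in rdb_b,
-- without returning any column of rdb_b.
SELECT a.id, a.name
FROM rdb_a a
WHERE EXISTS (SELECT b.id FROM rdb_b b WHERE b.id = a.id);
```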
6. Cartesian product join (cross join): returns the Cartesian product of the two tables; no join key needs to be specified.
select a.id,a.name,b.age from rdb_a a cross join rdb_b b;
Total MapReduce CPU Time Spent: 1 seconds 260 msec
OK
1    lucy    12
1    lucy    22
1    lucy    32
2    jack    12
2    jack    22
2    jack    32
3    tony    12
3    tony    22
3    tony    32
Time taken: 24.727 seconds, Fetched: 9 row(s)
II. Common Join and Map Join

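When Hive runs a join on MapReduce, it has two execution strategies: common join and map join. Map join is an optimization of common join: it skips the shuffle and reduce phases, which can greatly reduce job run time when one of the tables is small enough to be held in memory.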

Common Join (also called shuffle join or reduce join)

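Process:

1> A task is launched first, and the mappers read the data of the two tables (X and Y) from HDFS
2> The mapper output then goes through the shuffle
3> Finally, the reducers output the join result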

Drawbacks:
1> There is a shuffle phase, so efficiency is low
2> Every table has to be read from disk, so disk I/O is heavy
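To see a common join in an execution plan, one option is to switch off automatic map join conversion for the session and run EXPLAIN on one of the test queries. This is a sketch: the exact plan text depends on the Hive version and execution engine, but on MapReduce the join operator would be expected in the reduce-side part of the plan.

```sql
-- Disable automatic map join conversion for this session only.
SET hive.auto.convert.join=false;

-- Print the execution plan instead of running the query.
EXPLAIN
SELECT a.id, a.name, b.age
FROM rdb_a a
JOIN rdb_b b ON a.id = b.id;
```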

Map Join

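Process:

1> The map join first uses a local MapReduce task to convert the small table being joined into hash table files, then loads them into the distributed cache
2> The mappers read the small-table data from the cache and join it with the big table data
3> The map phase outputs the result directly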

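Advantage: there is no shuffle or reduce phase, so efficiency improves

Drawback: because the small table is loaded entirely into memory, memory requirements go up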
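A map join can also be requested explicitly with the MAPJOIN hint, naming the table to be loaded into memory. The sketch below treats rdb_b as the small table, which is an assumption made for illustration since both test tables are tiny. Since Hive 0.11 the hint is ignored by default, so hive.ignore.mapjoin.hint has to be set to false for it to take effect.

```sql
-- Let Hive honor the MAPJOIN hint (it is ignored by default since Hive 0.11).
SET hive.ignore.mapjoin.hint=false;

-- Ask Hive to load rdb_b into memory and join on the map side.
SELECT /*+ MAPJOIN(b) */ a.id, a.name, b.age
FROM rdb_a a
JOIN rdb_b b ON a.id = b.id;
```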

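Hive has a dedicated parameter that controls whether a common join is automatically converted into a map join: hive.auto.convert.join.

When hive.auto.convert.join=true, Hive performs the conversion for us automatically.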
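In a session this might look as follows. hive.mapjoin.smalltable.filesize is a related setting that the text does not mention: it is the size threshold in bytes below which a table counts as small enough for conversion. The value 25000000 is its documented default and is shown for illustration, not as a tuning recommendation.

```sql
-- Let Hive convert a common join into a map join when a table is small enough.
SET hive.auto.convert.join=true;

-- Size threshold (bytes) for the small table; 25000000 is the documented default.
SET hive.mapjoin.smalltable.filesize=25000000;

-- No hint needed: the optimizer decides whether to use a map join.
SELECT a.id, a.name, b.age
FROM rdb_a a
JOIN rdb_b b ON a.id = b.id;
```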

Original article: https://www.cnblogs.com/jnba/p/10673747.html