引用:https://blog.csdn.net/strongyoung88/article/details/81156271
谓词下推概念
谓词下推 Predicate Pushdown(PPD)
:简而言之,就是在不影响结果的情况下,尽量将过滤条件提前执行。谓词下推后,过滤条件在map端执行,减少了map端的输出,降低了数据在集群上传输的量,节约了集群的资源,也提升了任务的性能。
PPD 配置
PPD
控制参数:hive.optimize.ppd
- Default Value: true
- Added In: Hive 0.4.0
Push
:谓词下推,可以理解为被优化Not Push
:谓词没有下推,可以理解为没有被优化
实验
实验结果列表形式:
Pushed or Not | SQL |
---|---|
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Not Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Not Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Not Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Not Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
实验结果表格形式:
Join(inner join) | Left Outer Join | Right Outer Join | Full Outer Join | |||||
---|---|---|---|---|---|---|---|---|
Left Table | Right Table | Left Table | Right Table | Left Table | Right Table | Left Table | Right Table | |
Join Predicate | Pushed | Pushed | Not Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Not Pushed |
Where Predicate | Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Pushed | Not Pushed | Not Pushed |
此表实际上就是上述PPD规则表
结论
1、对于Join(Inner Join)、Full outer Join,条件写在on后面,还是where后面,性能上面没有区别;
2、对于Left outer Join ,右侧的表写在on后面、左侧的表写在where后面,性能上有提高;
3、对于Right outer Join,左侧的表写在on后面、右侧的表写在where后面,性能上有提高;
4、当条件分散在两个表时,谓词下推可按上述结论2和3自由组合,情况如下:
SQL | 过滤时机 |
---|---|
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001' and D.dept_id = 'D001'); |
dept_id在map端过滤,eid在reduce端过滤 |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id = 'D001') where E.eid='HZ001'; |
dept_id,eid都在map端过滤 |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') where D.dept_id = 'D001'; |
dept_id,eid都在reduce端过滤 |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id ) where E.eid='HZ001' and D.dept_id = 'D001'; |
dept_id在reduce端过滤,eid在map端过滤 |
注意:如果在表达式中含有不确定函数,整个表达式的谓词将不会被pushed,例如
select a.* from a join b on a.id = b.id where a.ds = '2019-10-09' and a.create_time = unix_timestamp();
因为unix_timestamp
是不确定函数,在编译的时候无法得知,所以,整个表达式不会被pushed,即ds='2019-10-09'
也不会被提前过滤。类似的不确定函数还有rand()
等。