哪种方法是找出在一张表中,而不在另一张表中的记录的最佳方法呢?
SELECT l.*
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
WHERE r.value IS NULL;
SELECT l.*
FROM t_left l
WHERE l.value NOT IN
(
SELECT value
FROM t_right r
);
SELECT l.*
FROM t_left l
WHERE NOT EXISTS
(
SELECT NULL
FROM t_right r
WHERE r.value = l.value
);
我们先把环境准备一下:
postgres 11.9
CREATE TABLE t_left (
id INT NOT NULL PRIMARY KEY,
value INT NOT NULL,
stuffing VARCHAR(200) NOT NULL
);
CREATE TABLE t_right (
id INT NOT NULL PRIMARY KEY,
value INT NOT NULL,
stuffing VARCHAR(200) NOT NULL
);
CREATE INDEX ix_left_value ON t_left (value); CREATE INDEX ix_right_value ON t_right (value);
INSERT
INTO t_left
SELECT s, s % 10000, RPAD('Value '|| s || ' ', 200, '*')
FROM generate_series(1, 100000) s;
INSERT
INTO t_right
SELECT s, s % 10000 + 1, RPAD('Value '|| s || ' ', 200, '*')
FROM generate_series(1, 1000000) s;
ANALYZE t_left; ANALYZE t_right;
表t_left包含100000行记录,其中有1000个不同的记录;
只是测试
表t_right包含100000行记录,其中有1000个不同的记录。
t_left表有10条记录不存在与t_right表中。
NOT IN
先看看not in
postgres=# SELECT l.id, l.value
postgres-# FROM t_left l
postgres-# WHERE value NOT IN
postgres-# (
postgres(# SELECT value
postgres(# FROM t_right r
postgres(# );
id | value
--------+-------
20000 | 0
10000 | 0
60000 | 0
30000 | 0
40000 | 0
50000 | 0
70000 | 0
80000 | 0
90000 | 0
100000 | 0
(10 rows)
postgres=# explain analyze SELECT l.id, l.value
postgres-# FROM t_left l
postgres-# WHERE value NOT IN
postgres-# (
postgres(# SELECT value
postgres(# FROM t_right r
postgres(# );
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..1520921531.00 rows=50000 width=8) (actual time=7329.726..40554.033 rows=10 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on t_left l (cost=0.00..1520915531.00 rows=29412 width=8) (actual time=5506.368..40405.539 rows=5 loops=2)
Filter: (NOT (SubPlan 1))
Rows Removed by Filter: 49995
SubPlan 1
-> Materialize (cost=0.00..49211.00 rows=1000000 width=4) (actual time=0.001..0.534 rows=4084 loops=100000)
-> Seq Scan on t_right r (cost=0.00..40304.00 rows=1000000 width=4) (actual time=0.002..0.575 rows=3739 loops=37433)
Planning Time: 0.155 ms
Execution Time: 40557.112 ms
(11 rows)
postgres=#
这里之所以先讨论not in,是因为其对待null的方式特殊。
不幸的是,postgresql优化器不能利用到t_right.value被定义为not null的事实。因此,不可以返回null值。(即not in不能返回null值)
这里可以做一个小测试:
postgres=# create table aa(id int,age int); postgres=# insert into aa values(1,1); postgres=# insert into aa values(2,2); postgres=# insert into aa(id) values(3); postgres=# select * from aa where age not in (1,2); id | age ----+----- (0 rows) postgres=# select * from aa where age not in (1); id | age ----+----- 2 | 2 (1 row) postgres=#
这就是为什么postgresql对not in查询使用一个特殊的访问方法。
NOT EXISTS
postgres=# SELECT l.id, l.value
postgres-# FROM t_left l
postgres-# WHERE NOT EXISTS
postgres-# (
postgres(# SELECT select
postgres(# FROM t_right r
postgres(# WHERE r.value = l.value
postgres(# );
id | value
--------+-------
10000 | 0
30000 | 0
20000 | 0
40000 | 0
60000 | 0
50000 | 0
80000 | 0
70000 | 0
90000 | 0
100000 | 0
(10 rows)
postgres=# explain analyze SELECT l.id, l.value
postgres-# FROM t_left l
postgres-# WHERE NOT EXISTS
postgres-# (
postgres(# SELECT value
postgres(# FROM t_right r
postgres(# WHERE r.value = l.value
postgres(# );
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.42..34410.22 rows=10 width=8) (actual time=31.555..262.307 rows=10 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Nested Loop Anti Join (cost=0.42..33409.22 rows=6 width=8) (actual time=43.314..257.792 rows=5 loops=2)
-> Parallel Seq Scan on t_left l (cost=0.00..3619.24 rows=58824 width=8) (actual time=0.007..10.066 rows=50000 loops=2)
-> Index Only Scan using ix_right_value on t_right r (cost=0.42..3.53 rows=100 width=4) (actual time=0.005..0.005 rows=1 loops=100000)
Index Cond: (value = l.value)
Heap Fetches: 99990
Planning Time: 0.328 ms
Execution Time: 262.348 ms
(10 rows)
postgres=#
LEFT JOIN / IS NULL
postgres=# SELECT l.id, l.value
postgres-# FROM t_left l
postgres-# WHERE NOT EXISTS
postgres-# (
postgres(# SELECT value
postgres(# FROM t_right r
postgres(# WHERE r.value = l.value
postgres(# );
id | value
--------+-------
80000 | 0
10000 | 0
20000 | 0
30000 | 0
40000 | 0
50000 | 0
60000 | 0
70000 | 0
90000 | 0
100000 | 0
(10 rows)
postgres=# explain analyze SELECT l.id, l.value
postgres-# FROM t_left l
postgres-# WHERE NOT EXISTS
postgres-# (
postgres(# SELECT value
postgres(# FROM t_right r
postgres(# WHERE r.value = l.value
postgres(# );
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.42..34410.22 rows=10 width=8) (actual time=32.109..259.553 rows=10 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Nested Loop Anti Join (cost=0.42..33409.22 rows=6 width=8) (actual time=43.744..255.254 rows=5 loops=2)
-> Parallel Seq Scan on t_left l (cost=0.00..3619.24 rows=58824 width=8) (actual time=0.008..9.777 rows=50000 loops=2)
-> Index Only Scan using ix_right_value on t_right r (cost=0.42..3.53 rows=100 width=4) (actual time=0.005..0.005 rows=1 loops=100000)
Index Cond: (value = l.value)
Heap Fetches: 99990
Planning Time: 0.396 ms
Execution Time: 259.608 ms
(10 rows)
postgres=#
postgresql按照相同的方式对待left join和not exists,使用相同的执行计划(nested loop anti join)。
至于NOT IN,这在语义上是不同的, PostgreSQL试图考虑这一点,并限制自己对子计划使用过滤器。