zoukankan      html  css  js  c++  java
  • PostgreSQL中对IN、EXISTS、ANY/ALL、JOIN的sql优化

    测试环境:

    postgres=# select version();              
                                                     version                                                 
    ---------------------------------------------------------------------------------------------------------
     PostgreSQL 11.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
    (1 row)
    
    postgres=# 
    

      

    数据准备:

    $ pgbench -i -s 10
    
    postgres=# d
                  List of relations
     Schema |       Name       | Type  |  Owner   
    --------+------------------+-------+----------
     public | pgbench_accounts | table | postgres
     public | pgbench_branches | table | postgres
     public | pgbench_history  | table | postgres
     public | pgbench_tellers  | table | postgres
    (4 rows)
    
    postgres=# select * from pgbench_accounts limit 1;
     aid | bid | abalance |                                        filler                                        
    -----+-----+----------+--------------------------------------------------------------------------------------
       1 |   1 |        0 |                                                                                     
    (1 row)
    
    postgres=# select * from pgbench_branches limit 1;
     bid | bbalance | filler 
    -----+----------+--------
       1 |        0 | 
    (1 row)
    
    postgres=# select * from pgbench_history limit 1;
     tid | bid | aid | delta | mtime | filler 
    -----+-----+-----+-------+-------+--------
    (0 rows)
    
    postgres=# select * from pgbench_tellers limit 1;
     tid | bid | tbalance | filler 
    -----+-----+----------+--------
       1 |   1 |        0 | 
    (1 row)
    
    postgres=# select * from pgbench_branches;
     bid | bbalance | filler 
    -----+----------+--------
       1 |        0 | 
       2 |        0 | 
       3 |        0 | 
       4 |        0 | 
       5 |        0 | 
       6 |        0 | 
       7 |        0 | 
       8 |        0 | 
       9 |        0 | 
      10 |        0 | 
    (10 rows)
    
    postgres=# update pgbench_branches set bbalance=4500000 where bid in (4,7);
    UPDATE 2
    postgres=# 
    

     

    IN语句

    查询要求:找出那些余额(balance)大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户

    1.使用IN子句

     

    SELECT
       count( aid ),bid 
    FROM
       pgbench_accounts 
    WHERE
       bid IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 ) 
    GROUP BY
       bid;
    

    2.使用ANY子句

    SELECT
       count( aid ),bid 
    FROM
       pgbench_accounts 
    WHERE
       bid = ANY ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 ) 
    GROUP BY
       bid;
    

      

    3.使用EXISTS子句

    SELECT
       count( aid ),bid 
    FROM
       pgbench_accounts 
    WHERE
       EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid ) 
    GROUP BY
       bid;
    

      

    4.使用INNER JOIN

    SELECT
       count( aid ),a.bid 
    FROM
       pgbench_accounts a
       JOIN pgbench_branches b ON a.bid = b.bid 
    WHERE
       b.bbalance > 0 
    GROUP BY
       a.bid;
    

    在完成这个查询要求的时候,有人可能会假设exists和inner join性能可能会更好,因为他们可以使用两表连接的逻辑和优化。而IN和ANY子句需要使用子查询。

    然而,PostgreSQL(10版本之后)已经智能的足以对上面四种写法产生相同的执行计划!

    所有上面的写法都会产生相同的执行计划:

                                                                                QUERY PLAN                                                                            
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Finalize GroupAggregate  (cost=23327.73..23330.26 rows=10 width=12) (actual time=97.199..99.014 rows=2 loops=1)
       Group Key: a.bid
       ->  Gather Merge  (cost=23327.73..23330.06 rows=20 width=12) (actual time=97.191..99.006 rows=6 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Sort  (cost=22327.70..22327.73 rows=10 width=12) (actual time=93.762..93.766 rows=2 loops=3)
                   Sort Key: a.bid
                   Sort Method: quicksort  Memory: 25kB
                   Worker 0:  Sort Method: quicksort  Memory: 25kB
                   Worker 1:  Sort Method: quicksort  Memory: 25kB
                   ->  Partial HashAggregate  (cost=22327.44..22327.54 rows=10 width=12) (actual time=93.723..93.727 rows=2 loops=3)
                         Group Key: a.bid
                         ->  Hash Join  (cost=1.14..22119.10 rows=41667 width=8) (actual time=24.024..83.263 rows=66667 loops=3)
                               Hash Cond: (a.bid = b.bid)
                               ->  Parallel Seq Scan on pgbench_accounts a  (cost=0.00..20560.67 rows=416667 width=8) (actual time=0.023..43.151 rows=333333 loops=3)
                               ->  Hash  (cost=1.12..1.12 rows=1 width=4) (actual time=0.027..0.028 rows=2 loops=3)
                                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                     ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=1 width=4) (actual time=0.018..0.020 rows=2 loops=3)
                                           Filter: (bbalance > 0)
                                           Rows Removed by Filter: 8
     Planning Time: 0.342 ms
     Execution Time: 99.164 ms
    (22 rows)
    

    那么,我们是否可以得出这样的结论:我们可以随意地编写查询,而PostgreSQL的智能将会处理其余的问题?!

    等等!

    如果我们考虑排除情况,事情会变得不同。

     

    排除查询

    查询要求:找出那些余额(balance)不大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户

    1.使用NOT IN

    SELECT
       count( aid ),bid 
    FROM
       pgbench_accounts 
    WHERE
       bid NOT IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 ) 
    GROUP BY
       bid;
    

    执行计划:

                                                                            QUERY PLAN                                                                        
    ----------------------------------------------------------------------------------------------------------------------------------------------------------
     Finalize GroupAggregate  (cost=23645.42..23647.95 rows=10 width=12) (actual time=128.606..130.502 rows=8 loops=1)
       Group Key: pgbench_accounts.bid
       ->  Gather Merge  (cost=23645.42..23647.75 rows=20 width=12) (actual time=128.598..130.490 rows=24 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Sort  (cost=22645.39..22645.42 rows=10 width=12) (actual time=124.960..124.963 rows=8 loops=3)
                   Sort Key: pgbench_accounts.bid
                   Sort Method: quicksort  Memory: 25kB
                   Worker 0:  Sort Method: quicksort  Memory: 25kB
                   Worker 1:  Sort Method: quicksort  Memory: 25kB
                   ->  Partial HashAggregate  (cost=22645.13..22645.23 rows=10 width=12) (actual time=124.917..124.920 rows=8 loops=3)
                         Group Key: pgbench_accounts.bid
                         ->  Parallel Seq Scan on pgbench_accounts  (cost=1.13..21603.46 rows=208333 width=8) (actual time=0.078..83.134 rows=266667 loops=3)
                               Filter: (NOT (hashed SubPlan 1))
                               Rows Removed by Filter: 66667
                               SubPlan 1
                                 ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=1 width=4) (actual time=0.020..0.021 rows=2 loops=3)
                                       Filter: (bbalance > 0)
                                       Rows Removed by Filter: 8
     Planning Time: 0.310 ms
     Execution Time: 130.620 ms
    (21 rows)
    
    postgres=# 
    

    2.使用<>ALL

    SELECT
       count( aid ),bid 
    FROM
       pgbench_accounts 
    WHERE
       bid <> ALL ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 ) 
    GROUP BY
       bid;
    

    执行计划:

                                                                             QUERY PLAN                                                                         
    ------------------------------------------------------------------------------------------------------------------------------------------------------------
     Finalize GroupAggregate  (cost=259581.79..259584.32 rows=10 width=12) (actual time=418.220..419.913 rows=8 loops=1)
       Group Key: pgbench_accounts.bid
       ->  Gather Merge  (cost=259581.79..259584.12 rows=20 width=12) (actual time=418.212..419.902 rows=24 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Sort  (cost=258581.76..258581.79 rows=10 width=12) (actual time=413.906..413.909 rows=8 loops=3)
                   Sort Key: pgbench_accounts.bid
                   Sort Method: quicksort  Memory: 25kB
                   Worker 0:  Sort Method: quicksort  Memory: 25kB
                   Worker 1:  Sort Method: quicksort  Memory: 25kB
                   ->  Partial HashAggregate  (cost=258581.50..258581.60 rows=10 width=12) (actual time=413.872..413.875 rows=8 loops=3)
                         Group Key: pgbench_accounts.bid
                         ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..257539.83 rows=208333 width=8) (actual time=0.054..367.244 rows=266667 loops=3)
                               Filter: (SubPlan 1)
                               Rows Removed by Filter: 66667
                               SubPlan 1
                                 ->  Materialize  (cost=0.00..1.13 rows=1 width=4) (actual time=0.000..0.001 rows=2 loops=1000000)
                                       ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=1 width=4) (actual time=0.001..0.001 rows=2 loops=337880)
                                             Filter: (bbalance > 0)
                                             Rows Removed by Filter: 8
     Planning Time: 0.218 ms
     Execution Time: 420.035 ms
    (22 rows)
    
    postgres=# 
    

    3.使用NOT EXISTS

    SELECT
       count( aid ),bid 
    FROM
       pgbench_accounts 
    WHERE
       NOT EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid ) 
    GROUP BY
       bid;
    

    执行计划:

                                                                               QUERY PLAN                                                                           
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------
     Finalize GroupAggregate  (cost=28327.72..28330.25 rows=10 width=12) (actual time=152.024..153.931 rows=8 loops=1)
       Group Key: pgbench_accounts.bid
       ->  Gather Merge  (cost=28327.72..28330.05 rows=20 width=12) (actual time=152.014..153.917 rows=24 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Sort  (cost=27327.70..27327.72 rows=10 width=12) (actual time=147.782..147.786 rows=8 loops=3)
                   Sort Key: pgbench_accounts.bid
                   Sort Method: quicksort  Memory: 25kB
                   Worker 0:  Sort Method: quicksort  Memory: 25kB
                   Worker 1:  Sort Method: quicksort  Memory: 25kB
                   ->  Partial HashAggregate  (cost=27327.43..27327.53 rows=10 width=12) (actual time=147.732..147.737 rows=8 loops=3)
                         Group Key: pgbench_accounts.bid
                         ->  Hash Anti Join  (cost=1.14..25452.43 rows=375000 width=8) (actual time=0.134..101.884 rows=266667 loops=3)
                               Hash Cond: (pgbench_accounts.bid = pgbench_branches.bid)
                               ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..20560.67 rows=416667 width=8) (actual time=0.032..45.174 rows=333333 loops=3)
                               ->  Hash  (cost=1.12..1.12 rows=1 width=4) (actual time=0.036..0.037 rows=2 loops=3)
                                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                     ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=1 width=4) (actual time=0.025..0.027 rows=2 loops=3)
                                           Filter: (bbalance > 0)
                                           Rows Removed by Filter: 8
     Planning Time: 0.322 ms
     Execution Time: 154.040 ms
    (22 rows)
    
    postgres=# 
    

    4.使用LEFT JOIN和IS NULL

    SELECT
       count( aid ),a.bid 
    FROM
       pgbench_accounts a
       LEFT JOIN pgbench_branches b ON a.bid = b.bid AND b.bbalance > 0 
    WHERE
       b.bid IS NULL 
    GROUP BY
       a.bid;
    

    执行计划:

                                                                                QUERY PLAN                                                                            
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Finalize GroupAggregate  (cost=28327.72..28330.25 rows=10 width=12) (actual time=145.298..147.096 rows=8 loops=1)
       Group Key: a.bid
       ->  Gather Merge  (cost=28327.72..28330.05 rows=20 width=12) (actual time=145.288..147.083 rows=24 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Sort  (cost=27327.70..27327.72 rows=10 width=12) (actual time=141.883..141.887 rows=8 loops=3)
                   Sort Key: a.bid
                   Sort Method: quicksort  Memory: 25kB
                   Worker 0:  Sort Method: quicksort  Memory: 25kB
                   Worker 1:  Sort Method: quicksort  Memory: 25kB
                   ->  Partial HashAggregate  (cost=27327.43..27327.53 rows=10 width=12) (actual time=141.842..141.847 rows=8 loops=3)
                         Group Key: a.bid
                         ->  Hash Anti Join  (cost=1.14..25452.43 rows=375000 width=8) (actual time=0.087..99.535 rows=266667 loops=3)
                               Hash Cond: (a.bid = b.bid)
                               ->  Parallel Seq Scan on pgbench_accounts a  (cost=0.00..20560.67 rows=416667 width=8) (actual time=0.025..44.337 rows=333333 loops=3)
                               ->  Hash  (cost=1.12..1.12 rows=1 width=4) (actual time=0.026..0.027 rows=2 loops=3)
                                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                     ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=1 width=4) (actual time=0.019..0.020 rows=2 loops=3)
                                           Filter: (bbalance > 0)
                                           Rows Removed by Filter: 8
     Planning Time: 0.231 ms
     Execution Time: 147.180 ms
    (22 rows)
    
    postgres=# 
    

    NOT IN 和 <> ALL生成执行计划都包含了一个子查询。他们是各自独立的。

    而NOT EXISTS和LEFT JOIN生成了相同的执行计划。

    这些hash连接(或hash anti join)是完成查询要求的最灵活的方式。这也是推荐exists或join的原因。因此,推荐使用exists或join的经验法则是有效的。

     

    但是,我们继续往下看! 即使有了子查询执行计划,NOT IN子句的执行时间也会更好?

    是的。PostgreSQL做了出色的优化,PostgreSQL将子查询计划进行了hash处理。因此PostgreSQL对如何处理IN子句有了更好的理解,这是一种逻辑思维方式,因为很多人倾向于使用IN子句。子查询返回的行很少,但即使子查询返回几百行,也会发生同样的情况。

    但是,如果子查询返回大量行(几十万行)怎么办?让我们尝试一个简单的测试:

    CREATE TABLE t1 AS
    SELECT * FROM generate_series(0, 500000) id;
    
    CREATE TABLE t2 AS
    SELECT (random() * 4000000)::integer id
    FROM generate_series(0, 4000000);
    
    ANALYZE t1;
    ANALYZE t2;
    
    EXPLAIN SELECT id
    FROM t1
    WHERE id NOT IN (SELECT id FROM t2);
    

    执行计划:

                                       QUERY PLAN                                   
    --------------------------------------------------------------------------------
     Gather  (cost=1000.00..15195064853.01 rows=250000 width=4)
       Workers Planned: 1
       ->  Parallel Seq Scan on t1  (cost=0.00..15195038853.01 rows=147059 width=4)
             Filter: (NOT (SubPlan 1))
             SubPlan 1
               ->  Materialize  (cost=0.00..93326.01 rows=4000001 width=4)
                     ->  Seq Scan on t2  (cost=0.00..57700.01 rows=4000001 width=4)
    (7 rows)
    
    postgres=# 
    

    这里,执行计划将子查询进行了物化。代价评估变成了15195038853.01。(PostgreSQL的默认设置,如果t2表的行低于100k,会将子查询进行hash)。这样就会严重影响性能。因此,对于那种子查询返回的行数很少的场景,IN子句可以起到很好的作用。

    其它注意点

    有的!在我们用不同的方式写查询的时候,可能有数据类型的转换。

    比如,语句:

    EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY(ARRAY['M', 'F']);
    

    就会发生隐式的类型转换:

     Seq Scan on emp  (cost=0.00..1.04 rows=2 width=43) (actual time=0.023..0.026 rows=3 loops=1)
       Filter: ((gen)::text = ANY ('{M,F}'::text[])) 
    

    这里的(gen)::text就发生了类型转换。如果在大表上,这种类型转换的代价会很高,因此,PostgreSQL对IN子句做了更好的处理。

    EXPLAIN ANALYZE SELECT * FROM emp WHERE gen IN ('M','F');
    
     Seq Scan on emp  (cost=0.00..1.04 rows=3 width=43) (actual time=0.030..0.034 rows=3 loops=1)
       Filter: (gen = ANY ('{M,F}'::bpchar[]))
    

    将IN子句转换成了ANY子句,没有对gen列进行类型转换。而是将MF转成了bpchar(内部等价于char)

     

     

    总结

    简单来说,exists和直接join表通常比较好。

    很多情况下,PostgreSQL将IN子句换成被hash的子计划。在一些特殊场景下,IN可以获得更好的执行计划。

  • 相关阅读:
    java UDP传输
    InetAddress与Socket
    TCP和UDP的联系和用途
    网络通讯要素
    什么是网络编程?
    java 合并流(SequenceInputStream)
    最长上升子序列之基础
    《Linux Device Drivers》第十五章 内存映射和DMA——note
    HDU 5092 DP
    NGUI: UIPanel控件
  • 原文地址:https://www.cnblogs.com/abclife/p/13940948.html
Copyright © 2011-2022 走看看