  • Optimizing PostgreSQL by creating statistics





    postgres=# create table test2(n_id int,id1 int,id2 int);
    postgres=# insert into test2 select i,i/1000,i/10000 from generate_series(1,1000000) s(i);
    INSERT 0 1000000
    postgres=# analyze test2;
    postgres=# \x
    Expanded display is on.
    postgres=# select * from pg_stats where tablename = 'test2' and attname = 'id1';
    -[ RECORD 1 ]----------+-------------------------------------------------------------------------------------
    schemaname             | public
    tablename              | test2
    attname                | id1
    inherited              | f
    null_frac              | 0
    avg_width              | 4
    n_distinct             | 1000
    most_common_vals       | {381,649,852,142,269,415,496,537,714,80,177,303,526,870,924}
    most_common_freqs      | {0.0016,0.0016,0.0015666666,0.0015333333,0.0015333333...}
    histogram_bounds       | {0,10,19,29,39,49,59,69,78,89,99,109,119,128,139,...}
    correlation            | 1
    most_common_elems      | 
    most_common_elem_freqs | 
    elem_count_histogram   | 
    postgres=# explain (analyse,buffers) select * from test2 where id1 = 1;
                                                    QUERY PLAN                                                 
     Seq Scan on test2  (cost=0.00..17906.00 rows=992 width=12) (actual time=0.144..109.573 rows=1000 loops=1)
       Filter: (id1 = 1)
       Rows Removed by Filter: 999000
       Buffers: shared hit=5406
     Planning Time: 0.118 ms
     Execution Time: 109.697 ms
    (6 rows)


    • What happens when we filter on both id1 and id2?
    postgres=# explain (analyse,buffers) select * from test2 where id1 = 1 and id2= 0;
                                                    QUERY PLAN                                                
     Seq Scan on test2  (cost=0.00..20406.00 rows=10 width=12) (actual time=0.153..138.057 rows=1000 loops=1)
       Filter: ((id1 = 1) AND (id2 = 0))
       Rows Removed by Filter: 999000
       Buffers: shared hit=5406
     Planning Time: 0.267 ms
     Execution Time: 138.184 ms
    (6 rows)


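    The selectivity of the first column is about 0.001 (1/1000), and the selectivity of the second column is about 0.01 (1/100). To estimate the number of rows that pass these two "independent" conditions, the planner multiplies their selectivities. So we get

    selectivity = 0.001 * 0.01 = 0.00001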
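    The multiplication above can be turned into a row estimate by scaling it to the table size. This is only a worked sketch of the arithmetic; the constants are the rounded selectivities quoted in the text, not values read from the catalog.

```sql
-- Row estimate under the independence assumption:
-- total rows * selectivity(id1 = 1) * selectivity(id2 = 0)
SELECT round(1000000 * 0.001 * 0.01) AS estimated_rows;
```

    The result corresponds to the rows=10 estimate in the plan, while the query actually returns 1000 rows.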




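    Back to the estimation problem: the value of id2 is really nothing more than id1 / 10 (integer division). In database terms, id2 is functionally dependent on id1. This means the value of id1 is enough to determine the value of id2, and no two rows share the same id1 while having different id2 values. The second filter on id2 therefore removes no rows at all. However, with only per-column statistics the planner has not collected enough information to know this.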
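    The dependency can be checked directly on the data. The sketch below regenerates the same values with generate_series instead of reading test2, so it runs on its own; the alias names are illustrative.

```sql
-- Count id1 values that map to more than one id2 value.
-- Zero means id2 is functionally dependent on id1.
WITH t AS (
    SELECT i / 1000 AS id1, i / 10000 AS id2
    FROM generate_series(1, 1000000) AS s(i)
)
SELECT count(*) AS id1_values_with_several_id2
FROM (
    SELECT id1
    FROM t
    GROUP BY id1
    HAVING count(DISTINCT id2) > 1
) AS violations;
```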

    postgres=# create statistics s1(dependencies) on id1,id2 from test2;
    postgres=# analyze test2;
    postgres=# explain (analyse,buffers) select * from test2 where id1 = 1 and id2 = 0;
                                                    QUERY PLAN                                         
     Seq Scan on test2  (cost=0.00..20406.00 rows=997 width=12) (actual time=0.159..124.450 rows=1000 loops=1)
       Filter: ((id1 = 1) AND (id2 = 0))
       Rows Removed by Filter: 999000
       Buffers: shared hit=5406
     Planning Time: 0.364 ms
     Execution Time: 124.592 ms
    (6 rows)
    postgres=# SELECT stxname,stxkeys,extdat.stxddependencies  FROM pg_statistic_ext ext join pg_statistic_ext_data extdat on ext.oid = extdat.stxoid; 
     stxname | stxkeys |   stxddependencies   
     s1      | 2 3     | {"2 => 3": 1.000000}
    (1 row)
    -- the 2 3 in stxkeys refers to the second and third columns of the table
    postgres=# select statistics_name,attnames,dependencies from  pg_stats_ext;
     statistics_name | attnames  |     dependencies     
     s1              | {id1,id2} | {"2 => 3": 1.000000}
    (1 row)


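    Without functional dependency statistics, the planner assumes the two WHERE conditions are independent and multiplies their selectivities, which yields a row estimate that is far too low. With these statistics, the planner recognizes that the WHERE conditions are redundant and no longer underestimates the row count.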
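    A minimal sketch of how a dependency degree can enter the estimate. It assumes the formula P(a,b) = P(a) * (f + (1 - f) * P(b)), which is how PostgreSQL source comments describe it for some releases; the exact form differs between versions. The inputs p_a, p_b and f are illustrative values, not numbers read from the catalog.

```sql
-- Assumed formula: P(a,b) = P(a) * (f + (1 - f) * P(b))
-- With f = 1 the second condition adds nothing and the estimate
-- collapses to the estimate for the first condition alone.
SELECT round(1000000 * p_a * (f + (1 - f) * p_b)) AS estimated_rows
FROM (VALUES (0.001, 0.01, 1.0)) AS v(p_a, p_b, f);
```

    With a dependency degree of 1.000000, as reported for "2 => 3", the estimate is roughly the one for id1 = 1 on its own, which is consistent with rows=997 in the plan.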


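    ndistinct statistics

    Single-column statistics store the number of distinct values in each column. When several columns are combined (for example GROUP BY a,b) and the planner has only single-column statistics, its estimate of the number of distinct value combinations is often wrong, which leads it to choose a bad plan.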

    -- run GROUP BY id1,id2 on table test2
    postgres=# explain (analyse,buffers) select id1,id2,count(*) from test2  group by id1,id2;
                                                          QUERY PLAN                                   
     HashAggregate  (cost=22906.00..23906.00 rows=100000 width=16) (actual time=473.444..474.544 rows=1001 loops=1)
       Group Key: id1, id2
       Buffers: shared hit=5406
       ->  Seq Scan on test2  (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.022..178.253 rows=1000000 loops=1)
             Buffers: shared hit=5406
     Planning Time: 1.202 ms
     Execution Time: 479.178 ms
    (7 rows)


    postgres=# create statistics s2(ndistinct) on id1,id2 from test2;
    postgres=# analyze test2;
    postgres=# explain (analyse,buffers) select id1,id2,count(*) from test2  group by id1,id2;
                                                          QUERY PLAN                                    
     HashAggregate  (cost=22906.00..22916.00 rows=1000 width=16) (actual time=442.839..443.160 rows=1001 loops=1)
       Group Key: id1, id2
       Buffers: shared hit=5406
       ->  Seq Scan on test2  (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.029..147.498 rows=1000000 loops=1)
             Buffers: shared hit=5406
     Planning Time: 0.364 ms
     Execution Time: 444.362 ms
    (7 rows)
    postgres=# SELECT stxname,stxkeys,extdat.stxdndistinct  FROM pg_statistic_ext ext join pg_statistic_ext_data extdat on ext.oid = extdat.stxoid where stxname = 's2'; 
     stxname | stxkeys | stxdndistinct  
     s2      | 2 3     | {"2, 3": 1000}
    (1 row)

    Typical cases are month, quarter and year columns, or province, city and district columns, which are commonly grouped together.
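    To see why the multi-column estimate goes wrong, it helps to compare the per-column distinct counts with the distinct count of the pair. The sketch below regenerates the test2 values with generate_series so that it is self-contained; the column aliases are illustrative.

```sql
-- Distinct values per column versus distinct (id1, id2) combinations.
WITH t AS (
    SELECT i / 1000 AS id1, i / 10000 AS id2
    FROM generate_series(1, 1000000) AS s(i)
)
SELECT count(DISTINCT id1)        AS distinct_id1,
       count(DISTINCT id2)        AS distinct_id2,
       count(DISTINCT (id1, id2)) AS distinct_pairs
FROM t;
```

    Because id2 is determined by id1, the number of pairs equals the number of id1 values, far below the product of the two per-column counts.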



    MCV(most common values)

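    As with functional dependencies, when only per-column statistics are available the planner treats the two WHERE conditions as independent and multiplies their selectivities, which yields a row estimate that is far too low. A multivariate MCV list stores the most common combinations of values, so the planner can estimate such combinations directly.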


    CREATE TABLE t2 (a   int,b   int);
    INSERT INTO t2 SELECT mod(i,100), mod(i,100) FROM generate_series(1,1000000) s(i);
    analyze t2;
    postgres=#  EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 70) AND (b = 70);
                                                  QUERY PLAN                                           
     Seq Scan on t2  (cost=0.00..19425.00 rows=97 width=8) (actual time=0.038..123.438 rows=10000 loops=1)
       Filter: ((a = 70) AND (b = 70))
       Rows Removed by Filter: 990000
     Planning Time: 0.150 ms
     Execution Time: 124.647 ms
    (5 rows)
    CREATE STATISTICS s3 (mcv) ON a, b FROM t2;
    ANALYZE t2;
    -- valid combination (found in MCV): a=70 and b=70 is estimated at 11267 rows against 10000 actual rows, which is close
    postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 70) AND (b = 70);
                                                    QUERY PLAN                                           
     Seq Scan on t2  (cost=0.00..19425.00 rows=11267 width=8) (actual time=0.069..181.738 rows=10000 loops=1)
       Filter: ((a = 70) AND (b = 70))
       Rows Removed by Filter: 990000
     Planning Time: 1.120 ms
     Execution Time: 182.452 ms
    (5 rows)
    -- invalid combination (not found in MCV)
    postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 70) AND (b = 80);
                                                 QUERY PLAN                                            
     Seq Scan on t2  (cost=0.00..19425.00 rows=1 width=8) (actual time=125.878..125.879 rows=0 loops=1)
       Filter: ((a = 70) AND (b = 80))
       Rows Removed by Filter: 1000000
     Planning Time: 0.207 ms
     Execution Time: 125.945 ms
    (5 rows)
    postgres=# SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
    postgres-# , pg_mcv_list_items(stxdmcv) m WHERE stxname = 's3';
     index | values  | nulls |      frequency       |     base_frequency     
         0 | {70,70} | {f,f} | 0.011266666666666666 | 0.00012693777777777776
         1 | {78,78} | {f,f} |               0.0111 |             0.00012321
         2 | {32,32} | {f,f} | 0.011066666666666667 | 0.00012247111111111112
         3 | {13,13} | {f,f} | 0.011033333333333332 | 0.00012173444444444442
         4 | {82,82} | {f,f} |                0.011 | 0.00012099999999999999
    -- how is rows=11267 computed for WHERE (a = 70) AND (b = 70)?
    rows=  1000000 *  0.011266666666666666
         = 11267

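    The actual frequency of this (a, b) combination in the sample is about 1%. The base frequency of the combination, computed from the simple per-column frequencies, is only about 0.01%, which results in an underestimate of two orders of magnitude.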
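    The gap between the two frequencies can be reproduced from the data itself. This sketch regenerates the t2 values inline rather than reading the table or the sampled statistics, so the numbers are exact rather than sample-based; the alias names are illustrative.

```sql
-- Actual frequency of (a = 70, b = 70) versus the "base" frequency
-- obtained by multiplying the two per-column frequencies.
WITH t AS (
    SELECT mod(i, 100) AS a, mod(i, 100) AS b
    FROM generate_series(1, 1000000) AS s(i)
)
SELECT avg((a = 70)::int)                      AS freq_a,
       avg((b = 70)::int)                      AS freq_b,
       avg((a = 70 AND b = 70)::int)           AS actual_freq,
       avg((a = 70)::int) * avg((b = 70)::int) AS base_freq
FROM t;
```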

    -- compute the estimate for WHERE (a = 1) AND (b = 2)
    alter table  t2 alter column  a SET STATISTICS 10;
    alter table  t2 alter column  b SET STATISTICS 10;
    analyze t2;
    postgres=# SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
    postgres-# WHERE tablename='t2' AND attname in('a','b');
     null_frac | n_distinct | most_common_vals | most_common_freqs 
             0 |        100 | {7,85}           | {0.015,0.015}
             0 |        100 | {7,85}           | {0.015,0.015}
    (2 rows)
    postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
                                                 QUERY PLAN                                            
     Seq Scan on t2  (cost=0.00..19425.75 rows=98 width=8) (actual time=137.876..137.876 rows=0 loops=1)
       Filter: ((a = 1) AND (b = 2))
       Rows Removed by Filter: 1000000
     Planning Time: 0.452 ms
     Execution Time: 137.924 ms
    (5 rows)
    selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
    postgres=# select (1-(0.014999999664723873+0.014999999664723873))/(100-2);
    (1 row)
    rows= reltuple*selectivity(a=1) * selectivity(b = 2)
    postgres=# select 1000000*0.00989795919051583933*0.00989795919051583933;
    (1 row)
     -- a=70 and b=80 is likewise not among the most common values, so its estimate is also 98
    postgres=# EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a =70) AND (b =80);
                                                 QUERY PLAN                                            
     Seq Scan on t2  (cost=0.00..19425.75 rows=98 width=8) (actual time=121.889..121.889 rows=0 loops=1)
       Filter: ((a = 70) AND (b = 80))
       Rows Removed by Filter: 1000000
     Planning Time: 0.160 ms
     Execution Time: 121.952 ms
    (5 rows)

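    It is advisable to create MCV statistics objects only on combinations of columns that are actually used together in conditions, and for which misestimating the number of groups leads to bad execution plans. Otherwise the extra ANALYZE and planning time is simply wasted.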
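    One way to follow this advice is to review which extended statistics objects exist and drop the ones that are not needed. A minimal sketch, assuming the s3 object created earlier is the one to remove:

```sql
-- List every extended statistics object with its table and kinds
-- (stxkind: d = ndistinct, f = dependencies, m = MCV).
SELECT stxnamespace::regnamespace AS schema_name,
       stxname,
       stxrelid::regclass         AS table_name,
       stxkind
FROM pg_statistic_ext
ORDER BY 1, 2;

-- Remove an object that no query benefits from.
DROP STATISTICS IF EXISTS s3;
```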


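    Raising the statistics target (default_statistics_target) may let the planner make more accurate estimates, particularly for columns with irregular data distributions, at the cost of consuming more space in pg_statistic and slightly more time to compute the estimates.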

    postgres=# show default_statistics_target;
        (1 row)
    • The target can be changed for a table column as well as for an index
    ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]  
        action [, ... ]  
        ALTER [ COLUMN ] column_name SET STATISTICS integer 
    postgres=# CREATE TABLE test AS (SELECT random() x, random() y FROM generate_series(1,1000000));
    SELECT 1000000
    postgres=# ANALYZE test;
    postgres=# create index i_test_idx on test((x+y));
    postgres=# analyze test;
    postgres=# explain analyze select * from test where x+y <0.01;
                                                          QUERY PLAN                                   
    Bitmap Heap Scan on test  (cost=7.68..673.21 rows=652 width=16) (actual time=0.036..0.283 rows=60 loops=1)
      Recheck Cond: ((x + y) < '0.01'::double precision)
      Heap Blocks: exact=60
      ->  Bitmap Index Scan on i_test_idx  (cost=0.00..7.51 rows=652 width=0) (actual time=0.017..0.017 rows=60 loops=1)
            Index Cond: ((x + y) < '0.01'::double precision)
    Planning Time: 0.569 ms
    Execution Time: 0.342 ms
    (7 rows)
    postgres=# ALTER INDEX i_test_idx ALTER COLUMN expr SET STATISTICS 3000;
    postgres=# analyze test;
    postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                          QUERY PLAN                                   
    Index Scan using i_test_idx on test  (cost=0.42..135.64 rows=121 width=16) (actual time=0.011..0.277 rows=60 loops=1)
      Index Cond: ((x + y) < '0.01'::double precision)
    Planning Time: 0.515 ms
    Execution Time: 0.342 ms
    (4 rows)
    postgres=#  ALTER INDEX i_test_idx ALTER COLUMN expr SET STATISTICS 10000;
    postgres=# analyze test;
    postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                         QUERY PLAN                                    
     Index Scan using i_test_idx on test  (cost=0.42..80.87 rows=71 width=16) (actual time=0.010..0.217 rows=60 loops=1)
       Index Cond: ((x + y) < '0.01'::double precision)
     Planning Time: 0.784 ms
     Execution Time: 0.283 ms
    (4 rows)

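    ALTER ... SET STATISTICS 3000 changes the statistics target. This number sets how many buckets the histogram uses and how many most common values are stored.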
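    A minimal sketch of setting and resetting a per-column target. The table stats_target_demo and its index are hypothetical scratch objects created only for this example; the value 500 is arbitrary, and -1 restores the default on the releases that store the default that way.

```sql
-- Scratch objects for the demonstration.
CREATE TABLE stats_target_demo AS
SELECT random() AS x, random() AS y
FROM generate_series(1, 1000);
CREATE INDEX stats_target_demo_idx ON stats_target_demo ((x + y));

-- Raise the target for a table column and for the index expression
-- (index columns can be addressed by number).
ALTER TABLE stats_target_demo ALTER COLUMN x SET STATISTICS 500;
ALTER INDEX stats_target_demo_idx ALTER COLUMN 1 SET STATISTICS 500;
ANALYZE stats_target_demo;

-- Inspect the stored targets.
SELECT c.relname, a.attname, a.attstattarget
FROM pg_attribute a
JOIN pg_class c ON a.attrelid = c.oid
WHERE c.relname IN ('stats_target_demo', 'stats_target_demo_idx')
  AND a.attnum > 0;

-- Fall back to default_statistics_target, then clean up.
ALTER TABLE stats_target_demo ALTER COLUMN x SET STATISTICS -1;
DROP TABLE stats_target_demo;
```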


    • Check the modified value
    postgres=# select cla.relname,att.attname,att.attstattarget  from pg_attribute att join pg_class cla on att.attrelid=cla.oid where cla.relname = 'i_test_idx';
      relname   | attname | attstattarget 
     i_test_idx | expr    |          3000
    (1 row)
    postgres=# \d+ i_test_idx
                           Index "public.i_test_idx"
     Column |       Type       | Key? | Definition | Storage | Stats target 
     expr   | double precision | yes  | (x + y)    | plain   | 3000
    btree, for table "public.test"




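    • 1. With LIMIT 1, the actual row count is only 1 and the run time is far below what the cost estimate suggests. This is not an estimation error: the estimates for the scan node are shown as if it ran to completion, but the Limit node stops requesting rows after the first one.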
    postgres=# explain analyze select * from test3 where n_id < 1000 limit 1;
                                                               QUERY PLAN                              
     Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.022..0.022 rows=1 loops=1)
       ->  Index Only Scan using i_test3_id on test3  (cost=0.29..25.13 rows=939 width=4) (actual time=0.019..0.019 rows=1 loops=1)
             Index Cond: (n_id < 1000)
             Heap Fetches: 1
     Planning Time: 0.297 ms
     Execution Time: 0.092 ms
    (6 rows)
    postgres=# explain analyze select * from test3 where n_id < 1000 ;
                                                             QUERY PLAN                                
     Index Only Scan using i_test3_id on test3  (cost=0.29..25.13 rows=939 width=4) (actual time=0.020..19.543 rows=999 loops=1)
       Index Cond: (n_id < 1000)
       Heap Fetches: 999
     Planning Time: 0.718 ms
     Execution Time: 19.707 ms
    (5 rows)
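    • 2. Merge joins behave in a similar way. Once a merge join has exhausted one input, and the last key value of that input is smaller than the next key value in the other input, it stops reading the other input. No further matches are possible in that case, so the rest of the second input is not needed. As a result, one child node is not read to completion.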

    • The Index Scan using i_aj_all_bh_ysaj has an estimated cost of about 470,000, yet the total estimate for the whole plan is about 390,000

    GroupAggregate  (cost=391236.41..391243.61 rows=188 width=45) (actual time=5839.527..5861.034 rows=184 loops=1)
      Group Key: test_1.c_jbfy
      ->  Sort  (cost=391236.41..391238.18 rows=710 width=38) (actual time=5839.324..5847.166 rows=105340 loops=1)
            Sort Key: test_1.c_jbfy
            Sort Method: quicksort  Memory: 11302kB
            ->  Merge Join  (cost=342460.08..391202.78 rows=710 width=38) (actual time=4280.410..5688.354 rows=105340 loops=1)
                  Merge Cond: ((test.c_bh_ysaj)::text = (test_1.c_bh)::text)
                  ->  Index Scan using i_aj_all_bh_ysaj on test  (cost=0.43..470402.81 rows=127589 width=32) (actual time=0.012..1054.034 rows=162022 loops=1)
                        Index Cond: (c_bh_ysaj IS NOT NULL)
                        Filter: ((c_ah IS NOT NULL) AND (d_jarq >= to_date('20190101'::text, 'yyyymmdd'::text)) AND (d_jarq <= to_date('20200101'::text, 'yyyymmdd'::text)))
                        Rows Removed by Filter: 254334
                  ->  Sort  (cost=342459.60..343014.47 rows=221949 width=38) (actual time=4279.757..4342.135 rows=357118 loops=1)
                        Sort Key: test_1.c_bh
                        Sort Method: quicksort  Memory: 40188kB
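    The Limit cost in the LIMIT 1 example above can be reproduced by hand. The sketch assumes the planner scales the child's run cost by the fraction of rows it expects to fetch, which is how the Limit node is commonly described; the numbers are copied from that plan (child cost=0.29..25.13, rows=939, limit 1).

```sql
-- startup_cost + (total_cost - startup_cost) * limit_rows / estimated_rows
SELECT round(0.29 + (25.13 - 0.29) * 1 / 939, 2) AS limit_total_cost;
```

    The result corresponds to the upper bound in cost=0.29..0.32 on the Limit node. The merge join plan follows the same idea: the planner expects to stop reading the index scan early, so the join's total cost can be lower than the full cost of its child.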



    2. For queries such as GROUP BY a,b, creating ndistinct statistics can improve the execution plan


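    5. With LIMIT or a merge join, costs behave differently: a parent node's cost can be lower than its child's, because the child is not expected to be read to completion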





  • Original article: https://www.cnblogs.com/zhangfx01/p/15587556.html