zoukankan      html  css  js  c++  java
  • 再谈PG索引-存储架构

    1.索引的基本架构

    PG的索引是B+树,B+树是为磁盘或其他直接存取辅助设备而设计的一种平衡查找树,在B+树中,所有记录节点都是按键值的大小顺序存放在同一层的叶节点中,各叶节点指针进行连接:

    			    meta page
    				|
    			root page(8kb,一个记录占32个bit,那么就能存256个branch page,超过了就需要扩充一级branch page来存储leaf page)
    				|
    		    branch page … 
    			|	         | 			|
    	branch page	branch page	branch page  …
    	| 	    |   			| 		| 	| 		| 		|
    leaf page 	leaf page leaf page leaf page leaf page  leaf page leaf page …
    	|	
    ———————-------------
    |        key       |
    |  (block,offset)  |    一个leaf page存放多个索引值
    ———————-------------

    其中meta page和root page是必须有的,meta page需要一个页来存储,表示指向root page的page id。 

    随着记录数的增加,一个root page可能存不下所有的heap item,就会有branch page,甚至多层的branch page。 

    leaf page存储具体的key和value。

    一共有几层branch,就用btree page元数据的 level 来表示,如果level为0,则表示没有branch层,root page直接指向leaf page,最多记录256条记录(假如条指针占32bit);level为1, 则表示有一层branch page,则root page存放branch page的指针,branch page指向leaf page,最多记录256*256条。依次类推,且bock size是可以设置的,当设置的更大,则一个级别存储的数据就更多。

    2.看看具体的结构

    PostgreSQL B-Tree是一种变种(high-concurrency B-tree management algorithm),算法详情请参考 src/backend/access/nbtree/README。我们可以使用pageinspect插件,内窥B-Tree的结构:

    apple=# create extension pageinspect;
    
    CREATE EXTENSION
    
    apple=# dx
    
                                    List of installed extensions
    
        Name     | Version |   Schema   |                      Description
    
    -------------+---------+------------+-------------------------------------------------------
    
     pageinspect | 1.7     | public     | inspect the contents of database pages at a low level
    
     plpgsql     | 1.0     | pg_catalog | PL/pgSQL procedural language
    
    (2 rows)
    
    apple=# create table test(id int primary key, info text);
    
    CREATE TABLE
    
    apple=# insert into test select generate_series(1, 1000), md5(random()::text);
    
    INSERT 0 1000
    
    apple=# vacuum ANALYZE test;
    
    VACUUM
    
    apple=# d test
    
                    Table "public.test"
    
     Column |  Type   | Collation | Nullable | Default
    
    --------+---------+-----------+----------+---------
    
     id     | integer |           | not null |
    
     info   | text    |           |          |
    
    Indexes:
    
        "test_pkey" PRIMARY KEY, btree (id)
    
     
    
    apple=# select * from bt_metap('test_pkey');
    
     magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
    
    --------+---------+------+-------+----------+-----------+-------------+-------------------------
    
     340322 |       3 |    3 |     1 |        3 |         1 |           0 |                    1000
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey',1);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey',2);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         2 | l    |        367 |          0 |            16 |      8192 |       808 |         1 |         4 |    0 |          1
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey',3);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey',0);
    
    2019-05-29 11:10:47.567 CST [49885] ERROR:  block 0 is a meta page
    
    2019-05-29 11:10:47.567 CST [49885] STATEMENT:  select * from bt_page_stats('test_pkey',0);
    
    ERROR:  block 0 is a meta page
    
    apple=# select * from bt_page_stats('test_pkey',4);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         4 | l    |        268 |          0 |            16 |      8192 |      2788 |         2 |         0 |    0 |          1
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey',5);
    
    2019-05-29 11:11:18.181 CST [49885] ERROR:  block number out of range
    
    2019-05-29 11:11:18.181 CST [49885] STATEMENT:  select * from bt_page_stats('test_pkey',5);
    
    ERROR:  block number out of range
    

    meta page

    root page         #  btpo_flags=2

    branch page    #  btpo_flags=0

    leaf page         #  btpo_flags=1

    如果即是leaf又是root  btpo_flags=3

    根据btpo_flage可以看出结构应该是这样,索引占了5block:

    				meta page (block 0,btpo_flags为2,上面可以看到最后一列)
    					|
    				root page  ( block 3)
    		|				|				|
    leaf page( block 1) 	leaf page ( block 2)leaf page( block 4) 
    		|	
    ———————-------------
    |        key       |
    |  (block,offset)  |    
    ———————-------------
    

    查看root page

    apple=# select * from bt_page_items('test_pkey',3);
    
     itemoffset |  ctid  | itemlen | nulls | vars |          data
    
    ------------+--------+---------+-------+------+-------------------------
    
              1 | (1,0)  |       8 | f     | f    |
    
              2 | (2,7)  |      16 | f     | f    | 6f 01 00 00 00 00 00 00
    
              3 | (4,13) |      16 | f     | f    | dd 02 00 00 00 00 00 00
    
    (3 rows)
    
    

    查看leaf page

    apple=# select * from bt_page_items('test_pkey',1);
     itemoffset |  ctid   | itemlen | nulls | vars |          data
    ------------+---------+---------+-------+------+-------------------------
              1 | (3,7)   |      16 | f     | f    | 6f 01 00 00 00 00 00 00
              2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
              3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
              4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00
              5 | (0,4)   |      16 | f     | f    | 04 00 00 00 00 00 00 00
              6 | (0,5)   |      16 | f     | f    | 05 00 00 00 00 00 00 00
              7 | (0,6)   |      16 | f     | f    | 06 00 00 00 00 00 00 00
              8 | (0,7)   |      16 | f     | f    | 07 00 00 00 00 00 00 00
    ….
    

    查看一个item对应的记录:

    apple=# select * from test where ctid = '(3,7)';
     id  |               info
    -----+----------------------------------
     367 | 06818c090f9e5f63c95764342590a598
    (1 row)
    

    那么索引里面的key是怎么排序的?块为什么不是连续的,块2变为了root page

    3.查看level的变化

    apple=# drop table test;
    
    DROP TABLE
    
    apple=# create table test(id int primary key, info text);
    
    CREATE TABLE
    
    apple=# insert into test select t.id, md5(random()::text) from generate_series(1, 20) as t(id);
    
    INSERT 0 20
    
    apple=# select * from bt_page_stats('test_pkey', 1);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         1 | l    |         20 |          0 |            16 |      8192 |      7748 |         0 |         0 |    0 |          3
    
    (1 row)
    
     
    
    apple=# select * from bt_metap('test_pkey');
    
     magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
    
    --------+---------+------+-------+----------+-----------+-------------+-------------------------
    
     340322 |       3 |    1 |     0 |        1 |         0 |           0 |                      -1
    
    (1 row)
    
     
    
    apple=# select bt_page_items('test_pkey', 1);
    
                     bt_page_items
    
    ------------------------------------------------
    
     (1,"(0,1)",16,f,f,"01 00 00 00 00 00 00 00")
    
     (2,"(0,2)",16,f,f,"02 00 00 00 00 00 00 00")
    
     (3,"(0,3)",16,f,f,"03 00 00 00 00 00 00 00")
    
     (4,"(0,4)",16,f,f,"04 00 00 00 00 00 00 00")
    
     (5,"(0,5)",16,f,f,"05 00 00 00 00 00 00 00")
    
     (6,"(0,6)",16,f,f,"06 00 00 00 00 00 00 00")
    
     (7,"(0,7)",16,f,f,"07 00 00 00 00 00 00 00")
    
     (8,"(0,8)",16,f,f,"08 00 00 00 00 00 00 00")
    
     (9,"(0,9)",16,f,f,"09 00 00 00 00 00 00 00")
    
     (10,"(0,10)",16,f,f,"0a 00 00 00 00 00 00 00")
    
     (11,"(0,11)",16,f,f,"0b 00 00 00 00 00 00 00")
    
     (12,"(0,12)",16,f,f,"0c 00 00 00 00 00 00 00")
    
     (13,"(0,13)",16,f,f,"0d 00 00 00 00 00 00 00")
    
     (14,"(0,14)",16,f,f,"0e 00 00 00 00 00 00 00")
    
     (15,"(0,15)",16,f,f,"0f 00 00 00 00 00 00 00")
    
     (16,"(0,16)",16,f,f,"10 00 00 00 00 00 00 00")
    
     (17,"(0,17)",16,f,f,"11 00 00 00 00 00 00 00")
    
     (18,"(0,18)",16,f,f,"12 00 00 00 00 00 00 00")
    
     (19,"(0,19)",16,f,f,"13 00 00 00 00 00 00 00")
    
     (20,"(0,20)",16,f,f,"14 00 00 00 00 00 00 00")
    
    (20 rows)
    

    插入20条,一个root页面就能存放,那么就没有必要申请一个leaf pageroot page就是leaf page,他们的btpo_flags就是3;由于没有branch page,因此level也就是0

    继续插入数据:

    apple=# insert into test select t.id, md5(random()::text) from generate_series(21, 1000) as t(id);
    
    INSERT 0 980
    
    apple=# select * from bt_metap('test_pkey');
    
     magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
    
    --------+---------+------+-------+----------+-----------+-------------+-------------------------
    
     340322 |       3 |    3 |     1 |        3 |         1 |           0 |                      -1
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey', 0);
    
    ERROR:  block 0 is a meta page
    
    apple=# select * from bt_page_stats('test_pkey', 1);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey', 2);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         2 | l    |        367 |          0 |            16 |      8192 |       808 |         1 |         4 |    0 |          1
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey', 3);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey', 4);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         4 | l    |        268 |          0 |            16 |      8192 |      2788 |         2 |         0 |    0 |          1
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey', 5);
    
    ERROR:  block number out of range
    

    可以看到leve0变为了1,有了新的root page,从block 1变为了block 3,且加入了三个新的leaf page

    我们可以看到一个leaf页面大概在以int类型为索引时,大概可以存放367条记录,而一个root page中记录一个leaf page指针只需要13bit大小,那么我们继续增大多少条,可以出现branch page呢?

    (8192/13) * 367 - 1000 = 230210条,那么我们就插入数据试试:

    apple=# insert into test select t.id, md5(random()::text) from generate_series(1001, 230210) as t(id);
    
    INSERT 0 229210
    
    apple=# select * from bt_page_stats('test_pkey', 0);
    
    ERROR:  block 0 is a meta page
    
    apple=# analyze ;
    
    ANALYZE
    
    apple=# select * from bt_metap('test_pkey');
    
     magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
    
    --------+---------+------+-------+----------+-----------+-------------+-------------------------
    
     340322 |       3 |  412 |     2 |      412 |         2 |           0 |                      -1
    
    (1 row)
    
    apple=# select * from bt_page_stats('test_pkey', 412);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
       412 | r    |          2 |          0 |            12 |      8192 |      8116 |         0 |         0 |    2 |          2
    
    (1 row)
    
    apple=# select * from bt_page_stats('test_pkey', 411);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
       411 | i    |        344 |          0 |            15 |      8192 |      1276 |         3 |         0 |    1 |          0
    
    (1 row)
    
    apple=# select * from bt_page_stats('test_pkey', 3);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         3 | i    |        286 |          0 |            15 |      8192 |      2436 |         0 |       411 |    1 |          0
    
    (1 row)
    
     
    
    apple=# select * from bt_page_stats('test_pkey', 8);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         8 | l    |        367 |          0 |            16 |      8192 |       808 |         7 |         9 |    0 |          1
    
    (1 row)
    
    apple=# select * from bt_page_stats('test_pkey', 1);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1
    
    (1 row)
    
    apple=# select * from bt_page_stats('test_pkey', 2);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         2 | l    |        367 |          0 |            16 |      8192 |       808 |         1 |         4 |    0 |          1
    
    (1 row)
    
    apple=# select * from bt_page_stats('test_pkey', 4);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
         4 | l    |        367 |          0 |            16 |      8192 |       808 |         2 |         5 |    0 |          1
    
    (1 row)
    
    apple=# select * from bt_page_stats('test_pkey', 632);
    
     blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
    
    -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
    
       632 | l    |        362 |          0 |            16 |      8192 |       908 |       631 |         0 |    0 |          1
    
    (1 row)
    
     
    
     
    
    apple=# select * from bt_page_stats('test_pkey', 633);
    
    ERROR:  block number out of range
    

    上面可以看到level由1变为了2,root page又由上面的block 3变为了block 412,而block 3蜕变为一个branch节点;增加了两个branch page分别是btpo_flags为0的block 411和 block 3;可以计算到最后一个page应该是(8192/13) + 2 = 632。

    两个branch page是从第一个指向第二个3的下一个是411,在哪里看到3号block是第一个branch page?

    所有的leaf page是从0->1->2->4->5 … ->632->0,怎么知道哪个leaf page是从哪个branch page指出来的呢?

    答案是可以通过看root page和branch page里面的值来看:

    从root page的值可以看到是存放了对应的branch page的块号和偏移量

    apple=# select * from bt_page_items('test_pkey', 412);
    
     itemoffset |   ctid   | itemlen | nulls | vars |          data
    
    ------------+----------+---------+-------+------+-------------------------
    
              1 | (3,0)    |       8 | f     | f    |
    
              2 | (411,31) |      16 | f     | f    | 77 97 01 00 00 00 00 00
    
    (2 rows)
    

    branch page存放了leaf page的block number和offset

    apple=# select * from bt_page_items('test_pkey', 3);
    
     itemoffset |   ctid    | itemlen | nulls | vars |          data
    
    ------------+-----------+---------+-------+------+-------------------------
    
              1 | (287,31)  |      16 | f     | f    | 77 97 01 00 00 00 00 00
    
              2 | (1,0)     |       8 | f     | f    |
    
              3 | (2,7)     |      16 | f     | f    | 6f 01 00 00 00 00 00 00
    
              4 | (4,13)    |      16 | f     | f    | dd 02 00 00 00 00 00 00
    
              5 | (5,19)    |      16 | f     | f    | 4b 04 00 00 00 00 00 00
    
              6 | (6,25)    |      16 | f     | f    | b9 05 00 00 00 00 00 00
    
              7 | (7,31)    |      16 | f     | f    | 27 07 00 00 00 00 00 00
    
              8 | (8,37)    |      16 | f     | f    | 95 08 00 00 00 00 00 00
    
              9 | (9,43)    |      16 | f     | f    | 03 0a 00 00 00 00 00 00
    
     
    

    那么root page和branch page的每个item会存放他们对应的块里面索引key按照B+树的方式进行组织,我们这里B+树高度为2,每页可存放630条记录,因此,可以大致的画出当前的拓扑图:

    当然这些数据都是基于B+树进行按顺序排列的,B+树可以指定树的宽度,我们这边的宽度不指定,而是按实际大小计算的吗?例如这里:8192/13 = 630,即M=630。

    4.扫描时时间消耗估算

    apple=# explain (analyze, verbose, timing, costs, buffers) select id from test where id = 11;

                                                             QUERY PLAN

    ----------------------------------------------------------------------------------------------------------------------------

     Index Only Scan using test_pkey on public.test  (cost=0.42..8.44 rows=1 width=4) (actual time=0.247..0.248 rows=1 loops=1)

       Output: id

       Index Cond: (test.id = 11)

       Heap Fetches: 1

       Buffers: shared hit=4

     Planning Time: 0.151 ms

     Execution Time: 0.287 ms

    (7 rows)

    costs = 1 meta page + root page + branch page + leaf page = 4

    apple=# explain (analyze, verbose, timing, costs, buffers) select id from test where id < 11;

                                                              QUERY PLAN

    ------------------------------------------------------------------------------------------------------------------------------

     Index Only Scan using test_pkey on public.test  (cost=0.42..8.61 rows=11 width=4) (actual time=0.006..0.009 rows=10 loops=1)

       Output: id

       Index Cond: (test.id < 11)

       Heap Fetches: 10

       Buffers: shared hit=4

     Planning Time: 0.100 ms

     Execution Time: 0.027 ms

    (7 rows)

    costs = 1 meta page + root page + branch page + leaf page = 4,一个块的读取消耗几乎忽略不计。

    apple=# explain (analyze, verbose, timing, costs, buffers) select id from test where id in (1,3,1000, 222222, 111111111, 1232244,11);

                                                             QUERY PLAN

    -----------------------------------------------------------------------------------------------------------------------------

     Index Only Scan using test_pkey on public.test  (cost=0.42..35.06 rows=7 width=4) (actual time=0.035..0.088 rows=5 loops=1)

       Output: id

       Index Cond: (test.id = ANY ('{1,3,1000,222222,111111111,1232244,11}'::integer[]))

       Heap Fetches: 5

       Buffers: shared hit=27

     Planning Time: 0.095 ms

     Execution Time: 0.113 ms

    (7 rows)

    只有五条记录在leaf中有,因此:

    costs = 1 meta page + 5*(root page + branch page 1 + branch page 2 + leaf page) + 2* (root page +  branch page 1 + branch page 2) = 27

    后续给出索引每个item对应的代码结构。

    https://www.cnblogs.com/scu-cjx/p/9960483.html

  • 相关阅读:
    建设是为“有” 共享是为“无”
    设计模式-命令模式
    设计模式-建造者模式
    设计模式-抽象工厂模式(升级版工厂方法模式)
    设计模式-原型模式
    设计模式-单例模式
    Java中的多维数组
    设计模式-装饰者
    设计模式-模板方法
    乐观锁与悲观锁
  • 原文地址:https://www.cnblogs.com/kuang17/p/10945223.html
Copyright © 2011-2022 走看看