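Continuing from the previous post.
Back to the call chain:
estimate_rel_size -> RelationGetNumberOfBlocks -> RelationGetNumberOfBlocksInFork
-> smgrnblocks -> mdnblocks ...
All of this roundabout work exists only to estimate the size of a table.
So what unit is the block count we get back actually measured in?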
BlockNumber
mdnblocks(SMgrRelation reln, ForkNumber forknum)
{
    MdfdVec    *v = mdopen(reln, forknum, EXTENSION_FAIL);
    BlockNumber nblocks;
    BlockNumber segno = 0;

    /*
     * Skip through any segments that aren't the last one, to avoid redundant
     * seeks on them.  We have previously verified that these segments are
     * exactly RELSEG_SIZE long, and it's useless to recheck that each time.
     *
     * NOTE: this assumption could only be wrong if another backend has
     * truncated the relation.  We rely on higher code levels to handle that
     * scenario by closing and re-opening the md fd, which is handled via
     * relcache flush.  (Since the checkpointer doesn't participate in
     * relcache flush, it could have segment chain entries for inactive
     * segments; that's OK because the checkpointer never needs to compute
     * relation size.)
     */
    while (v->mdfd_chain != NULL)
    {
        segno++;
        v = v->mdfd_chain;
    }

    for (;;)
    {
        nblocks = _mdnblocks(reln, forknum, v);

        fprintf(stderr,"%d blocks by process %d\n\n",nblocks,getpid());

        if (nblocks > ((BlockNumber) RELSEG_SIZE))
            elog(FATAL, "segment too big");

        if (nblocks < ((BlockNumber) RELSEG_SIZE))
            return (segno * ((BlockNumber) RELSEG_SIZE)) + nblocks;

        /*
         * If segment is exactly RELSEG_SIZE, advance to next one.
         */
        segno++;

        if (v->mdfd_chain == NULL)
        {
            /*
             * Because we pass O_CREAT, we will create the next segment (with
             * zero length) immediately, if the last segment is of length
             * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
             * the logic simple.
             */
            v->mdfd_chain = _mdfd_openseg(reln, forknum, segno, O_CREAT);
            if (v->mdfd_chain == NULL)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not open file \"%s\": %m",
                                _mdfd_segpath(reln, forknum, segno))));
        }

        v = v->mdfd_chain;
    }
}
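The return expression in that loop can be restated outside PostgreSQL as a small sketch. It assumes the defaults of a stock build (8192-byte blocks and 131072 blocks per segment file, i.e. 1GB segments); the helper name total_blocks and its array-of-file-sizes input are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed defaults of a stock PostgreSQL build; both are compile-time options. */
#define SKETCH_BLCKSZ      8192u    /* bytes per block */
#define SKETCH_RELSEG_SIZE 131072u  /* blocks per segment file (1GB) */

/*
 * Hypothetical helper: given the byte sizes of a relation's segment files,
 * in order, compute the total block count the way the loop in mdnblocks
 * does. Each full segment contributes RELSEG_SIZE blocks, and the first
 * segment that is not full ends the count.
 */
static uint32_t
total_blocks(const uint64_t *seg_bytes, size_t nsegs)
{
    uint32_t segno = 0;

    for (size_t i = 0; i < nsegs; i++)
    {
        uint32_t nblocks = (uint32_t) (seg_bytes[i] / SKETCH_BLCKSZ);

        /* Mirrors the "segment too big" check. */
        assert(nblocks <= SKETCH_RELSEG_SIZE);

        if (nblocks < SKETCH_RELSEG_SIZE)
            return segno * SKETCH_RELSEG_SIZE + nblocks;

        segno++;
    }

    /* Every listed segment was exactly full. */
    return segno * SKETCH_RELSEG_SIZE;
}
```

For a table that fits in one segment, as in the experiment that follows, the result is simply the file size divided by the block size.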
Let's verify this with an experiment.
First, create a table:
postgres=# create table tst01(id integer);
CREATE TABLE
postgres=#
postgres=# select oid from pg_class where relname='tst01';
  oid
-------
 16384
(1 row)
As far as I know, a value of type integer in PostgreSQL takes up 4 bytes in each record.
So I figured: 4 bytes × 2048 records = 8192 bytes, which is 8K.
What actually happens?
[root@lex base]# ls ./12788/16384
./12788/16384

postgres=# insert into tst01 values(generate_series(1,2048));
INSERT 0 2048
postgres=#

[root@lex base]# ls -lrt ./12788/16384
-rw------- 1 postgres postgres 81920 May 28 11:54 ./12788/16384
[root@lex base]# ls -lrt -kb ./12788/16384
-rw------- 1 postgres postgres 80 May 28 11:54 ./12788/16384
[root@lex base]#
Not 8K, but 80K!
What happens if the amount of data is doubled?
postgres=# insert into tst01 values(generate_series(2049,4096));
INSERT 0 2048
postgres=#

[root@lex base]# ls -lrt -kb ./12788/16384
-rw------- 1 postgres postgres 152 May 28 11:56 ./12788/16384
[root@lex base]#
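I had assumed that only a small part of each 8K block is overhead (such as the header), but that turns out not to be the case.
I asked an expert, and this was the answer: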
postgres=# select pg_column_size(id) from tst01 limit 1;
 pg_column_size
----------------
              4
(1 row)

postgres=# select pg_column_size(t) from tst01 t limit 1;
 pg_column_size
----------------
             28
(1 row)
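The 28 bytes per row go a long way toward explaining the file sizes seen earlier. The sketch below works through the arithmetic under assumptions that are not stated in the output above: an 8192-byte page with a 24-byte page header, a 4-byte line pointer per row, row lengths rounded up to a multiple of 8 (as on a typical 64-bit build), and the default fill factor. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants for a typical 64-bit PostgreSQL build. */
#define SKETCH_PAGE_BYTES    8192u  /* size of one heap page */
#define SKETCH_PAGE_HEADER   24u    /* fixed page header */
#define SKETCH_LINE_POINTER  4u     /* per-row item pointer */
#define SKETCH_MAXALIGN      8u     /* row lengths are rounded up to this */

/* Round a length up to the next multiple of the assumed alignment. */
static uint32_t
max_align(uint32_t len)
{
    return (len + SKETCH_MAXALIGN - 1) / SKETCH_MAXALIGN * SKETCH_MAXALIGN;
}

/* How many rows of the given size (header included) fit on one page. */
static uint32_t
rows_per_page(uint32_t tuple_bytes)
{
    assert(tuple_bytes > 0);
    return (SKETCH_PAGE_BYTES - SKETCH_PAGE_HEADER)
        / (max_align(tuple_bytes) + SKETCH_LINE_POINTER);
}

/* How many pages are needed to hold nrows rows of the given size. */
static uint32_t
pages_needed(uint32_t nrows, uint32_t tuple_bytes)
{
    uint32_t per_page = rows_per_page(tuple_bytes);

    return (nrows + per_page - 1) / per_page;
}
```

Under these assumptions a 28-byte row costs 36 bytes on the page, so rows_per_page(28) is 226. pages_needed(2048, 28) is then 10 pages (80K) and pages_needed(4096, 28) is 19 pages (152K), which agrees with the ls output above.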
Now let's look at how the code handles blocks:
postgres=# select count(*) from tst01;
 count
-------
  4096
(1 row)

postgres=#
At this point, the backend prints:
19 blocks by process 4920
What does 19 mean here?
[root@lex 12788]# ls -lrt 16384
-rw------- 1 postgres postgres 155648 May 28 11:58 16384
[root@lex 12788]#

155648 / 8192 = 19
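It matches exactly. So in the PostgreSQL source code, the block count returned by mdnblocks is the number of 8K data blocks.
The small experiment above also shows that when each record carries very little data, the per-row header overhead makes up a large share of the space.
Therefore, to estimate correctly how much space a table really occupies, you basically have to rely on sampling.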
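One possible reading of "sampling" is to count the rows on a few sampled blocks and scale that density by the block count that mdnblocks reports. The following is only a minimal sketch of that idea, assuming the sampled blocks are representative of the whole table; the function name estimate_rows is hypothetical, and this is not a description of how PostgreSQL's own statistics collection works.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: estimate the number of rows in a relation from the
 * row counts observed on a handful of sampled blocks.
 *
 *   estimate = (average rows per sampled block) * (blocks in the relation)
 */
static double
estimate_rows(const uint32_t *rows_in_sampled_block, size_t nsampled,
              uint32_t nblocks_in_rel)
{
    uint64_t sum = 0;

    assert(nsampled > 0);

    for (size_t i = 0; i < nsampled; i++)
        sum += rows_in_sampled_block[i];

    return (double) sum / (double) nsampled * (double) nblocks_in_rel;
}
```

The quality of the estimate depends entirely on the sample. If only full blocks of tst01 happen to be sampled, the result overshoots the real 4096 rows, because the last block of the table is only partly filled.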