zoukankan      html  css  js  c++  java
  • 编写Postgres扩展之三:调试


    在上一篇关于编写Postgres Extensions的文章中,我们从头开始创建了一个新的数据类型base36。 然而,我们留下了一个导致我们的服务器崩溃的严重bug。

    现在让我们用调试器来查找这个错误并完成测试工具。

    我们根据这些系列中关于编写PostgreSQL扩展的内容创建了一个专用的github repo。上一篇文章中的代码可以在分支部分ii上找到,今天的更改在分支部分iii上。

    Bug

    首先让我们复现这个bug。

    test=# CREATE EXTENSION base36;
    test=# CREATE TABLE base36_test(val base36);
    test=#  EXPLAIN SELECT * FROM base36_test where '3c'::base36 > val;
    server closed the connection unexpectedly
      This probably means the server terminated abnormally
      before or while processing the request.
    The connection to the server was lost. Attempting reset: Failed.
    Time: 680,225 ms
    !>
    

    我们当然不希望这种情况发生在我们的生产数据库上,所以让我们找出问题所在。我们只编写了两个相对简单的C函数base36_outbase36_in。如果我们假设我们并不比PostgreSQL核心团队的人更聪明 - 至少对我个人来说这是一个合理的假设 - 那么bug一定在其中一个中。

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/builtins.h"
    
    PG_MODULE_MAGIC;
    
    PG_FUNCTION_INFO_V1(base36_in);
    Datum
    base36_in(PG_FUNCTION_ARGS)
    {
        long result;
        char *str = PG_GETARG_CSTRING(0);
        result = strtol(str, NULL, 36);
        PG_RETURN_INT32((int32)result);
    }
    
    PG_FUNCTION_INFO_V1(base36_out);
    Datum
    base36_out(PG_FUNCTION_ARGS)
    {
        int32 arg = PG_GETARG_INT32(0);
        if (arg < 0)
            ereport(ERROR,
                (
                 errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("negative values are not allowed"),
                 errdetail("value %d is negative", arg),
                 errhint("make it positive")
                )
            );
        char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
    
        /* max 6 char + '' */
        char *buffer        = palloc(7 * sizeof(char));
        unsigned int offset = 7 * sizeof(char);
        buffer[--offset]    = '';
    
        do {
            buffer[--offset] = base36[arg % 36];
        } while (arg /= 36);
    
        PG_RETURN_CSTRING(&buffer[offset]);
    }
    
    

    设置调试环境

    为了使用想LLDB这样的调试器,你需要开启调试符号编译Postgresql。下面的简短指南适应于安装了Postgresql和Homebrew的MacOS。通过Xcode使用LLDB。

    首先,我们先关闭所有正在运行的Postgresql实例-你不想搞乱你现有的数据库或其他工作:)

    $ cd /usr/local/opt/postgresql
    $ launchctl unload homebrew.mxcl.postgresql.plist
    # Double check it’s not running:
    $ psql some_db
    psql: could not connect to server: No such file or directory
      Is the server running locally and accepting
      connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
    
    

    然后,我们通过运行下面的脚本下载postrgesql源码。

    $ cd ~
    $ curl https://ftp.postgresql.org/pub/source/v9.4.4/postgresql-9.4.4.tar.bz2 | bzip2 -d | tar x
    $ cd postgresql-9.4.4
    

    并启用调试选项进行构建。

    $  ./configure --enable-cassert --enable-debug CFLAGS="-ggdb"
    $ make
    $ sudo make install
    

    我们将跳过Postgres文档推荐的adduser命令。相反,我将使用自己的用户帐户运行Postgres,以便更轻松地进行调试。

    $ sudo chown $(whoami) /usr/local/pgsql
    

    然后初始化数据目录

    $ /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
    

    接着启动服务器

    $ /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /usr/local/pgsql/data/postmaster.log start
    
    

    将新安装的pgsql/bin 路径添加到path环境变量中

    $ export PATH=/usr/local/pgsql/bin:$PATH
    

    安装扩展(由于上面的export,这次使用新安装的pgxn

    $ make && make install
    

    现在我们可以创建test数据库了

    $ /usr/local/pgsql/bin/createdb test
    

    然后连接到这个数据库

    $ /usr/local/pgsql/bin/psql test
    

    检查是否正常

    test=# CREATE EXTENSION base36 ;
    CREATE EXTENSION
    test=# CREATE TABLE base36_test(val base36);
    CREATE TABLE
    test=# INSERT INTO base36_test VALUES ('123'), ('3c'), ('5A'), ('zZz');
    INSERT 0 4
    test=# EXPLAIN SELECT * FROM base36_test where val='3c';
    server closed the connection unexpectedly
      This probably means the server terminated abnormally
      before or while processing the request.
    The connection to the server was lost. Attempting reset: Failed.
    !>
    

    调试

    现在我们已经配置好了调试环境,让我们开始实际追踪问题。首先,我们看一下日志文件。也就是我们用-l指定的文件。根据上面的步骤,文件路径是/usr/local/pgsql/data/postmaster.log

    TRAP: FailedAssertion("!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", File: "mcxt.c", Line: 699)
    LOG:  server process (PID 6515) was terminated by signal 6: Abort trap
    DETAIL:  Failed process was running: EXPLAIN SELECT * FROM base36_test where val='3c';
    LOG:  terminating any other active server processes
    WARNING:  terminating connection because of crash of another server process
    DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
    HINT:  In a moment you should be able to reconnect to the database and repeat your command.
    LOG:  all server processes terminated; reinitializing
    LOG:  database system was interrupted; last known up at 2015-10-09 15:11:18 CEST
    LOG:  database system was not properly shut down; automatic recovery in progress
    LOG:  redo starts at 0/22D0868
    LOG:  record with zero length at 0/2359140
    LOG:  redo done at 0/2359110
    LOG:  last completed transaction was at log time 2015-10-09 15:12:01.344859+02
    LOG:  MultiXact member wraparound protections are now enabled
    LOG:  database system is ready to accept connections
    LOG:  autovacuum launcher started
    

    重新连接到数据库并找出当前数据库会话的pid

    test=# SELECT pg_backend_pid();
     pg_backend_pid
    ----------------
               6644
    (1 row)
    

    将LLDB与pid连接(在另一个终端中)

    lldb -p 6644
    

    在psql会话中运行之前失败命令

    EXPLAIN SELECT * FROM base36_test where val='3c';
    

    继续LLDB

    (lldb) c
    Process 6644 resuming
    Process 6644 stopped
    * thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
        frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10
    libsystem_kernel.dylib`__pthread_kill:
    ->  0x7fff906d2286 <+10>: jae    0x7fff906d2290            ; <+20>
        0x7fff906d2288 <+12>: movq   %rax, %rdi
        0x7fff906d228b <+15>: jmp    0x7fff906cdc53            ; cerror_nocancel
        0x7fff906d2290 <+20>: retq
    

    从LLDB中获取到一个回溯

    (lldb) bt
    * thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
      * frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10
        frame #1: 0x00007fff910f39f9 libsystem_pthread.dylib`pthread_kill + 90
        frame #2: 0x00007fff848f19b3 libsystem_c.dylib`abort + 129
        frame #3: 0x0000000108328549 postgres`ExceptionalCondition(conditionName="!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", errorType="FailedAssertion", fileName="mcxt.c", lineNumber=699) + 137 at assert.c:54
        frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699
        frame #5: 0x00000001082ab9e3 postgres`get_const_expr(constval=0x00007ff9e1806708, context=0x00007fff57e824c8, showtype=0) + 707 at ruleutils.c:8002
        frame #6: 0x00000001082a5f79 postgres`get_rule_expr(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='x01') + 281 at ruleutils.c:6647
        frame #7: 0x00000001082acf22 postgres`get_rule_expr_paren(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='x01', parentNode=0x00007ff9e1806788) + 146 at ruleutils.c:6600
    
    ...more
    

    我们得到了什么?异常在mcxt.c:699定义的pfree抛出。pfree在ruleutils.c:8002中被get_const_expr调用。如果我们去调用堆栈四次。 我们最终到了这里:

    (lldb) up
    frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699
       696     * allocated chunk.
       697     */
       698    Assert(pointer != NULL);
    -> 699    Assert(pointer == (void *) MAXALIGN(pointer));
       700
       701    /*
       702     * OK, it's probably safe to look at the chunk header.
    

    让我们看看mcxt.c中的源码

    文件名mcxt.c:699

    /*
     * pfree
     *    Release an allocated chunk.
     */
    void
    pfree(void *pointer)
    {
      MemoryContext context;
    
      /*
       * Try to detect bogus pointers handed to us, poorly though we can.
       * Presumably, a pointer that isn't MAXALIGNED isn't pointing at an
       * allocated chunk.
       */
      Assert(pointer != NULL);
      Assert(pointer == (void *) MAXALIGN(pointer));
    
      /*
       * OK, it's probably safe to look at the chunk header.
       */
      context = ((StandardChunkHeader *)
             ((char *) pointer - STANDARDCHUNKHEADERSIZE))->context;
    
      AssertArg(MemoryContextIsValid(context));
    
      (*context->methods->free_p) (context, pointer);
      VALGRIND_MEMPOOL_FREE(context, pointer);
    }
    

    Postgres使用pfree从当前内存上下文释放内存。不知何故,我们把内存搞砸了。

    我们看一下指针的内容

    (lldb) print (char *)pointer
    (char *) $0 = 0x00007ff9e1813674 "3c"
    

    这确实是我们的搜索条件3c。所以我们这里做错了什么?正如第一篇文章提到的,pfreepalloc是Postgres中freemalloc的对等体,用于在当前内存上下文中安全地分配和释放内存。不知何故我们搞错了。在base36_out中我们使用

    char *buffer = palloc0(7 * sizeof(char));
    

    来分配了7个字节的内存。最终我们偏移4返回一个指针

    PG_RETURN_CSTRING(&buffer[offset]);
    

    mcxt.c:699中的断言如下:

    Assert(pointer == (void *) MAXALIGN(pointer));
    

    确保要释放的数据正确对齐。这里的条件是

    (pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))
    
    

    当指针以8字节的倍数开始时读取?由于返回的地址与分配的地址不同,因此pfree会因指针没有对齐而报错。

    让我们来修复他!

    文件名:base36.c

    PG_FUNCTION_INFO_V1(base36_out);
    Datum
    base36_out(PG_FUNCTION_ARGS)
    {
        int32 arg = PG_GETARG_INT32(0);
        if (arg < 0)
            ereport(ERROR,
                (
                 errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("negative values are not allowed"),
                 errdetail("value %d is negative", arg),
                 errhint("make it positive")
                )
            );
        char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
    
        /* max 6 char + '' */
        char buffer[7];
        unsigned int offset = sizeof(buffer);
        buffer[--offset] = '';
    
        do {
            buffer[--offset] = base36[arg % 36];
        } while (arg /= 36);
    
        PG_RETURN_CSTRING(pstrdup(&buffer[offset]));
    }
    
    

    现在我们从堆栈中分配缓冲区(第18行),最后我们使用pstrdup复制字符串到新分配的内存(第26行)。这种实现更接近 - 几乎相当于维基百科

    您可能已经猜到pstrdup是Postgres对应的strdup。它通过palloc安全地从当前内存上下文中获取内存,并在事务结束时自动释放。

    类型转换

    现在我们可以为我们的类型输入和输出数据了。从其他类型转换或者转换为其他类型都行。

    文件名:base36–0.0.1.sql

    -- type and operators omitted
    
    CREATE CAST (integer as base36) WITHOUT FUNCTION AS IMPLICIT;
    CREATE CAST (base36 as integer) WITHOUT FUNCTION AS IMPLICIT;
    
    

    这是相对容易的。 由于整数和base36是二进制可强制的(即二进制内部表示相同),转换可以免费完成(WITHOUT FUNCTION)。我们还将此转换标记为隐式转换,从而告诉postgres它可以在适当的时候自动执行转换。例如,考虑这个查询:

    test=# SELECT 10::integer + '5a'::base36;
     ?column?
    ----------
          200
    (1 row)
    

    没有定义整数+ base36运算符,但是通过将base36隐式转换为integer Postgres可以使用integer + integer运算符并将结果转换为integer。然而,隐式转换应该小心定义,因为某些操作的结果可能是可疑的。对于上面的操作,用户不知道结果是整数还是base36,因此可能会误解它。如果我们后来决定添加一个返回base36的运算符integer + base36,查询将完全中断。

    更令人困惑的可能是这个查询结果

    test=# SELECT -50::base36;
     ?column?
    ----------
          -50
    (1 row)
    

    虽然我们不允许负值,但我们在这里得到了一个,这怎么可能呢?内部Postgres执行以下操作:

    test=# SELECT -(50::base36)::integer;
     ?column?
    ----------
          -50
    

    我们可以也应该避免这种令人困惑的行为。一种选择是向base36输出添加前缀(就像十六进制或八进制数一样),或者将责任交给用户,只允许显式强制转换。

    澄清事情的另一个选择是标记zhuanhuan AS ASSIGNMENT.
    只有当您为base36类型指定一个整数时,才会自动执行该转换,反之亦然。这通常适用于INSERT或UPDATE语句。让我们试试这个

    文件名:base36–0.0.1.sql

    -- type and operators omitted
    
    CREATE CAST (integer as base36) WITHOUT FUNCTION AS ASSIGNMENT;
    CREATE CAST (base36 as integer) WITHOUT FUNCTION AS ASSIGNMENT;
    

    填充我们的表格

    test=# INSERT INTO base36_test SELECT i FROM generate_series(1,100) i;
    INSERT 0 100
    SELECT * FROM base36_test ORDER BY val LIMIT 12;
     val
    -----
     1
     2
     3
     4
     5
     6
     7
     8
     9
     a
     b
     c
    (12 rows)
    

    更多内容

    You have seen how important it is to test everything, not only to find bugs that in the worst case might crash the server, but also to specify the expected output from certain operations such as casts. In the next post we’ll elaborate on that creating a full-coverage test suite.

  • 相关阅读:
    有关WCSF的几点整理
    应用Response.Write实现带有进度条的多文件上传
    使用Response.Write实现在页面的生命周期中前后台的交互
    多线程实现Thread.Start()与ThreadPool.QueueUserWorkItem两种方式对比
    LinQ to SQL 及 non-LinQ方式实现Group的Performance对比
    Object.assign()方法
    继承
    面对对象和原型链
    画布实现验证码
    日期的格式 字符串转日期
  • 原文地址:https://www.cnblogs.com/taceywong/p/11273634.html
Copyright © 2011-2022 走看看