  • Bug-Hunting Diary | MySQL 5.7.20 crash caused by a fault in try_acquire_lock_impl

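    Background

    Recently, the primary in one of our production MySQL 5.7.20 clusters crashed at irregular intervals (anywhere from a day or two to three weeks apart), and each crash triggered a primary/replica failover. The stack trace was as follows:

    The stack trace clearly shows that the crash was triggered inside a call to try_acquire_lock_impl.

    Analysis

    A search of the official bug database turned up nothing similar, so I searched the source repository instead and found the matching bug fix, commit 8bc828b982f678d6b57c1853bbe78080c8f84e84: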

    BUG#26502135: MYSQLD SEGFAULTS IN
                  MDL_CONTEXT::TRY_ACQUIRE_LOCK_IMPL
    
    ANALYSIS:
    =========
    Server sometimes exited when multiple threads tried to
    acquire and release metadata locks simultaneously (for
    example, necessary to access a table). The same problem
    could have occurred when new objects were registered/
    deregistered in Performance Schema.
    
    The problem was caused by a bug in LF_HASH - our lock free
    hash implementation which is used by metadata locking
    subsystem in 5.7 branch. In 5.5 and 5.6 we only use LF_HASH
    in Performance Schema Instrumentation implementation. So
    for these versions, the problem was limited to P_S.
    
    The problem was in my_lfind() function, which searches for
    the specific hash element by going through the elements
    list. During this search it loads information about element
    checked such as key pointer and hash value into local
    variables. Then it confirms that they are not corrupted by
    concurrent delete operation (which will set pointer to 0)
    by checking if element is still in the list. The latter
    check did not take into account that compiler (and
    processor) can reorder reads in such a way that load of key
    pointer will happen after it, making result of the check
    invalid.
    
    FIX:
    ====
    This patch fixes the problem by ensuring that no such
    reordering can take place. This is achieved by using
    my_atomic_loadptr() which contains compiler and processor
    memory barriers for the check mentioned above and other
    similar places.
    
    The default (for non-Windows systems) implementation of
    my_atomic*() relies on old __sync intrisics and implements
    my_atomic_loadptr() as read-modify operation. To avoid
    scalability/performance penalty associated with addition of
    my_atomic_loadptr()'s we change the my_atomic*() to use
    newer __atomic intrisics when available. This new default
    implementation doesn't have such a drawback.
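
The last paragraph of the commit message contrasts two ways of obtaining an atomic load. The sketch below illustrates that difference with the GCC/Clang builtins. The helper names (`load_via_sync`, `load_via_atomic`, `check_loads_agree`) are hypothetical and are not MySQL functions, and OR-ing with zero is only one possible way to express a "load" as a read-modify operation with the `__sync` family.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names for illustration; these are not MySQL functions. */

/* Load emulated with a __sync read-modify-write: OR-ing with 0 returns the
   current value unchanged, but the operation is still a write as far as the
   hardware is concerned, so concurrent readers contend for the cache line. */
static inline intptr_t load_via_sync(volatile intptr_t *p)
{
  return __sync_fetch_and_or(p, 0);
}

/* Plain atomic load with the newer __atomic builtins: no store is involved. */
static inline intptr_t load_via_atomic(volatile intptr_t *p)
{
  return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

/* Both variants must observe the same value and leave it untouched. */
static inline void check_loads_agree(intptr_t value)
{
  volatile intptr_t slot = value;
  assert(load_via_sync(&slot) == value);
  assert(load_via_atomic(&slot) == value);
  assert(slot == value);
}
```

Both helpers return the same value; the difference is only in cost under contention, which is why the patch switches the default implementation to the `__atomic` builtins when they are available.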
    

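    In short:

    When multiple threads acquire and release metadata locks at the same time, or when objects are registered or deregistered in Performance Schema, the bug can be triggered and crash the MySQL server.

    The root cause is a bug in LF_HASH (Lock-Free Extensible Hash Tables). Where is LF_HASH used?

    1. In 5.5 and 5.6, only in the Performance Schema instrumentation module.
    2. In 5.7, also in the metadata locking module.

    The bug is in my_lfind(). After copying an element's key pointer and hash value into local variables, the function re-reads *cursor->prev to confirm the element is still in the list, but that re-read was a plain load, so the compiler or processor could reorder it ahead of the field loads and make the check meaningless. The patch fixes this by performing the check with my_atomic_loadptr(), which includes the necessary barriers: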

    diff --git a/mysys/lf_hash.c b/mysys/lf_hash.c
    index dc019b07bd9..3a3f665a4f1 100644
    --- a/mysys/lf_hash.c
    +++ b/mysys/lf_hash.c
    @@ -1,4 +1,4 @@
    -/* Copyright (c) 2006, 2016, Oracle and/or its affiliates. All rights reserved.
    +/* Copyright (c) 2006, 2017, Oracle and/or its affiliates. All rights reserved.
     
        This program is free software; you can redistribute it and/or modify
        it under the terms of the GNU General Public License as published by
    @@ -83,7 +83,8 @@ retry:
       do { /* PTR() isn't necessary below, head is a dummy node */
         cursor->curr= (LF_SLIST *)(*cursor->prev);
         _lf_pin(pins, 1, cursor->curr);
    -  } while (*cursor->prev != (intptr)cursor->curr && LF_BACKOFF);
    +  } while (my_atomic_loadptr((void**)cursor->prev) != cursor->curr &&
    +                              LF_BACKOFF);
       for (;;)
       {
         if (unlikely(!cursor->curr))
    @@ -97,7 +98,7 @@ retry:
         cur_hashnr= cursor->curr->hashnr;
         cur_key= cursor->curr->key;
         cur_keylen= cursor->curr->keylen;
    -    if (*cursor->prev != (intptr)cursor->curr)
    +    if (my_atomic_loadptr((void**)cursor->prev) != cursor->curr)
         {
           (void)LF_BACKOFF;
           goto retry;
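
The read-then-revalidate pattern that the patch repairs can be sketched in portable C11 as follows. This is a simplified illustration under stated assumptions, not the MySQL code: `node_t`, `snapshot_t`, and `snapshot_node` are hypothetical names, the sketch uses `<stdatomic.h>` with an explicit acquire fence instead of `my_atomic_loadptr()`, and it omits the pinning (`_lf_pin`) that the real code uses to keep a node from being freed while it is inspected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified node; not MySQL's LF_SLIST. */
typedef struct node {
  _Atomic(struct node *) link;   /* next element in the list */
  _Atomic(const char *) key;     /* a concurrent delete may reset this */
  _Atomic(uint32_t) hashnr;
} node_t;

/* Local copies of the fields taken during one validation attempt. */
typedef struct {
  node_t *curr;
  const char *key;
  uint32_t hashnr;
} snapshot_t;

/*
  Copy the fields of the node that *prev points to, then confirm that *prev
  still points to that node. Returns true when the copies are consistent
  (out->curr == NULL means end of list) and false when the caller must retry.
*/
static bool snapshot_node(_Atomic(node_t *) *prev, snapshot_t *out)
{
  assert(prev != NULL && out != NULL);

  node_t *curr = atomic_load_explicit(prev, memory_order_acquire);
  out->curr = curr;
  out->key = NULL;
  out->hashnr = 0;
  if (curr == NULL)
    return true;

  out->key = atomic_load_explicit(&curr->key, memory_order_relaxed);
  out->hashnr = atomic_load_explicit(&curr->hashnr, memory_order_relaxed);

  /* Keep the field loads above from being reordered after the re-check
     below; without this ordering the check proves nothing. */
  atomic_thread_fence(memory_order_acquire);

  return atomic_load_explicit(prev, memory_order_relaxed) == curr;
}
```

A caller would loop on `snapshot_node()` until it returns true, which corresponds to the `goto retry` path in the diff. The important point is the ordering between the field loads and the final re-read of `*prev`; the original code had no such ordering.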
    

    Resolution

    According to the change log, the bug was fixed in 5.7.22:

    A server exit could result from simultaneous attempts by multiple threads to register and deregister metadata Performance Schema objects, or to acquire and release metadata locks. (Bug #26502135)

    We upgraded the server to 5.7.29 and monitored it for a month. The crash did not recur, so the problem is resolved.
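
For auditing a fleet, the version boundary above can be turned into a small check. The sketch below is only an illustration: `is_affected_57` is a hypothetical helper, it assumes version strings of the form reported by the server (for example "5.7.20-log"), and it treats every 5.7.x release below 5.7.22 as affected, which is an assumption derived solely from the change log entry. It makes no statement about 5.5 or 5.6, where the commit message says the bug is limited to Performance Schema.

```c
#include <assert.h>
#include <stdio.h>

/*
  Hypothetical helper, not part of MySQL.
  Returns  1 if the version is a 5.7 release older than 5.7.22,
           0 if it is 5.7.22 or a later 5.7 release,
          -1 if it is not a 5.7 release or cannot be parsed.
*/
static int is_affected_57(const char *version)
{
  int major = 0, minor = 0, patch = 0;

  assert(version != NULL);

  /* Suffixes such as "-log" are ignored because parsing stops after the
     third number. */
  if (sscanf(version, "%d.%d.%d", &major, &minor, &patch) != 3)
    return -1;
  if (major != 5 || minor != 7)
    return -1;
  return patch < 22 ? 1 : 0;
}
```

For the versions mentioned in this post, the helper reports 5.7.20 as affected and 5.7.29 as not affected.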

    P.S.

    Space is limited, so a follow-up article will analyze the MDL and LF_HASH source code separately. Stay tuned.


  • Original article: https://www.cnblogs.com/dbtech/p/14495875.html