zoukankan      html  css  js  c++  java
  • 被sleep开了个小玩笑

    本案例转载自李大玉老师分享

    Ⅰ、问题背景

    探活脚本连续8次探测,判断主库异常,触发切换(判断备机是否有延迟,kill原主,VIP飘到备机,设置新主可写)
    切换后,业务还是异常,SQL查询没返回,DB连接数耗完,为了恢复业务,重启新主后业务恢复

    两个问题

    • 主库为什么会切换
    • 切换后新主库为什么还是不可用

    Ⅱ、问题排查

    1、根据系统运行时间、系统日志以及服务器带外日志可排除服务器和数据库OOM
    2、从每分钟的线程快照中发现故障时大量线程处于sending data和statistics状态,但前一分钟快照中未看到任何阻塞与锁等待
    3、被阻塞的sql都是基于主键或者业务索引访问,理论上没问题,提取sql在从库执行一遍,很快,且当时没有产生慢日志,固排除sql执行效率慢导致阻塞
    4、statistics是为sql生成执行计划的,会触发表的统计操作,而统计操作需要对表中page进行采样,会触发io,分析当时磁盘iops、吞吐量、cpu负载等,和前一天基本吻合,排除系统负载导致性能下降

    思路中断~~~

    复盘,开发反映业务切到新db,业务各个接口耗时变大,询问是否新库服务器性能不如老库
    会后上机器发现实例上存在20个处于user sleep状态的sql,大概模型都是where id = '+(select 'rbzd' where 6910=6910 and sleep(300)+)',比较可疑,因为开发不会在程序中调用sleep函数

    重点分析此sql
    此sql不占用系统资源,但是写法可疑,类似sql注入
    经查,sleep操作和innodb_thread_concurrency参数互斥,这样每秒只能处理24个sql
    换言之当有24个线程进入引擎并处于sleep状态的话,其他线程是不能进入innodb引擎层,这里的24是和线上MySQL参数innodb_thread_concurrency被设置为24有关

    Ⅲ、模拟故障

    3.1 准备数据

    CREATE TABLE `test` (
    `id` int(11) DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
    
    insert into test select 1;
    

    3.2 开24个线程访问此表

    提前设置innodb_thread_concurrency设置为24模拟线上情况

    [root@VM_42_63_centos ~]# for i in `seq 1 24`; do nohup mysql -S /tmp/mysql.sock3306 -e "select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+''" & done
    

    3.3 观察一波

    执行上一步脚本后,再开一个session访问test表会发现线程hang住,没返回,状态为sending data,如下:

    (root@localhost) [test]> show processlist;
    +----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
    | Id | User | Host            | db   | Command | Time | State        | Info                                                                                   |
    +----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
    |  2 | root | localhost:38204 | NULL | Sleep   |    0 |              | NULL                                                                                   |
    |  3 | root | localhost       | test | Query   |    0 | starting     | show processlist                                                                       |
    |  4 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    |  5 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    |  6 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    |  7 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    |  8 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    |  9 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 10 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 11 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 12 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 13 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 14 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 15 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 16 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 17 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 18 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 19 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 20 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 21 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 22 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 23 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 24 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 25 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 26 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 27 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
    | 28 | root | localhost       | NULL | Query   |   11 | Sending data | select id from test.test                                                               |
    +----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
    27 rows in set (0.00 sec)
    

    Ⅳ、故障回顾

    1、sql注入,user sleep状态的sql累计到24条之后,其他sql就不能进入innodb进行操作,包括高可用探测程序,由于线程快照中过滤了sleep,导致没抓到注入的sql,加大了后续排查难度

    2、主库的存活探测程序检查失败(探测方式为update一个innodb表),连续8次失败后,ha认为实例异常,则kill主,触发切换流程

    3、切换到主库后,注入还在继续,所以同样的故障在新主上重演

    4、重启主库后,仅有20个注入进入innodb且一直为user sleep,由于没达到24个触发阈值,所以业务表现正常,只是性能不及老主库,原因就是已经有20个线程在innodb层一直没退出,kill掉这些线程,业务恢复正常

    Ⅴ、问题解决

    • 由于线程快照过滤了sleep关键字,导致排查过程较长,后续要对user sleep状态线程保留在线程快照中
    • 开发优化程序,防止sql注入
  • 相关阅读:
    bat命令2
    bat实现创建、复制、删除文件及文件夹
    bat命令1
    ckeditor使用教程
    统计SqlServer每张表内的数据量
    entiryFramework 事务控制
    jQuery验证控件jquery.validate.js使用说明
    asp.net中套用母版页之后的findcontrol
    SessionState
    窗体单例问题
  • 原文地址:https://www.cnblogs.com/---wunian/p/9548718.html
Copyright © 2011-2022 走看看