zoukankan      html  css  js  c++  java
  • MySQL Performance-Schema(三) 实践篇

         前一篇文章我们分析了Performance-Schema中每个表的用途,以及主要字段的含义,比较侧重于理论的介绍。这篇文章我主要从DBA的角度出发,详细介绍如何通过Performance-Schema得到DBA关心的数据,比如哪个SQL执行次数最多,哪个表访问最频繁,哪个锁最热等信息。通过充分利用Performance-Schema表的数据,让DBA更了解DB的运行状态,也更有助于排查定位问题。本文主要从两方面展开讨论,第一方面是统计信息维度,包括SQL维度,对象维度和等待事件维度三小点;第二方面是定位分析问题维度,结合实际应用场景,利用Performance-Schema搜集的数据进行分析。

    统计信息(SQL维度)
         关于SQL维度的统计信息主要集中在events_statements_summary_by_digest表中,通过将SQL语句抽象出digest,可以统计某类SQL语句在各个维度的统计信息(比如:执行次数,排序次数,使用临时表等)
    (1).哪类SQL执行最多?

    复制代码
    SELECT
    DIGEST_TEXT,
    COUNT_STAR,
    FIRST_SEEN,
    LAST_SEEN
    FROM events_statements_summary_by_digest
    ORDER BY COUNT_STAR DESC limit 1G
    *************************** 1. row ***************************
    DIGEST_TEXT: SELECT `pay_order_id` , `total_fee` , `pay_seller_id` , `pay_buyer_id` , `buyer_id` , `seller_id` FROM `pay_order` WHERE `pay_order_id` = ? 
    COUNT_STAR: 5282893
    FIRST_SEEN: 2015-12-15 18:25:38
    LAST_SEEN: 2015-12-18 16:20:59
    复制代码

    可以看到执行次数最多的SQL是“SELECT `pay_order_id` , `total_fee` , `pay_seller_id` , pay_buyer_id , buyer_id , seller_id FROM tc_pay_order WHERE pay_order_id = ?”,FIRST_SEEN和LAST_SEEN分别显示了语句第一次执行和最后一次执行的时间点。
    (2).哪类SQL的平均响应时间最多?
    AVG_TIMER_WAIT
    (3).哪类SQL排序记录数最多?
    SUM_SORT_ROWS
    (4).哪类SQL扫描记录数最多?
    SUM_ROWS_EXAMINED
    (5).哪类SQL使用临时表最多?
    SUM_CREATED_TMP_TABLES,SUM_CREATED_TMP_DISK_TABLES
    (6).哪类SQL返回结果集最多?
    SUM_ROWS_SENT
         通过上述指标我们可以间接获得某类SQL的逻辑IO(SUM_ROWS_EXAMINED),CPU消耗(SUM_SORT_ROWS),网络带宽(SUM_ROWS_SENT)的对比,但还无法得到某类SQL的物理IO消耗,以及某类SQL访问数据的buffer命中率。

    统计信息(对象维度)
    (1).哪个表物理IO最多?

    复制代码
    SELECT
    file_name,
    event_name,
    SUM_NUMBER_OF_BYTES_READ,
    SUM_NUMBER_OF_BYTES_WRITE
    FROM file_summary_by_instance
    ORDER BY SUM_NUMBER_OF_BYTES_READ + SUM_NUMBER_OF_BYTES_WRITE DESC LIMIT 1G
    *************************** 1. row ***************************
    file_name: /u01/my3306/data/chuck/test_01.ibd
    event_name: wait/io/file/innodb/innodb_data_file
    SUM_NUMBER_OF_BYTES_READ: 2168078336
    SUM_NUMBER_OF_BYTES_WRITE: 1390761934848
    复制代码

    通过file_summary_by_instance表,可以获得系统运行到现在,哪个文件(表)物理IO最多,这可能意味着这个表经常需要访问磁盘IO,从结果来看chuck库里面的test_01的数据文件访问最多。

    (2).哪个表逻辑IO最多?

    复制代码
    SELECT
    object_name,
    COUNT_READ,
    COUNT_WRITE,
    COUNT_FETCH,
    SUM_TIMER_WAIT
    FROM table_io_waits_summary_by_table
    ORDER BY sum_timer_wait DESC limit 1G
    *************************** 1. row ***************************
    object_name: test_slow
    COUNT_READ: 1986130
    COUNT_WRITE: 393222
    COUNT_FETCH: 1986130
    SUM_TIMER_WAIT: 925309746633072
    复制代码

    通过table_io_waits_summary_by_table表,可以获得系统运行到现在,哪个表逻辑IO最多,亦即最“热”的表,从结果来看chuck库里面的test_slow表访问次数最多。

    (3).哪个索引访问最多?

    复制代码
    SELECT
    OBJECT_NAME,
    INDEX_NAME,
    COUNT_FETCH,
    COUNT_INSERT,
    COUNT_UPDATE,
    COUNT_DELETE
    FROM table_io_waits_summary_by_index_usage
    ORDER BY SUM_TIMER_WAIT DESC limit 1G
    *************************** 1. row ***************************
    OBJECT_NAME: test_slow
    INDEX_NAME: PRIMARY
    COUNT_FETCH: 393247
    COUNT_INSERT: 0
    COUNT_UPDATE: 393222
    COUNT_DELETE: 0
    复制代码

    通过table_io_waits_summary_by_index_usage表,可以获得系统运行到现在,哪个表的具体哪个索引(包括主键索引,二级索引)使用最多,从结果来看,我们知道test_slow表访问最多,并且都是通过主键访问。

    (4).哪个索引从来没有使用过?

    复制代码
    SELECT
    OBJECT_SCHEMA,
    OBJECT_NAME,
    INDEX_NAME
    FROM table_io_waits_summary_by_index_usage
    WHERE INDEX_NAME IS NOT NULL
    AND COUNT_STAR = 0
    AND OBJECT_SCHEMA <> 'mysql'
    ORDER BY OBJECT_SCHEMA,OBJECT_NAME;
    *************************** 1. row ***************************
    OBJECT_SCHEMA: chuck
    OBJECT_NAME: test_icp
    INDEX_NAME: idx_y_z
    复制代码

    通过table_io_waits_summary_by_index_usage表,我们还可以获得系统运行到现在,哪些索引从来没有被用过。由于索引也会占用大量的空间,我们可以利用这个统计信息,结合一定的时间策略将无用的索引删除。上面的结果显示,chuck库test_icp表的idx_y_z从来没有被使用过。

    统计信息(等待事件维度)
    (1).哪个等待事件消耗的时间最多?

    复制代码
    SELECT
    EVENT_NAME,
    COUNT_STAR,
    SUM_TIMER_WAIT,
    AVG_TIMER_WAIT
    FROM events_waits_summary_global_by_event_name
    WHERE event_name != 'idle'
    ORDER BY SUM_TIMER_WAIT DESC LIMIT 1;
    *************************** 1. row ***************************
    EVENT_NAME: wait/synch/cond/threadpool/worker_cond
    COUNT_STAR: 26369820
    SUM_TIMER_WAIT: 1604134685750586132
    AVG_TIMER_WAIT: 60832219664
    复制代码

    通过events_waits_summary_global_by_event_name表,可以获取到系统运行到现在,消耗时间最多的事件,当然还可以根据其它维度排序,比如平均等待时间,从结果来看wait/synch/cond/threadpool/worker_cond这个事件消耗的累计时间最长。

    具体场景分析
          前面我们讨论的基本都是汇总信息,有点类似与ORACLE-AWR(Automatic Workload Repository)的味道,那么我们如何利用peformance schema来定位问题呢?或者对于的发生的问题,比如抖动,我们是否有办法知道当时发生了什么?
    (1).剖析某条SQL的执行情况,包括statement信息,stage信息和wait信息。
         有时候我们需要分析具体某条SQL,该SQL在执行各个阶段的时间消耗,通过events_statements_xxx表和events_stages_xxx表,就可以达到目的。两个表通过event_id与nesting_event_id关联,stages表的nesting_event_id为对应statements表的event_id。比如分析包含count(*)的某条SQL语句,具体如下:

    复制代码
    SELECT
    EVENT_ID,
    sql_text
    FROM events_statements_history
    WHERE sql_text LIKE '%count(*)%';
    +----------+--------------------------------------+
    | EVENT_ID | sql_text |
    +----------+--------------------------------------+
    | 1690 | select count(*) from chuck.test_slow |
    +----------+--------------------------------------+
    复制代码

    首先得到了语句的event_id为1690,通过查找events_stages_xxx中nesting_event_id为1690的记录,可以达到目的。
    a.查看每个阶段的时间消耗

    复制代码
    SELECT
    event_id,
    EVENT_NAME,
    SOURCE,
    TIMER_END - TIMER_START
    FROM events_stages_history_long
    WHERE NESTING_EVENT_ID = 1690;
    +----------+--------------------------------+----------------------+-----------------------+
    | event_id | EVENT_NAME | SOURCE | TIMER_END-TIMER_START |
    +----------+--------------------------------+----------------------+-----------------------+
    | 1691 | stage/sql/init | mysqld.cc:990 | 316945000 |
    | 1693 | stage/sql/checking permissions | sql_parse.cc:5776 | 26774000 |
    | 1695 | stage/sql/Opening tables | sql_base.cc:4970 | 41436934000 |
    | 2638 | stage/sql/init | sql_select.cc:1050 | 85757000 |
    | 2639 | stage/sql/System lock | lock.cc:303 | 40017000 |
    | 2643 | stage/sql/optimizing | sql_optimizer.cc:138 | 38562000 |
    | 2644 | stage/sql/statistics | sql_optimizer.cc:362 | 52845000 |
    | 2645 | stage/sql/preparing | sql_optimizer.cc:485 | 53196000 |
    | 2646 | stage/sql/executing | sql_executor.cc:112 | 3153000 |
    | 2647 | stage/sql/Sending data | sql_executor.cc:192 | 7369072089000 |
    | 4304138 | stage/sql/end | sql_select.cc:1105 | 19920000 |
    | 4304139 | stage/sql/query end | sql_parse.cc:5463 | 44721000 |
    | 4304145 | stage/sql/closing tables | sql_parse.cc:5524 | 61723000 |
    | 4304152 | stage/sql/freeing items | sql_parse.cc:6838 | 455678000 |
    | 4304155 | stage/sql/logging slow query | sql_parse.cc:2258 | 83348000 |
    | 4304159 | stage/sql/cleaning up | sql_parse.cc:2163 | 4433000 |
    +----------+--------------------------------+----------------------+-----------------------+
    复制代码

    通过间接关联,我们能分析得到SQL语句在每个阶段的时间消耗,时间单位以皮秒表示。这里展示的结果很类似profiling功能,有了performance schema,就不再需要profiling这个功能了。另外需要注意的是,由于默认情况下events_stages_history表中只为每个连接记录了最近10条记录,为了确保获取所有记录,需要访问events_stages_history_long表。

    b.查看某个阶段的锁等待情况
          针对每个stage可能出现的锁等待,一个stage会对应一个或多个wait,events_waits_history_long这个表容易爆满[默认阀值10000]。由于select count(*)需要IO(逻辑IO或者物理IO),所以在stage/sql/Sending data阶段会有io等待的统计。通过stage_xxx表的event_id字段与waits_xxx表的nesting_event_id进行关联。

    复制代码
    SELECT
    event_id,
    event_name,
    source,
    timer_wait,
    object_name,
    index_name,
    operation,
    nesting_event_id
    FROM events_waits_history_long
    WHERE nesting_event_id = 2647;
    +----------+---------------------------+-----------------+------------+-------------+------------+-----------+------------------+
    | event_id | event_name | source | timer_wait | object_name | index_name | operation | nesting_event_id |
    +----------+---------------------------+-----------------+------------+-------------+------------+-----------+------------------+
    | 190607 | wait/io/table/sql/handler | handler.cc:2842 | 1845888 | test_slow | idx_c1 | fetch | 2647 |
    | 190608 | wait/io/table/sql/handler | handler.cc:2842 | 1955328 | test_slow | idx_c1 | fetch | 2647 |
    | 190609 | wait/io/table/sql/handler | handler.cc:2842 | 1929792 | test_slow | idx_c1 | fetch | 2647 | 
    | 190610 | wait/io/table/sql/handler | handler.cc:2842 | 1869600 | test_slow | idx_c1 | fetch | 2647 |
    | 190611 | wait/io/table/sql/handler | handler.cc:2842 | 1922496 | test_slow | idx_c1 | fetch | 2647 |
    +----------+---------------------------+-----------------+------------+-------------+------------+-----------+------------------+
    复制代码

    通过上面的实验,我们知道了statement,stage,wait的三级结构,通过nesting_event_id进行关联,它表示某个事件的父event_id。

    (2).模拟innodb行锁等待的例子
          会话A执行语句update test_icp set y=y+1 where x=1(x为primary key),不commit;会话B执行同样的语句update test_icp set y=y+1 where x=1,会话B堵塞,并最终报错。通过连接连接查询events_statements_history_long和events_stages_history_long,可以看到在updating阶段花了大约60s的时间。这主要因为实例上的innodb_lock_wait_timeout设置为60,等待60s后超时报错了。

    复制代码
    SELECT
    statement.EVENT_ID,
    stages.event_id,
    statement.sql_text,
    stages.event_name,
    stages.timer_wait
    FROM events_statements_history_long statement 
    join events_stages_history_long stages 
    on statement.event_id=stages.nesting_event_id 
    WHERE statement.sql_text = 'update test_icp set y=y+1 where x=1';
    +----------+----------+-------------------------------------+--------------------------------+----------------+
    | EVENT_ID | event_id | sql_text | event_name | timer_wait |
    +----------+----------+-------------------------------------+--------------------------------+----------------+
    | 5816 | 5817 | update test_icp set y=y+1 where x=1 | stage/sql/init | 195543000 |
    | 5816 | 5819 | update test_icp set y=y+1 where x=1 | stage/sql/checking permissions | 22730000 |
    | 5816 | 5821 | update test_icp set y=y+1 where x=1 | stage/sql/Opening tables | 66079000 |
    | 5816 | 5827 | update test_icp set y=y+1 where x=1 | stage/sql/init | 89116000 |
    | 5816 | 5828 | update test_icp set y=y+1 where x=1 | stage/sql/System lock | 218744000 |
    | 5816 | 5832 | update test_icp set y=y+1 where x=1 | stage/sql/updating | 6001362045000 |
    | 5816 | 5968 | update test_icp set y=y+1 where x=1 | stage/sql/end | 10435000 |
    | 5816 | 5969 | update test_icp set y=y+1 where x=1 | stage/sql/query end | 85979000 |
    | 5816 | 5983 | update test_icp set y=y+1 where x=1 | stage/sql/closing tables | 56562000 |
    | 5816 | 5990 | update test_icp set y=y+1 where x=1 | stage/sql/freeing items | 83563000 |
    | 5816 | 5992 | update test_icp set y=y+1 where x=1 | stage/sql/cleaning up | 4589000 |
    +----------+----------+-------------------------------------+--------------------------------+----------------+
    复制代码

    查看wait事件

    复制代码
    SELECT
    event_id,
    event_name,
    source,
    timer_wait,
    object_name,
    index_name,
    operation,
    nesting_event_id
    FROM events_waits_history_long
    WHERE nesting_event_id = 5832;
    *************************** 1. row ***************************
    event_id: 5832
    event_name: wait/io/table/sql/handler
    source: handler.cc:2782
    timer_wait: 6005946156624
    object_name: test_icp
    index_name: PRIMARY
    operation: fetch
    复制代码

    从结果来看,waits表中记录了一个fetch等待事件,但并没有更细的innodb行锁等待事件统计。

    (3).模拟MDL锁等待的例子
          会话A执行一个大查询select count(*) from test_slow,会话B执行表结构变更alter table test_slow modify c2 varchar(152);通过如下语句可以得到alter语句的执行过程,重点关注“stage/sql/Waiting for table metadata lock”阶段。

    复制代码
    SELECT
    statement.EVENT_ID,
    stages.event_id,
    statement.sql_text,
    stages.event_name as stage_name,
    stages.timer_wait as stage_time
    FROM events_statements_history_long statement 
    left join events_stages_history_long stages 
    on statement.event_id=stages.nesting_event_id
    WHERE statement.sql_text = 'alter table test_slow modify c2 varchar(152)';
    +-----------+-----------+----------------------------------------------+----------------------------------------------------+---------------+
    | EVENT_ID | event_id | sql_text | stage_name | stage_time |
    +-----------+-----------+----------------------------------------------+----------------------------------------------------+---------------+
    | 326526744 | 326526745 | alter table test_slow modify c2 varchar(152) | stage/sql/init | 216662000 |
    | 326526744 | 326526747 | alter table test_slow modify c2 varchar(152) | stage/sql/checking permissions | 18183000 |
    | 326526744 | 326526748 | alter table test_slow modify c2 varchar(152) | stage/sql/checking permissions | 10294000 |
    | 326526744 | 326526750 | alter table test_slow modify c2 varchar(152) | stage/sql/init | 4783000 |
    | 326526744 | 326526751 | alter table test_slow modify c2 varchar(152) | stage/sql/Opening tables | 140172000 |
    | 326526744 | 326526760 | alter table test_slow modify c2 varchar(152) | stage/sql/setup | 157643000 |
    | 326526744 | 326526769 | alter table test_slow modify c2 varchar(152) | stage/sql/creating table | 8723217000 |
    | 326526744 | 326526803 | alter table test_slow modify c2 varchar(152) | stage/sql/After create | 257332000 |
    | 326526744 | 326526832 | alter table test_slow modify c2 varchar(152) | stage/sql/Waiting for table metadata lock | 1000181831000 |
    | 326526744 | 326526835 | alter table test_slow modify c2 varchar(152) | stage/sql/After create | 33483000 |
    | 326526744 | 326526838 | alter table test_slow modify c2 varchar(152) | stage/sql/Waiting for table metadata lock | 1000091810000 |
    | 326526744 | 326526841 | alter table test_slow modify c2 varchar(152) | stage/sql/After create | 17187000 |
    | 326526744 | 326526844 | alter table test_slow modify c2 varchar(152) | stage/sql/Waiting for table metadata lock | 1000126464000 |
    | 326526744 | 326526847 | alter table test_slow modify c2 varchar(152) | stage/sql/After create | 27472000 |
    | 326526744 | 326526850 | alter table test_slow modify c2 varchar(152) | stage/sql/Waiting for table metadata lock | 561996133000 |
    | 326526744 | 326526853 | alter table test_slow modify c2 varchar(152) | stage/sql/After create | 124876000 |
    | 326526744 | 326526877 | alter table test_slow modify c2 varchar(152) | stage/sql/System lock | 30659000 |
    | 326526744 | 326526881 | alter table test_slow modify c2 varchar(152) | stage/sql/preparing for alter table | 40246000 |
    | 326526744 | 326526889 | alter table test_slow modify c2 varchar(152) | stage/sql/altering table | 36628000 |
    | 326526744 | 326526912 | alter table test_slow modify c2 varchar(152) | stage/sql/committing alter table to storage engine | 11846511000 |
    | 326526744 | 326528280 | alter table test_slow modify c2 varchar(152) | stage/sql/end | 43824000 |
    | 326526744 | 326528281 | alter table test_slow modify c2 varchar(152) | stage/sql/query end | 112557000 |
    | 326526744 | 326528299 | alter table test_slow modify c2 varchar(152) | stage/sql/closing tables | 27707000 |
    | 326526744 | 326528305 | alter table test_slow modify c2 varchar(152) | stage/sql/freeing items | 201614000 |
    | 326526744 | 326528308 | alter table test_slow modify c2 varchar(152) | stage/sql/cleaning up | 3584000 |
    +-----------+-----------+----------------------------------------------+----------------------------------------------------+---------------+
    复制代码

    从结果可以看到,出现了多次stage/sql/Waiting for table metadata lock阶段,并且间隔1s,说明每隔1s钟会重试判断。找一个该阶段的event_id,通过nesting_event_id关联,确定到底在等待哪个wait事件。

    复制代码
    SELECT
    event_id,
    event_name,
    source,
    timer_wait,
    object_name,
    index_name,
    operation,
    nesting_event_id
    FROM events_waits_history_long
    WHERE nesting_event_id = 326526850;
    +-----------+---------------------------------------------------+------------------+--------------+-------------+------------+------------+------------------+
    | event_id | event_name | source | timer_wait | object_name | index_name | operation | nesting_event_id |
    +-----------+---------------------------------------------------+------------------+--------------+-------------+------------+------------+------------------+
    | 326526851 | wait/synch/cond/sql/MDL_context::COND_wait_status | mdl.cc:1327 | 562417991328 | NULL | NULL | timed_wait | 326526850 |
    | 326526852 | wait/synch/mutex/mysys/my_thread_var::mutex | sql_class.h:3481 | 733248 | NULL | NULL | lock | 326526850 |
    +-----------+---------------------------------------------------+------------------+--------------+-------------+------------+------------+------------------+
    复制代码

    通过结果可以知道,产生阻塞的是条件变量MDL_context::COND_wait_status,并且显示了代码的位置。

    小结
         本文简单举例说明了如何通过Performance Schema得到数据库运行的统计信息,以及如何利用这些统计信息分析定位问题。通过Performance Schema,DBA可以能深入的理解系统,也能进一步定位问题的源头和本质。

  • 相关阅读:
    linux查看CPU和内存信息
    linux yum命令详解
    查看文件中关键字前后几行的内容
    vue.js+web storm安装及第一个vue.js
    android GPS: code should explicitly check to see if permission is available
    ASP.NET MVC Identity 使用自己的SQL Server数据库
    阿里云服务器,tomcat启动,一直卡在At least one JAR was scanned for TLDs yet contained no TLDs就不动了
    ASP.NET MVC4 MVC 当前上下文中不存在名称“Scripts”
    python 将windows字体中的汉字生成图片的方法
    Java android DES+Base64加密解密
  • 原文地址:https://www.cnblogs.com/zping/p/12132882.html
Copyright © 2011-2022 走看看