OLTP系统的问题很难排查和定位,这就是为什么要花那么多钱去请DBA
因为TP系统的请求很多都是毫秒级别,而且同时有大量的并发,所以由于资源,或随机的原因导致的问题,很难去定位根因
哪怕数据库系统尤其是商业数据库系统,已经采集了上千维的对系统和数据库的监控指标,但是仍然很难提高有效的root cause工具,帮助到DBA
DBSherlock就是为了帮助DBA进行异常发现和根因诊断所设计的performance explanation framework。
用户在UI上选取一个异常,DBSherlock就会试图从两个方面给出可能的causes,
Concise predicates describing the combination of system configurations or workload characteristics causing the performance anomaly
High-level diagnoses based on the existing causal models in the system
架构
数据的采集和预处理主要由DBSeer来完成,
采集主要包含如下几部分数据,
1. OS资源消耗
Resource consumption statistics from the OS (in our case, Linux’s/proc data), e.g., per-core CPU usage, number of disk I/Os, number of network packets, number of page faults, number of allocated/free pages, and number of context switches
2. DBMS的负载统计
Workload statistics from the DBMS (in our case, MySQL’s global status variables), e.g., number of logical reads, number of SELECT, UPDATE, DELETE, and INSERT commands executed, number of flushed and dirty pages, and the total lock wait-time.
3. 查询明细,包含时间,耗时,具体的SQL,使用的查询计划
Timestamped query logs, containing start-time, duration, and the SQL statements executed by the system, as well as the query plans used for each query
4. OS和DBMS的配置参数,环境变量,kernel的参数,数据库server的参数,网络配置,相关驱动
Configuration parameters from the OS and the DBMS, e.g., environment variables, kernel parameters, database server configurations, network settings, and (relevant) driver versions.
预处理包含,
首先基于transaction进行基于时间粒度的统计,比如每1秒,平均和分位数的延迟,count等;这些数据代表
To be continue。。。