zoukankan      html  css  js  c++  java
  • SQLServer 通过DMV实现低影响的自动监控和历史场景追溯

    一. 背景:

         我们管理和维护的数据库系统基本都是7*24小时运转的,运转时会出现什么情况谁都无法估计,往往一个平时运行很正常的系统,某天晚上突然就抽风了,而此刻并没有系统负责人在旁边,当值班同事的电话打到正在酣睡的DBA手机上时,DBA不得不朦朦胧胧的,条件反射要上数据库看看;要在家里连到公司的内外一般都需要DBA通知运维人员(也可能是个在做梦的哥们)先给自己开个VPN(某些管控严格的公司还需要DBA打车去公司才能处理),然后通过一大堆的验证才通过慢如蜗牛(往往也和此刻的心态有关)的网络,远程连接到了公司内网出问题的数据库上,此刻DBA才开始真正开始排查起问题来;如果说等待DBA登录到问题数据库上,问题依然在出现,那还算好,可以很快找到问题并处理,但是往往这种抽风的现象是短暂的,可能持续几分钟后,系统又恢复了正常,等到DBA"跑山涉水,翻山越岭" 的好不容易登录到数据库上检查时,数据库里的进程、锁、日志一起都正常,系统运行的很Hai;于是第二天上班时老大们问起昨天晚上事故的原因时,DBA只能凭空猜测,可能是网络、数据库阻塞、抑或是程序方面的问题吧......,此刻不同部门的人都猜是其他部门管理的东西出了问题,于是大家都把手头上的数据、监控图拿出来,证明自己这里是没有问题的;如果监控网络和系统的图和APP的Log都没用出现啥问题(说实话监控也不一定准确的),那就基本要把问题推到数据库身上了(DBA往往成为炮灰)。

          数据库有没有出问题,如果出了问题又是啥问题呢?windows Log和DB Log都没异常,如果我们没有把Profile持续的开启(基于性能的考虑,一般都不会持续开启),就很难说清楚在系统出问题时数据库究竟有没有出问题,出了问题又是因为什么原因引起的;如果说此类问题出现了一次,就消失了,那还算好,成为一个无头案,悲剧的是这类问题无规律,反复的出现,如果DBA不能找出问题的原因,也不能证明数据库当时是正常的话,估计在公司里面就没用立足之地了。

    二 . 对策:

        其实对这种灵异的系统抽风事件,有两种比较好的解决方案:

        1. 开启Trace 跟踪,就是数据库的Profile功能 ,这个还可以结合windows的性能计数器一同使用,能够直观的了解到特殊时间点上运行了什么语句,资源消耗情况是怎样的;不过这个方法比较消耗系统的资源,对访问压力比较大的数据库需要慎重;

        2. 收集数据库DMV当时的情况,使得DBA在故障时间过后,还能通过这些数据了解到事故发生时,数据库里面运行的语句,以及锁和资源的情况;这种方法只是访问系统性能视图,对数据库其他业务影响比较小,也是我接下来要介绍的方案。

    三. 实施方案:

       思路:根据数据库中某些动态指标,触发收集DMV的过程,必要时发送报警邮件

    1. 创建存储系统性能视图的表:

    --创建四个记录表
    USE[master]
    GO

    /****** Object: Table [dbo].[dc_block_info] Script Date: 05/23/2011 11:35:48 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    SET ANSI_PADDING ON
    GO

    CREATETABLE[dbo].[dc_block_info](
    [spid][smallint]NULL,
    [status][nchar](30) NULL,
    [SQLBuffer][nvarchar](max) NULL,
    [hostname][nchar](128) NULL,
    [BlkBy][varchar](10) NULL,
    [BlockedSQLBuffer][nvarchar](max) NULL,
    [LoginName][varchar](100) NULL,
    [DBName][varchar](50) NULL,
    [CPUTime][int]NULL,
    [DiskIO][int]NULL,
    [LastBatch][datetime]NULL,
    [program_name][nchar](128) NULL,
    [Command][varchar](100) NULL,
    [batch_id][int]NULL
    )
    ON[PRIMARY]

    GO

    SET ANSI_PADDING OFF
    GO

    USE[master]
    GO

    /****** Object: Table [dbo].[dc_Blocked_Resource_Info] Script Date: 05/23/2011 11:35:48 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    CREATETABLE[dbo].[dc_Blocked_Resource_Info](
    [spid][smallint]NULL,
    [dbid][smallint]NOTNULL,
    [ObjId][int]NOTNULL,
    [IndId][smallint]NOTNULL,
    [Type][nvarchar](4) NULL,
    [Resource][nvarchar](32) NULL,
    [Mode][nvarchar](8) NULL,
    [Status][nvarchar](5) NULL,
    [batch_id][int]NULL
    )
    ON[PRIMARY]

    GO

    USE[master]
    GO

    /****** Object: Table [dbo].[dc_info_BlockedInfo] Script Date: 05/23/2011 11:35:48 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    SET ANSI_PADDING ON
    GO

    CREATETABLE[dbo].[dc_info_BlockedInfo](
    [batch_id][int]NULL,
    [lock_type][varchar](100) NULL,
    [database_name][varchar](20) NULL,
    [blk_object][varchar](100) NULL,
    [lock_req][varchar](100) NULL,
    [waiter_sid][int]NULL,
    [wait_time][int]NULL,
    [waiter_batch][varchar](max) NULL,
    [waiter_stmt][varchar](max) NULL,
    [blocker_sid][int]NULL,
    [blocker_stmt][varchar](max) NULL,
    [create_date][datetime]NULL
    )
    ON[PRIMARY]

    GO

    SET ANSI_PADDING OFF
    GO

    USE[master]
    GO

    /****** Object: Table [dbo].[dc_info_SessionConn] Script Date: 05/23/2011 11:35:48 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    SET ANSI_PADDING ON
    GO

    CREATETABLE[dbo].[dc_info_SessionConn](
    [batch_id][int]NULL,
    [session_id][int]NULL,
    [blocking_session_id][int]NULL,
    [textdata][varchar](max) NULL,
    [login_name][varchar](30) NULL,
    [host_name][varchar](100) NULL,
    [database_name][varchar](30) NULL,
    [program_name][varchar](200) NULL,
    [command][varchar](100) NULL,
    [status][varchar](20) NULL,
    [cpu_time][int]NULL,
    [memory_usage_kb][int]NULL,
    [reads][int]NULL,
    [writes][int]NULL,
    [transaction_isolation_level][int]NULL,
    [connect_time][datetime]NULL,
    [last_read][datetime]NULL,
    [last_write][datetime]NULL,
    [net_transport][varchar](20) NULL,
    [client_net_address][varchar](30) NULL,
    [client_tcp_port][int]NULL,
    [local_tcp_port][int]NULL,
    [start_time][datetime]NULL,
    [wait_time][int]NULL,
    [wait_type][varchar](100) NULL,
    [last_wait_type][varchar](100) NULL,
    [wait_resource][varchar](1000) NULL,
    [open_transaction_count][int]NULL,
    [create_date][datetime]NULL
    )
    ON[PRIMARY]

    GO

    SET ANSI_PADDING OFF
    GO

    2. 定义性能收集的存储过程:

    USE[master]
    GO

    --数据库的阻塞和锁信息,该SP可以记录session中显示不出的信息
    createprocedure[dbo].[usp_blocker_info](@batch_idint)

    as

    begin

    set nocount on

    declare@spidsmallint,@blockedsmallint

    declare@c_SQLnvarchar(4000),@b_SQLnvarchar(4000)

    declare@idint

    --保存Sysprocesses 的内容

    createtable #Temp(spid smallint,status nchar(30),hostname nchar(128),program_name nchar(128)

    ,cmd
    nchar(16),cpu int,physical_io int,blocked smallint,dbid smallint

    ,loginame
    nchar(128),last_batch datetime

    ,SQLBuffer
    nvarchar(4000),BlockedSQLBuffer nvarchar(4000))


    --保存DBCC InputBuffer 的结果

    createtable #Temp1(id intidentity(1,1),eventtype varchar(20),parameters int,eventinfo nvarchar(4000))

    createtable #Temp_b(id intidentity(1,1),eventtype varchar(20),parameters int,eventinfo nvarchar(4000))


    select*into #Temp2

    from master..sysprocesses (nolock)


    --保存被阻塞的进程信息

    insertinto #Temp(spid ,status ,hostname ,program_name ,cmd ,cpu ,physical_io ,blocked ,dbid

    ,loginame ,last_batch )

    SELECT spid ,status ,hostname ,program_name ,cmd ,cpu ,physical_io ,blocked ,dbid

    ,
    convert(sysname, rtrim(loginame)) ,last_batch

    from #Temp2

    where blocked >0

    --保存阻塞的源头

    insertinto #Temp(spid ,status ,hostname ,program_name ,cmd ,cpu ,physical_io ,blocked ,dbid

    ,loginame ,last_batch )

    SELECT spid ,status ,hostname ,program_name ,cmd ,cpu ,physical_io ,blocked ,dbid

    ,
    convert(sysname, rtrim(loginame)) ,last_batch

    from #Temp2

    where spid in (select blocked from #Temp)

    and spid notin (select spid from #Temp)


    select@spid=min(spid) from #Temp


    while@spidisnotnull

    begin

    set@c_SQL='dbcc inputbuffer('+convert(varchar(5), @spid) +')'

    select@blocked=isnull(blocked,0) from #Tempwhere spid=@spid

    if(@blocked<>0)
    begin
    set@b_SQL='dbcc inputbuffer('+convert(varchar(5), @blocked) +')'
    insertinto #Temp_b
    exec (@b_SQL)

    select@id=@@identity

    update #Temp

    set BlockedSQLBuffer = #Temp_b.eventinfo

    from #Temp,#Temp_b

    where #Temp_b.id =@id

    and #Temp.blocked =@blocked
    end


    insertinto #Temp1

    exec (@c_SQL)


    select@id=@@identity


    update #Temp

    set SQLBuffer = #Temp1.eventinfo

    from #Temp,#Temp1

    where #Temp1.id =@id

    and #Temp.spid =@spid


    select@spid=min(spid) from #Tempwhere spid >@spid

    end


    insertinto dc_block_info(batch_id,spid,status,SQLBuffer,hostname,BlkBy,BlockedSQLBuffer,LoginName,DBName,
    CPUTime,DiskIO,LastBatch,program_name,Command)

    SELECT@batch_id,convert(char(5),spid) SPID,CASElower(status) When'sleeping'Thenlower(status) Elseupper(status) END Status

    ,SQLBuffer

    ,
    CASE hostname WhenNullThen' .'When''Then' .'Else hostname END HostName

    ,
    CASEisnull(convert(char(5),blocked),'0') When'0'Then' .'

    Elseisnull(convert(char(5),blocked),'0') END BlkBy,BlockedSQLBuffer

    ,loginame Login

    ,
    db_name(dbid) DBName,convert(varchar,cpu) CPUTime

    ,
    convert(varchar,physical_io) DiskIO,Last_Batch LastBatch

    ,program_name ProgramName, cmd Command

    from #Temp

    orderby BlkBy, spid


    set nocount off

    end

    Go

    USE[master]
    GO

    --预警SP
    CREATEproc[dbo].[RecodeAndAlertInfo]
    as
    begin
    declare@banch_idint
    select@banch_id=isnull(MAX(batch_id),0)+1from dc_info_SessionConn

    --记录当前所有会话信息
    insertinto dc_info_SessionConn
    SELECT@banch_id,S.session_id, R.blocking_session_id,
    current_execute_sql
    =SUBSTRING(T.text,
    R.statement_start_offset
    /2+1,
    CASE
    WHEN statement_end_offset =-1THENLEN(T.text)
    ELSE (R.statement_end_offset - statement_start_offset) /2+1
    END),
    S.login_name,S.
    host_name,databaseName=DB_NAME(R.database_id),S.program_name,R.command,
    S.status,S.cpu_time, memory_usage_kb
    = S.memory_usage *8, S.reads, S.writes,
    S.transaction_isolation_level,C.connect_time, C.last_read, C.last_write,
    C.net_transport, C.client_net_address, C.client_tcp_port, C.local_tcp_port,
    R.start_time, R.wait_time, R.wait_type, R.last_wait_type, R.wait_resource,
    R.open_transaction_count,
    GETDATE()
    FROM sys.dm_exec_sessions S
    LEFTJOIN sys.dm_exec_connections C
    ON S.session_id = C.session_id
    LEFTJOIN sys.dm_exec_requests R
    ON S.session_id = R.session_id
    AND C.connection_id = R.connection_id
    OUTER APPLY sys.dm_exec_sql_text(R.sql_handle) T
    WHERE S.is_user_process =1-- 如果不限制此条件,则查询所有进程(系统和用户进程)
    AND command isnotnull


    --记录当前阻塞信息
    insertinto dc_info_BlockedInfo
    select@banch_id,t1.resource_type as[lock type] ,db_name(resource_database_id) as[database]
    ,t1.resource_associated_entity_id
    as[blk object]
    ,t1.request_mode
    as[lock req]-- lock requested
    ,t1.request_session_id as[waiter sid]-- spid of waiter
    ,t2.wait_duration_ms as[wait time]
    ,(
    selecttextfrom sys.dm_exec_requests as r --- get sql for waiter
    cross apply sys.dm_exec_sql_text(r.sql_handle)
    where r.session_id = t1.request_session_id) as waiter_batch
    ,(
    selectsubstring(qt.text,r.statement_start_offset/2,
    (
    casewhen r.statement_end_offset =-1thenlen(convert(nvarchar(max), qt.text)) *2
    else r.statement_end_offset end- r.statement_start_offset)/2)
    from sys.dm_exec_requests as r
    cross apply sys.dm_exec_sql_text(r.sql_handle) as qt
    where r.session_id = t1.request_session_id) as waiter_stmt --- statement executing now
    ,t2.blocking_session_id as[blocker sid]--- spid of blocker
    ,(selecttextfrom sys.sysprocesses as p --- get sql for blocker
    cross apply sys.dm_exec_sql_text(p.sql_handle)
    where p.spid = t2.blocking_session_id) as blocker_stmt,getdate()
    from sys.dm_tran_locks as t1, sys.dm_os_waiting_tasks as t2
    where t1.lock_owner_address = t2.resource_address


    --记录资源信息
    insertinto dc_Blocked_Resource_Info
    selectconvert (smallint, req_spid) As spid,
    rsc_dbid
    As dbid,
    rsc_objid
    As ObjId,
    rsc_indid
    As IndId,
    substring (v.name, 1, 4) As Type,
    substring (rsc_text, 1, 32) as Resource,
    substring (u.name, 1, 8) As Mode,
    substring (x.name, 1, 5) As Status ,@banch_id
    from master.dbo.syslockinfo,
    master.dbo.spt_values v,
    master.dbo.spt_values x,
    master.dbo.spt_values u
    where master.dbo.syslockinfo.rsc_type = v.number
    and v.type ='LR'
    and master.dbo.syslockinfo.req_status = x.number
    and x.type ='LS'
    and master.dbo.syslockinfo.req_mode +1= u.number
    and u.type ='L'
    andsubstring (x.name, 1, 5) ='WAIT'
    orderby spid

    exec usp_blocker_info @banch_id

    end

    GO

    3. 创建信息收集和邮件报警的SP:

        说明:

        SP定义了一个邮件发送的过程,需要先配置好数据库的邮件发送(google一下很多的);

        SP需要做到JOB里面,一分钟运行一次收集信息;

        SP收集信息时的条件(不同的系统触发条件不一样):

           a. 用户链接数大于550,活动链接数大于40;

           b. 阻塞进程比率大于10%;

    USE[master]
    GO
    --创建触发监控条件,并发邮件
    CREATEproc[dbo].[ConAlert]
    as
    begin
    DECLARE@connfloat ,@activeconnfloat ,@blockedcountfloat,@spidcountfloat,@spidblockedfloat ,@countint
    DECLARE@bodyNVARCHAR(MAX),@subjectnvarchar(200)

    -- User Connections
    Select@conn=ISNULL(cntr_value,0) from sys.dm_os_performance_counters with(nolock)
    where counter_name='User Connections'

    if@conn>=550--根据用户链接数来收集信息和触发报警
    begin
    --active requests
    Select@activeconn=isnull(SUM(cntr_value),0) from sys.dm_os_performance_counters with(nolock)
    where counter_name ='Active requests'
    groupby counter_name

    set@count=1

    if@activeconn>=40--根据活动链接数来判断
    begin
    while1=1--循环收集信息
    begin
    exec RecodeAndAlertInfo

    if(@count%5=0) --连续次就发邮件
    begin
    set@subject='Server:'+@@SERVERNAME+' Connections Alert'
    SET@body='Server:'+@@SERVERNAME+CHAR(13)+'; UserConnections:'+cast(@connasvarchar)+CHAR(13)+'; ActiveRequests:'+cast(@activeconnasvarchar)
    EXEC msdb.dbo.sp_send_dbmail
    @recipients=N'Ken@xxx.com;itmon@xxx.com',
    @body=@body,
    @body_format='TEXT',
    @subject=@subject,
    @profile_name='dba_profile'--需要配置该模板
    end

    Select@activeconn=isnull(SUM(cntr_value),0) from sys.dm_os_performance_counters with(nolock)
    where counter_name ='Active requests'
    groupby counter_name

    if@activeconn>=40
    waitfor delay '00:00:05'
    else
    break;

    if@count>50
    break;

    set@count=@count+1;
    end
    end
    end
    else--根据Blocked 百分比来收集信息和触发报警
    begin
    select@spidcount=count(0) from sys.sysprocesses with(nolock) where spid>50
    select@blockedcount=count(0) from sys.sysprocesses with(nolock) where spid>50and blocked<>0
    --计算百分比
    set@spidblocked=@blockedcount/@spidcount
    set@count=1

    if(@spidblocked>=0.1) --10%时,记录信息
    begin
    while1=1--循环收集信息
    begin
    exec RecodeAndAlertInfo

    if(@count%5=0or@spidblocked>=0.15) --连续5次阻塞比率大于10%或者阻塞比率大于等于15%时,记录信息并发生报警邮件
    begin
    set@subject='Server:'+@@SERVERNAME+' Blocked Alert'
    SET@body='Server:'+@@SERVERNAME+CHAR(13)+'; BlockedCount:'+cast(@blockedcountasvarchar)+CHAR(13)+'; SPIDCount:'+cast(@spidcountasvarchar)
    EXEC msdb.dbo.sp_send_dbmail
    @recipients=N'Ken@xxx.com',
    @body=@body,
    @body_format='TEXT',
    @subject=@subject,
    @profile_name='dba_profile'
    end

    select@spidcount=count(0) from sys.sysprocesses with(nolock) where spid>50
    select@blockedcount=count(0) from sys.sysprocesses with(nolock) where spid>50and blocked<>0
    set@spidblocked=@blockedcount/@spidcount

    if@spidblocked>=0.1
    waitfor delay '00:00:05'
    else
    break;

    if@count>50
    break;

    set@count=@count+1;
    end
    end
    end
    end

    4. 查看信息:

    USE[master]
    GO
    --通过以下查询来分析信息
    declare@batch_idint
    select@batch_id=isnull(MAX(batch_id),1) from dc_info_SessionConn

    select*from dc_info_SessionConn where batch_id=@batch_id
    select*from dc_info_BlockedInfo where batch_id=@batch_id
    select*from dc_block_info where batch_id=@batch_id
    select*from dc_Blocked_Resource_Info where batch_id=@batch_id

        有了这些阻塞、进程和资源的信息,我们就可以了解系统出问题时,数据库是否出现异常,以及数据库异常时都有些什么语句在系统中运行,它们是否造成了大量的阻塞或消耗了大量的资源;这样DBA就方便定位问题了。


  • 相关阅读:
    10.31JS日记
    10.24JS日记
    10.23JS日记
    10.22JS日记
    10.19JS日记
    10.18JS日记
    Tomcat—Bad Request
    2016年上半年总结
    线程间操作无效
    压缩字符串的函数
  • 原文地址:https://www.cnblogs.com/fygh/p/2182512.html
Copyright © 2011-2022 走看看