zoukankan      html  css  js  c++  java
  • SQLServer 通过DMV实现低影响的自动监控和历史场景追溯

    一. 背景:

         我们管理和维护的数据库系统基本都是7*24小时运转的,运转时会出现什么情况谁都无法估计,往往一个平时运行很正常的系统,某天晚上突然就抽风了,而此刻并没有系统负责人在旁边,当值班同事的电话打到正在酣睡的DBA手机上时,DBA不得不朦朦胧胧的,条件反射要上数据库看看;要在家里连到公司的内外一般都需要DBA通知运维人员(也可能是个在做梦的哥们)先给自己开个VPN(某些管控严格的公司还需要DBA打车去公司才能处理),然后通过一大堆的验证才通过慢如蜗牛(往往也和此刻的心态有关)的网络,远程连接到了公司内网出问题的数据库上,此刻DBA才开始真正开始排查起问题来;如果说等待DBA登录到问题数据库上,问题依然在出现,那还算好,可以很快找到问题并处理,但是往往这种抽风的现象是短暂的,可能持续几分钟后,系统又恢复了正常,等到DBA"跑山涉水,翻山越岭" 的好不容易登录到数据库上检查时,数据库里的进程、锁、日志一起都正常,系统运行的很Hai;于是第二天上班时老大们问起昨天晚上事故的原因时,DBA只能凭空猜测,可能是网络、数据库阻塞、抑或是程序方面的问题吧......,此刻不同部门的人都猜是其他部门管理的东西出了问题,于是大家都把手头上的数据、监控图拿出来,证明自己这里是没有问题的;如果监控网络和系统的图和APP的Log都没用出现啥问题(说实话监控也不一定准确的),那就基本要把问题推到数据库身上了(DBA往往成为炮灰)。

          数据库有没有出问题,如果出了问题又是啥问题呢?windows Log和DB Log都没异常,如果我们没有把Profile持续的开启(基于性能的考虑,一般都不会持续开启),就很难说清楚在系统出问题时数据库究竟有没有出问题,出了问题又是因为什么原因引起的;如果说此类问题出现了一次,就消失了,那还算好,成为一个无头案,悲剧的是这类问题无规律,反复的出现,如果DBA不能找出问题的原因,也不能证明数据库当时是正常的话,估计在公司里面就没用立足之地了。

    二 . 对策:

        其实对这种灵异的系统抽风事件,有两种比较好的解决方案:

        1. 开启Trace 跟踪,就是数据库的Profile功能 ,这个还可以结合windows的性能计数器一同使用,能够直观的了解到特殊时间点上运行了什么语句,资源消耗情况是怎样的;不过这个方法比较消耗系统的资源,对访问压力比较大的数据库需要慎重;

        2. 收集数据库DMV当时的情况,使得DBA在故障时间过后,还能通过这些数据了解到事故发生时,数据库里面运行的语句,以及锁和资源的情况;这种方法只是访问系统性能视图,对数据库其他业务影响比较小,也是我接下来要介绍的方案。

    三. 实施方案:

       思路:根据数据库中某些动态指标,触发收集DMV的过程,必要时发送报警邮件

    1. 创建存储系统性能视图的表:

    --创建四个记录表
    USE[master]
    GO

    /****** Object: Table [dbo].[dc_block_info] Script Date: 05/23/2011 11:35:48 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    SET ANSI_PADDING ON
    GO

    CREATETABLE[dbo].[dc_block_info](
    [spid][smallint]NULL,
    [status][nchar](30) NULL,
    [SQLBuffer][nvarchar](max) NULL,
    [hostname][nchar](128) NULL,
    [BlkBy][varchar](10) NULL,
    [BlockedSQLBuffer][nvarchar](max) NULL,
    [LoginName][varchar](100) NULL,
    [DBName][varchar](50) NULL,
    [CPUTime][int]NULL,
    [DiskIO][int]NULL,
    [LastBatch][datetime]NULL,
    [program_name][nchar](128) NULL,
    [Command][varchar](100) NULL,
    [batch_id][int]NULL
    )
    ON[PRIMARY]

    GO

    SET ANSI_PADDING OFF
    GO

    USE[master]
    GO

    /****** Object: Table [dbo].[dc_Blocked_Resource_Info] Script Date: 05/23/2011 11:35:48 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    CREATETABLE[dbo].[dc_Blocked_Resource_Info](
    [spid][smallint]NULL,
    [dbid][smallint]NOTNULL,
    [ObjId][int]NOTNULL,
    [IndId][smallint]NOTNULL,
    [Type][nvarchar](4) NULL,
    [Resource][nvarchar](32) NULL,
    [Mode][nvarchar](8) NULL,
    [Status][nvarchar](5) NULL,
    [batch_id][int]NULL
    )
    ON[PRIMARY]

    GO

    USE[master]
    GO

    /****** Object: Table [dbo].[dc_info_BlockedInfo] Script Date: 05/23/2011 11:35:48 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    SET ANSI_PADDING ON
    GO

    CREATETABLE[dbo].[dc_info_BlockedInfo](
    [batch_id][int]NULL,
    [lock_type][varchar](100) NULL,
    [database_name][varchar](20) NULL,
    [blk_object][varchar](100) NULL,
    [lock_req][varchar](100) NULL,
    [waiter_sid][int]NULL,
    [wait_time][int]NULL,
    [waiter_batch][varchar](max) NULL,
    [waiter_stmt][varchar](max) NULL,
    [blocker_sid][int]NULL,
    [blocker_stmt][varchar](max) NULL,
    [create_date][datetime]NULL
    )
    ON[PRIMARY]

    GO

    SET ANSI_PADDING OFF
    GO

    USE[master]
    GO

    /****** Object: Table [dbo].[dc_info_SessionConn] Script Date: 05/23/2011 11:35:48 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    SET ANSI_PADDING ON
    GO

    CREATETABLE[dbo].[dc_info_SessionConn](
    [batch_id][int]NULL,
    [session_id][int]NULL,
    [blocking_session_id][int]NULL,
    [textdata][varchar](max) NULL,
    [login_name][varchar](30) NULL,
    [host_name][varchar](100) NULL,
    [database_name][varchar](30) NULL,
    [program_name][varchar](200) NULL,
    [command][varchar](100) NULL,
    [status][varchar](20) NULL,
    [cpu_time][int]NULL,
    [memory_usage_kb][int]NULL,
    [reads][int]NULL,
    [writes][int]NULL,
    [transaction_isolation_level][int]NULL,
    [connect_time][datetime]NULL,
    [last_read][datetime]NULL,
    [last_write][datetime]NULL,
    [net_transport][varchar](20) NULL,
    [client_net_address][varchar](30) NULL,
    [client_tcp_port][int]NULL,
    [local_tcp_port][int]NULL,
    [start_time][datetime]NULL,
    [wait_time][int]NULL,
    [wait_type][varchar](100) NULL,
    [last_wait_type][varchar](100) NULL,
    [wait_resource][varchar](1000) NULL,
    [open_transaction_count][int]NULL,
    [create_date][datetime]NULL
    )
    ON[PRIMARY]

    GO

    SET ANSI_PADDING OFF
    GO

    2. 定义性能收集的存储过程:

    USE[master]
    GO

    --数据库的阻塞和锁信息,该SP可以记录session中显示不出的信息
    createprocedure[dbo].[usp_blocker_info](@batch_idint)

    as

    begin

    set nocount on

    declare@spidsmallint,@blockedsmallint

    declare@c_SQLnvarchar(4000),@b_SQLnvarchar(4000)

    declare@idint

    --保存Sysprocesses 的内容

    createtable #Temp(spid smallint,status nchar(30),hostname nchar(128),program_name nchar(128)

    ,cmd
    nchar(16),cpu int,physical_io int,blocked smallint,dbid smallint

    ,loginame
    nchar(128),last_batch datetime

    ,SQLBuffer
    nvarchar(4000),BlockedSQLBuffer nvarchar(4000))


    --保存DBCC InputBuffer 的结果

    createtable #Temp1(id intidentity(1,1),eventtype varchar(20),parameters int,eventinfo nvarchar(4000))

    createtable #Temp_b(id intidentity(1,1),eventtype varchar(20),parameters int,eventinfo nvarchar(4000))


    select*into #Temp2

    from master..sysprocesses (nolock)


    --保存被阻塞的进程信息

    insertinto #Temp(spid ,status ,hostname ,program_name ,cmd ,cpu ,physical_io ,blocked ,dbid

    ,loginame ,last_batch )

    SELECT spid ,status ,hostname ,program_name ,cmd ,cpu ,physical_io ,blocked ,dbid

    ,
    convert(sysname, rtrim(loginame)) ,last_batch

    from #Temp2

    where blocked >0

    --保存阻塞的源头

    insertinto #Temp(spid ,status ,hostname ,program_name ,cmd ,cpu ,physical_io ,blocked ,dbid

    ,loginame ,last_batch )

    SELECT spid ,status ,hostname ,program_name ,cmd ,cpu ,physical_io ,blocked ,dbid

    ,
    convert(sysname, rtrim(loginame)) ,last_batch

    from #Temp2

    where spid in (select blocked from #Temp)

    and spid notin (select spid from #Temp)


    select@spid=min(spid) from #Temp


    while@spidisnotnull

    begin

    set@c_SQL='dbcc inputbuffer('+convert(varchar(5), @spid) +')'

    select@blocked=isnull(blocked,0) from #Tempwhere spid=@spid

    if(@blocked<>0)
    begin
    set@b_SQL='dbcc inputbuffer('+convert(varchar(5), @blocked) +')'
    insertinto #Temp_b
    exec (@b_SQL)

    select@id=@@identity

    update #Temp

    set BlockedSQLBuffer = #Temp_b.eventinfo

    from #Temp,#Temp_b

    where #Temp_b.id =@id

    and #Temp.blocked =@blocked
    end


    insertinto #Temp1

    exec (@c_SQL)


    select@id=@@identity


    update #Temp

    set SQLBuffer = #Temp1.eventinfo

    from #Temp,#Temp1

    where #Temp1.id =@id

    and #Temp.spid =@spid


    select@spid=min(spid) from #Tempwhere spid >@spid

    end


    insertinto dc_block_info(batch_id,spid,status,SQLBuffer,hostname,BlkBy,BlockedSQLBuffer,LoginName,DBName,
    CPUTime,DiskIO,LastBatch,program_name,Command)

    SELECT@batch_id,convert(char(5),spid) SPID,CASElower(status) When'sleeping'Thenlower(status) Elseupper(status) END Status

    ,SQLBuffer

    ,
    CASE hostname WhenNullThen' .'When''Then' .'Else hostname END HostName

    ,
    CASEisnull(convert(char(5),blocked),'0') When'0'Then' .'

    Elseisnull(convert(char(5),blocked),'0') END BlkBy,BlockedSQLBuffer

    ,loginame Login

    ,
    db_name(dbid) DBName,convert(varchar,cpu) CPUTime

    ,
    convert(varchar,physical_io) DiskIO,Last_Batch LastBatch

    ,program_name ProgramName, cmd Command

    from #Temp

    orderby BlkBy, spid


    set nocount off

    end

    Go

    USE[master]
    GO

    --预警SP
    CREATEproc[dbo].[RecodeAndAlertInfo]
    as
    begin
    declare@banch_idint
    select@banch_id=isnull(MAX(batch_id),0)+1from dc_info_SessionConn

    --记录当前所有会话信息
    insertinto dc_info_SessionConn
    SELECT@banch_id,S.session_id, R.blocking_session_id,
    current_execute_sql
    =SUBSTRING(T.text,
    R.statement_start_offset
    /2+1,
    CASE
    WHEN statement_end_offset =-1THENLEN(T.text)
    ELSE (R.statement_end_offset - statement_start_offset) /2+1
    END),
    S.login_name,S.
    host_name,databaseName=DB_NAME(R.database_id),S.program_name,R.command,
    S.status,S.cpu_time, memory_usage_kb
    = S.memory_usage *8, S.reads, S.writes,
    S.transaction_isolation_level,C.connect_time, C.last_read, C.last_write,
    C.net_transport, C.client_net_address, C.client_tcp_port, C.local_tcp_port,
    R.start_time, R.wait_time, R.wait_type, R.last_wait_type, R.wait_resource,
    R.open_transaction_count,
    GETDATE()
    FROM sys.dm_exec_sessions S
    LEFTJOIN sys.dm_exec_connections C
    ON S.session_id = C.session_id
    LEFTJOIN sys.dm_exec_requests R
    ON S.session_id = R.session_id
    AND C.connection_id = R.connection_id
    OUTER APPLY sys.dm_exec_sql_text(R.sql_handle) T
    WHERE S.is_user_process =1-- 如果不限制此条件,则查询所有进程(系统和用户进程)
    AND command isnotnull


    --记录当前阻塞信息
    insertinto dc_info_BlockedInfo
    select@banch_id,t1.resource_type as[lock type] ,db_name(resource_database_id) as[database]
    ,t1.resource_associated_entity_id
    as[blk object]
    ,t1.request_mode
    as[lock req]-- lock requested
    ,t1.request_session_id as[waiter sid]-- spid of waiter
    ,t2.wait_duration_ms as[wait time]
    ,(
    selecttextfrom sys.dm_exec_requests as r --- get sql for waiter
    cross apply sys.dm_exec_sql_text(r.sql_handle)
    where r.session_id = t1.request_session_id) as waiter_batch
    ,(
    selectsubstring(qt.text,r.statement_start_offset/2,
    (
    casewhen r.statement_end_offset =-1thenlen(convert(nvarchar(max), qt.text)) *2
    else r.statement_end_offset end- r.statement_start_offset)/2)
    from sys.dm_exec_requests as r
    cross apply sys.dm_exec_sql_text(r.sql_handle) as qt
    where r.session_id = t1.request_session_id) as waiter_stmt --- statement executing now
    ,t2.blocking_session_id as[blocker sid]--- spid of blocker
    ,(selecttextfrom sys.sysprocesses as p --- get sql for blocker
    cross apply sys.dm_exec_sql_text(p.sql_handle)
    where p.spid = t2.blocking_session_id) as blocker_stmt,getdate()
    from sys.dm_tran_locks as t1, sys.dm_os_waiting_tasks as t2
    where t1.lock_owner_address = t2.resource_address


    --记录资源信息
    insertinto dc_Blocked_Resource_Info
    selectconvert (smallint, req_spid) As spid,
    rsc_dbid
    As dbid,
    rsc_objid
    As ObjId,
    rsc_indid
    As IndId,
    substring (v.name, 1, 4) As Type,
    substring (rsc_text, 1, 32) as Resource,
    substring (u.name, 1, 8) As Mode,
    substring (x.name, 1, 5) As Status ,@banch_id
    from master.dbo.syslockinfo,
    master.dbo.spt_values v,
    master.dbo.spt_values x,
    master.dbo.spt_values u
    where master.dbo.syslockinfo.rsc_type = v.number
    and v.type ='LR'
    and master.dbo.syslockinfo.req_status = x.number
    and x.type ='LS'
    and master.dbo.syslockinfo.req_mode +1= u.number
    and u.type ='L'
    andsubstring (x.name, 1, 5) ='WAIT'
    orderby spid

    exec usp_blocker_info @banch_id

    end

    GO

    3. 创建信息收集和邮件报警的SP:

        说明:

        SP定义了一个邮件发送的过程,需要先配置好数据库的邮件发送(google一下很多的);

        SP需要做到JOB里面,一分钟运行一次收集信息;

        SP收集信息时的条件(不同的系统触发条件不一样):

           a. 用户链接数大于550,活动链接数大于40;

           b. 阻塞进程比率大于10%;

    USE[master]
    GO
    --创建触发监控条件,并发邮件
    CREATEproc[dbo].[ConAlert]
    as
    begin
    DECLARE@connfloat ,@activeconnfloat ,@blockedcountfloat,@spidcountfloat,@spidblockedfloat ,@countint
    DECLARE@bodyNVARCHAR(MAX),@subjectnvarchar(200)

    -- User Connections
    Select@conn=ISNULL(cntr_value,0) from sys.dm_os_performance_counters with(nolock)
    where counter_name='User Connections'

    if@conn>=550--根据用户链接数来收集信息和触发报警
    begin
    --active requests
    Select@activeconn=isnull(SUM(cntr_value),0) from sys.dm_os_performance_counters with(nolock)
    where counter_name ='Active requests'
    groupby counter_name

    set@count=1

    if@activeconn>=40--根据活动链接数来判断
    begin
    while1=1--循环收集信息
    begin
    exec RecodeAndAlertInfo

    if(@count%5=0) --连续次就发邮件
    begin
    set@subject='Server:'+@@SERVERNAME+' Connections Alert'
    SET@body='Server:'+@@SERVERNAME+CHAR(13)+'; UserConnections:'+cast(@connasvarchar)+CHAR(13)+'; ActiveRequests:'+cast(@activeconnasvarchar)
    EXEC msdb.dbo.sp_send_dbmail
    @recipients=N'Ken@xxx.com;itmon@xxx.com',
    @body=@body,
    @body_format='TEXT',
    @subject=@subject,
    @profile_name='dba_profile'--需要配置该模板
    end

    Select@activeconn=isnull(SUM(cntr_value),0) from sys.dm_os_performance_counters with(nolock)
    where counter_name ='Active requests'
    groupby counter_name

    if@activeconn>=40
    waitfor delay '00:00:05'
    else
    break;

    if@count>50
    break;

    set@count=@count+1;
    end
    end
    end
    else--根据Blocked 百分比来收集信息和触发报警
    begin
    select@spidcount=count(0) from sys.sysprocesses with(nolock) where spid>50
    select@blockedcount=count(0) from sys.sysprocesses with(nolock) where spid>50and blocked<>0
    --计算百分比
    set@spidblocked=@blockedcount/@spidcount
    set@count=1

    if(@spidblocked>=0.1) --10%时,记录信息
    begin
    while1=1--循环收集信息
    begin
    exec RecodeAndAlertInfo

    if(@count%5=0or@spidblocked>=0.15) --连续5次阻塞比率大于10%或者阻塞比率大于等于15%时,记录信息并发生报警邮件
    begin
    set@subject='Server:'+@@SERVERNAME+' Blocked Alert'
    SET@body='Server:'+@@SERVERNAME+CHAR(13)+'; BlockedCount:'+cast(@blockedcountasvarchar)+CHAR(13)+'; SPIDCount:'+cast(@spidcountasvarchar)
    EXEC msdb.dbo.sp_send_dbmail
    @recipients=N'Ken@xxx.com',
    @body=@body,
    @body_format='TEXT',
    @subject=@subject,
    @profile_name='dba_profile'
    end

    select@spidcount=count(0) from sys.sysprocesses with(nolock) where spid>50
    select@blockedcount=count(0) from sys.sysprocesses with(nolock) where spid>50and blocked<>0
    set@spidblocked=@blockedcount/@spidcount

    if@spidblocked>=0.1
    waitfor delay '00:00:05'
    else
    break;

    if@count>50
    break;

    set@count=@count+1;
    end
    end
    end
    end

    4. 查看信息:

    USE[master]
    GO
    --通过以下查询来分析信息
    declare@batch_idint
    select@batch_id=isnull(MAX(batch_id),1) from dc_info_SessionConn

    select*from dc_info_SessionConn where batch_id=@batch_id
    select*from dc_info_BlockedInfo where batch_id=@batch_id
    select*from dc_block_info where batch_id=@batch_id
    select*from dc_Blocked_Resource_Info where batch_id=@batch_id

        有了这些阻塞、进程和资源的信息,我们就可以了解系统出问题时,数据库是否出现异常,以及数据库异常时都有些什么语句在系统中运行,它们是否造成了大量的阻塞或消耗了大量的资源;这样DBA就方便定位问题了。


  • 相关阅读:
    linux 命令——19 find (转)
    linux 命令——18 locate (转)
    linux 命令——17 whereis(转)
    linux 命令——16 which(转)
    linux 命令——15 tail (转)
    linux 命令——14 head (转)
    Java for LeetCode 038 Count and Say
    Java for LeetCode 037 Sudoku Solver
    Java for LeetCode 036 Valid Sudoku
    Java for LeetCode 035 Search Insert Position
  • 原文地址:https://www.cnblogs.com/fygh/p/2182512.html
Copyright © 2011-2022 走看看