  • [Good for Enterprise] How we monitor GFE - updated 2014-08-09

    This post is also published in English: http://www.cnblogs.com/LarryAtCNBlog/p/3900870.html

    GFE monitoring posts, in order of publication:

    http://www.cnblogs.com/LarryAtCNBlog/p/3890033.html  Eng: http://www.cnblogs.com/LarryAtCNBlog/p/3890743.html

    Not long after I published the post above, the issue showed up again. It did not cause any user-level impact this time; only our internal team knew about it.

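    The issue involves events 5662/5669/5675/5733, which I covered in the first GFE monitoring post. I said there that these events should never appear, and that if they do, there really is a NOC communication problem. That still holds: this time one or two of them were logged, but the connection between GFE and the NOC recovered almost immediately, so there was no business impact and only the internal team received the alert. The same thing happened a few months ago, when an occasional communication failure with the NOC delayed email by a few minutes. Cases like this should be filtered out, otherwise whoever is on call gets woken in the middle of the night by the alert and has to get up to check what happened.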

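    I opened a case with Good for this issue as well, but they could not give a specific cause. Too many network factors are involved: packets leave our company proxy and pass through various ISPs before they reach Good, and a problem at any hop breaks communication with the NOC. Because the outage is short and the alert is delayed, it is nearly impossible to capture a trace or ping log, so I was not surprised that Good had no solution.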

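    So I changed direction and asked Good for their web service addresses and for the method to decode Good logs. The plan was to write a script that rules out this kind of false alert from two angles: probing the web services and analysing the Good logs.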

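    PS: Some of Good's key logs are encoded. Good can provide a way to decode them, but only with the consent of the person who signed the agreement with Good on our company's behalf. For us that is a manager in the UK. I figured there was no chance he would agree, so I dropped the log analysis.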

    That left the web services, so I got several web service URLs from Good:

    https://xml28.good.com/
    https://xml29.good.com/
    https://xml30.good.com/

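    The basic idea is to change the existing script so that, once it catches any of the NOC communication failure events, it calls a script that probes the web services to check whether they were really unreachable at that moment.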

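    Hence the Test-NOC.ps1 script below. It returns true if every URL is reachable and false if any one of them is not. In my experience GFE uses all of the web services rather than just one, possibly because Good load-balances across them.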

    ### This script is invoked by EventID.Monitoring.ps1
    ### It tests connectivity to the Good NOC
    
    $GoodNOC_Url = @(
        'https://xml28.good.com/',
        'https://xml29.good.com/',
        'https://xml30.good.com/'
    )
    
    $WebProxy = New-Object 'System.Net.WebProxy'
    # Change below proxy to your own proxy server and port
    $WebProxy.Address = 'http://ProxyServer:Port'
    
    $WebClient = New-Object 'System.Net.WebClient'
    $WebClient.Proxy = $WebProxy
    
    $Result = $true
    foreach($Url in $GoodNOC_Url)
    {
        $LoopCount = 0
        do
        {
            $LoopResult = $false
            $LoopCount++
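            try
            {
                # DownloadString throws when the request fails; count that as a failed attempt
                if(($WebClient.DownloadString($Url)).Contains('Congratulations!  You have successfully connected to the GoodLink Service.'))
                {
                    $LoopResult = $true
                    break
                }
            }
            catch
            {
                # Leave $LoopResult as $false and let the loop retry
            }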
        }
        while($LoopCount -lt 3)
        $Result = $Result -and $LoopResult
        # Log the outcome for this URL, not the cumulative result
        if($LoopResult)
        {
            Add-Log -Path $strLogFile_e -Value "NOC Testing succeed: [$Url]" -Type Info
        }
        else
        {
            Add-Log -Path $strLogFile_e -Value "NOC Testing failed: [$Url]" -Type Warning
        }
    }
    
    return $Result
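
    The "every URL must answer, with up to three tries each" rule can be tried without touching the network. The sketch below is not part of the original scripts: Test-AllReachable, its parameters and $FakeFetch are hypothetical names, and the download step is passed in as a script block so a fake one can stand in for $WebClient.DownloadString.

    ```powershell
    # Hypothetical helper (assumption, not from the original scripts).
    # -Fetch is a script block that takes a URL and returns the page text.
    function Test-AllReachable
    {
        PARAM(
            [String[]]$Url,
            [ScriptBlock]$Fetch,
            [String]$Expected,
            [Int]$MaxTries = 3
        )
        $AllOk = $true
        foreach($u in $Url)
        {
            $UrlOk = $false
            for($i = 0; $i -lt $MaxTries -and -not $UrlOk; $i++)
            {
                try
                {
                    # A fetch that throws counts as a failed attempt
                    $UrlOk = ([String](& $Fetch $u)).Contains($Expected)
                }
                catch
                {
                    $UrlOk = $false
                }
            }
            $AllOk = $AllOk -and $UrlOk
        }
        return $AllOk
    }

    # Fake fetcher for a dry run: xml29 always "times out"
    $FakeFetch = {
        PARAM($u)
        if($u -like '*xml29*'){ throw 'simulated timeout' }
        'Congratulations!  You have successfully connected to the GoodLink Service.'
    }
    $Urls = @('https://xml28.good.com/', 'https://xml29.good.com/', 'https://xml30.good.com/')
    $UpUrls = @('https://xml28.good.com/', 'https://xml30.good.com/')
    $OneDown = Test-AllReachable -Url $Urls -Fetch $FakeFetch -Expected 'successfully connected'
    $AllUp = Test-AllReachable -Url $UpUrls -Fetch $FakeFetch -Expected 'successfully connected'
    ```

    Here $OneDown is false because one URL never answers, and $AllUp is true. In a real run the fake would be replaced by something like { PARAM($u) $WebClient.DownloadString($u) }.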

    The main monitoring script then has to be updated to call the sub-script above:

    #Change the working directory
    Set-Location (Get-Item ($MyInvocation.MyCommand.Definition)).DirectoryName
    
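    #Define the events to monitor and their properties
    #EventClass groups similar events; when an event in a class goes over its threshold, the extra check script for that class is run
    #ID is the event ID to monitor. An array such as @(xx,yy) means the IDs are counted together: 10 of xx plus 10 of yy gives 20, which is then compared with the threshold
    #Pattern is a .NET regular expression; only events whose message matches it are counted
    #MinusPattern is also a regular expression; events whose message matches it are subtracted from the count
    #If both are set and Pattern matches 100 events while MinusPattern matches 90, the final count is 10, which is compared with the threshold. This rules out cases that recovered on their own
    #Threshold is the value the final count is compared with; an alert is raised when the count reaches or exceeds it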
    $Events = @(
        @{EventClass = 1; ID = 3563; Pattern = 'Pausing .*MAPI error'; MinusPattern = 'Unpausing'; Threshold = 100;},
        @{EventClass = 2; ID = @(1299, 1300, 1301); Pattern = $null; Threshold = 100;},
        @{EventClass = 1; ID = 3386; Pattern = 'GDMAPI_OpenMsgStore failed'; Threshold = 100;},
        @{EventClass = 3; ID = @(5662, 5669); Pattern = $null; Threshold = 1;},
        @{EventClass = 3; ID = 5675; Pattern = 'errNetConnect'; Threshold = 1;},
        @{EventClass = 3; ID = 5733; Pattern = 'errNetTimeout'; Threshold = 1;}
    )
    
    # If Script is empty, no extra check script is run and the threshold alone decides
    # If Script is set, that script is run and its return value (true or false) makes the final decision
    $EventClass = @{
        1 = @{Script = $null; Description = 'MAPI Error'};
        2 = @{Script = $null; Description = 'Good thread hung up'};
        3 = @{Script = 'Test-NOC.ps1'; Description = 'Failed to contact NOC'};
    }
    
    $Date = Get-Date
    $strDate = $Date.ToString("yyyy-MM-dd")
    
    $End_time = $Date
    $Start_time = $Date.AddMinutes(-15)
    $strLogFile = "${strDate}.log.txt"
    $strLogFile_e = "${strDate}_Error.log.txt"
    
    #Mail settings
    $Mail_From = "$($env:COMPUTERNAME)@fil.com"
    $Mail_To = 'xxxxx@xxx.xxx'
    $Mail_Subject = 'Good event IDs warning'
    $Mail_SMTPServer = 'smtpserver'
    
    Set-Content -Path $strLogFile_e -Value $null 
    
    function Add-Log
    {
        PARAM(
            [String]$Path,
            [String]$Value,
            [String]$Type
        )
        $Type = $Type.ToUpper()
        Write-Host "$((Get-Date).ToString('[HH:mm:ss] '))[$Type] $Value"
        if($Path){
            Add-Content -Path $Path -Value "$((Get-Date).ToString('[HH:mm:ss] '))[$Type] $Value"
        }
    }
    
    Add-Log -Path $strLogFile_e -Value "Catch logs after : $($Start_time.ToString('HH:mm:ss'))" -Type Info
    Add-Log -Path $strLogFile_e -Value "Catch logs before: $($End_time.ToString('HH:mm:ss'))" -Type Info
    Add-Log -Path $strLogFile_e -Value "Working directory: $($PWD.Path)" -Type Info
    
    $EventsCache = @(Get-EventLog -LogName Application -After $Start_time -Before $End_time.AddMinutes(5))
    Add-Log -Path $strLogFile_e -Value "Total logs count : $($EventsCache.Count)" -Type Info
    $Error_Array = @()
    foreach($e in $Events)
    {
        $Events_e_ALL = $null
        $Events_e_Matched = $null
        $Events_e_NMatched = $null
        $Events_e_FinalCount = 0
    
        $Events_e_ALL = @($EventsCache | ?{$e.ID -contains $_.EventID})
        Add-Log -Path $strLogFile_e -Value "Captured [$($e.ID -join '], [')], count: $($Events_e_ALL.Count)" -Type Info
        $Events_e_Matched = @($Events_e_ALL | ?{$_.Message -imatch $e.Pattern})
        Add-Log -Path $strLogFile_e -Value "Pattern matched, count: $($Events_e_Matched.Count)" -Type Info
        
        if($e.MinusPattern)
        {
            $Events_e_NMatched = @($Events_e_ALL | ?{$_.Message -imatch $e.MinusPattern})
            Add-Log -Path $strLogFile_e -Value "Minus pattern matched, count: $($Events_e_NMatched.Count)" -Type Info
        }
    
        $Events_e_FinalCount = $Events_e_Matched.Count - [int]$Events_e_NMatched.Count
        Add-Log -Path $strLogFile_e -Value "Final matched, count: $Events_e_FinalCount" -Type Info
        if($Events_e_FinalCount -ge $e.Threshold)
        {
            Add-Log -Path $strLogFile_e -Value "Over threshold: $($e.Threshold)" -Type Warning
            if($Error_Array -notcontains $e.EventClass)
            {
                $Error_Array += $e.EventClass
            }
        }
    }
    
    Add-Log -Path $strLogFile_e -Value "Alert classes captured: [$($Error_Array -join '], [')]" -Type Info
    for($e = 0; $e -lt $Error_Array.Count; $e++)
    {
        Add-Log -Path $strLogFile_e -Value "Process class: [$($Error_Array[$e])]" -Type Info
        if($EventClass.$($Error_Array[$e]).Script -imatch '^$')
        {
            Add-Log -Path $strLogFile_e -Value 'Final script not set, need to send alert.' -Type Warning
        }
        else
        {
            Add-Log -Path $strLogFile_e -Value "Run final script: [$($EventClass.$($Error_Array[$e]).Script)]" -Type Info
            if((& $EventClass.$($Error_Array[$e]).Script) -eq $true)
            {
                Add-Log -Path $strLogFile_e -Value 'Final script: [Positive], no need to send alert.' -Type Info
                $Error_Array[$e] = $null
            }
            else
            {
                Add-Log -Path $strLogFile_e -Value 'Final script: [Negative], need to send alert' -Type Warning
            }
        }
    }
    
    $Error_Array | %{$Mail_Body = @()}{
        if($_)
        {
            $Mail_Body += $EventClass.$_.Description
        }
    }
    $Mail_Body = $Mail_Body -join "`n"
    
    Add-Log -Path $strLogFile_e -Value "===================split line====================" -Type Info
    Get-Content -Path $strLogFile_e | Add-Content -Path $strLogFile
    
    If($Mail_Body)
    {
        try
        {
            Send-MailMessage -From $Mail_From -To $Mail_To -Subject $Mail_Subject -Body $Mail_Body -SmtpServer $Mail_SMTPServer -Attachments $strLogFile_e
        }
        catch
        {
            Add-Log -Path $strLogFile -Value "Failed to send mail, cause: $($Error[0])" -Type Error
        }
    }
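
    The Pattern / MinusPattern arithmetic described in the script comments can be checked on a handful of strings. This is a minimal sketch: the sample messages and the threshold of 1 are made up for illustration (the real rule for event 3563 uses 100), and plain strings stand in for event log entries.

    ```powershell
    # Hypothetical sample messages standing in for event 3563 entries
    $Messages = @(
        'Pausing user1: MAPI error',
        'Pausing user2: MAPI error',
        'Pausing user3: MAPI error',
        'Unpausing user1',
        'Unpausing user2'
    )
    # Same shape as one entry of $Events; the threshold is lowered for the demo
    $Rule = @{Pattern = 'Pausing .*MAPI error'; MinusPattern = 'Unpausing'; Threshold = 1}

    # Count messages that match Pattern, then subtract the ones that recovered
    $Matched = @($Messages | ?{$_ -imatch $Rule.Pattern})
    $Recovered = @($Messages | ?{$_ -imatch $Rule.MinusPattern})
    $FinalCount = $Matched.Count - $Recovered.Count
    $OverThreshold = $FinalCount -ge $Rule.Threshold
    ```

    Three pauses minus two unpauses leaves a final count of 1, which reaches the demo threshold, so this rule would flag its class. Note that the subtraction only compares totals; it does not pair each "Unpausing" with the pause it belongs to.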
  • Original post: https://www.cnblogs.com/LarryAtCNBlog/p/3900838.html