
    Building a Visual Monitoring Service with Prometheus + Grafana (Part 2): AlertManager Alerting

    Previous post: Building a Visual Monitoring Service with Prometheus + Grafana (Part 1): Prometheus

    1. Overview

    There are two ways to configure alerting in a Prometheus + Grafana setup: Prometheus's AlertManager, or Grafana's built-in Alert feature. This article covers the AlertManager approach.

    Official AlertManager documentation: https://prometheus.io/docs/alerting/latest/alertmanager/

    A Prometheus alert is always in one of three states: Inactive, Pending, or Firing.
    Inactive: the rule is being evaluated, but its condition is not currently met.
    Pending: the condition is met, but has not yet held for the duration configured in the rule's "for" field; once it has, the alert transitions to Firing.
    Firing: the alert is sent to AlertManager, which dispatches notifications to all configured receivers (applying grouping, inhibition and silencing as configured). Once the condition clears, the alert returns to Inactive, and the cycle repeats.
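
    For example, with a hypothetical rule like the sketch below, the alert stays Pending while up == 0 has held for less than one minute, and becomes Firing after that:

    # hypothetical example: Pending until the condition has held for 1m, then Firing
    - alert: InstanceDown
      expr: up == 0
      for: 1m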

    2. Installing AlertManager

    2.1. Download and install

    [root@server ~]# cd /usr/local/src
    [root@server src]# wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz
    [root@server src]# mkdir -p /usr/local/prometheus/
    [root@server src]# tar xvf alertmanager-0.22.2.linux-amd64.tar.gz -C /usr/local/prometheus/
    [root@server src]# mv /usr/local/prometheus/alertmanager-0.22.2.linux-amd64/ /usr/local/prometheus/alertmanager
    

    2.2. Set up AlertManager as a systemd service

    [root@server src]# vi /etc/systemd/system/alertmanager.service
    
    [Unit]
    Description=AlertManager Server
    Documentation=https://prometheus.io/docs/alerting/latest/alertmanager/
    After=network.target
    
    [Service]
    ExecStart=/usr/local/prometheus/alertmanager/alertmanager \
      --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml
    ExecReload=/bin/kill -s HUP $MAINPID
    ExecStop=/bin/kill -SIGINT $MAINPID
    Restart=on-failure
    [Install]
    WantedBy=multi-user.target
    
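    Optionally, the data directory and listen address can be set explicitly as well; the storage path below is an assumption matching the install layout used in this article (AlertManager's defaults are ./data and :9093):

    ExecStart=/usr/local/prometheus/alertmanager/alertmanager \
      --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml \
      --storage.path=/usr/local/prometheus/alertmanager/data \
      --web.listen-address=:9093
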

    2.3. Start alertmanager via systemctl

    [root@server src]# systemctl daemon-reload
    [root@server src]# systemctl start alertmanager
    [root@server src]# systemctl status alertmanager 
    [root@server src]# systemctl enable alertmanager 
    
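    Once started, you can check AlertManager's built-in health endpoint (it listens on port 9093 by default); an HTTP 200 means the service is up:

    [root@server src]# curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:9093/-/healthy
    200
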

    3. Configuring notification channels

    Official documentation: https://prometheus.io/docs/alerting/configuration/
    alertmanager.yml has three main jobs:
    (1) define the notification channels;
    (2) define the notification templates;
    (3) route alerts to the right receivers.

    [root@server src]# mkdir -p /usr/local/prometheus/alertmanager/templates/
    [root@server src]# vi /usr/local/prometheus/alertmanager/alertmanager.yml
    
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.domain.com'
      smtp_from: 'devops@domain.com'
      smtp_auth_username: 'devops@domain.com'
      smtp_auth_password: '123456'
      smtp_require_tls: false
    templates:
    - '/usr/local/prometheus/alertmanager/templates/*.tmpl'
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'email'
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://127.0.0.1:5001/'
    - name: 'email'
      email_configs:
      - to: 'admin@domain.com'
        send_resolved: true # also send a notification when the alert is resolved
        #html: '{{ template "email.to.html" . }}' # custom email template
        #headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }   # subject line
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'job', 'instance']
    

    A brief overview of the main configuration sections:
    global: global settings, such as the resolve timeout, SMTP settings, and API addresses for the various notification channels.
    route: the alert routing policy. It is a tree structure that is matched depth-first, left to right.
    receivers: the notification recipients, e.g. the commonly used email, wechat, slack and webhook channels.
    inhibit_rules: inhibition rules; while an alert matching the source matchers is firing, alerts matching the target matchers are muted.

    AlertManager's main processing flow:
    When an alert arrives, its labels determine which routes it matches (there can be multiple routes; a route contains multiple groups, and a group contains multiple alerts).
    The alert is placed into a matching group; if none exists, a new group is created.
    A new group waits for group_wait (more alerts for the same group may arrive during this time), determines from resolve_timeout whether each alert is resolved, and then sends a notification.
    An existing group waits for group_interval, again checks whether its alerts are resolved, and sends a notification when the time since the last one exceeds repeat_interval or the group has changed.

    resolve_timeout: 5m # how long to wait before declaring an alert resolved; defaults to 5 minutes
    route is the root node of the routing tree; every incoming alert enters here
    group_by: ['alertname'] # labels to group alerts by
    group_wait: 10s # how long to wait before sending the first notification for a new group
    group_interval: 10s # how long to wait before notifying about new alerts added to an existing group
    repeat_interval: 1h # how long to wait before re-sending a notification that was already sent successfully
    See https://www.kancloud.cn/huyipow/prometheus/527563 for a more detailed description.

    equal: ['alertname', 'job', 'instance'] # the source and target alerts must carry all three of these labels with identical values for the inhibition to apply

    Check the configuration file:

    [root@server src]# /usr/local/prometheus/alertmanager/amtool check-config /usr/local/prometheus/alertmanager/alertmanager.yml
    Checking 'alertmanager.yml'  SUCCESS
    Found:
    
    - global config
    - route
    - 1 inhibit rules
    - 2 receivers
    - 0 templates
    
    

    Example of routing alerts to multiple notification channels:

    # routing
    route:
      group_by: ['alertname'] # labels to group alerts by
      group_wait: 20s # initial wait before the first notification for a group
      group_interval: 20s # wait before notifying about new alerts in an existing group
      repeat_interval: 12h # how often to re-send notifications
      receiver: 'email' # default receiver
      # sub-routes
      routes:
      - receiver: 'wechat'
        match:
          severity: test  # when the severity label equals test, send the alert to the wechat receiver
    
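    The 'wechat' receiver referenced by the sub-route must itself be defined under receivers. A minimal sketch; the corp_id, agent_id, api_secret and to_user values below are placeholders you must replace with your own WeChat Work credentials:

    receivers:
    - name: 'wechat'
      wechat_configs:
      - corp_id: 'ww0000000000000000' # WeChat Work corporation ID (placeholder)
        agent_id: '1000002'           # application agent ID (placeholder)
        api_secret: 'your-api-secret' # application secret (placeholder)
        to_user: '@all'               # recipients (placeholder)
        send_resolved: true
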

    4. Configuring alert rules on the Prometheus side

    4.1. Point Prometheus at AlertManager and the rules directory

    [root@server src]# vi /usr/local/prometheus/prometheus/prometheus.yml
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
           - 127.0.0.1:9093
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
       - "rules/*.yml"
    

    Validate the configuration:

    [root@server src]# /usr/local/prometheus/prometheus/promtool check config /usr/local/prometheus/prometheus/prometheus.yml
    
    # Note: run this check every time you change the configuration or a rule file,
    # then run systemctl reload prometheus for the changes to take effect
    
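    Note that systemctl reload prometheus only works if the Prometheus unit file defines ExecReload (e.g. /bin/kill -s HUP $MAINPID, as in the alertmanager unit above). Alternatively, if Prometheus was started with --web.enable-lifecycle, it can be reloaded over HTTP:

    [root@server src]# curl -X POST http://127.0.0.1:9090/-/reload
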

    4.2. Create an alert rule

    [root@server src]# mkdir -p /usr/local/prometheus/prometheus/rules/
    [root@server src]# vi /usr/local/prometheus/prometheus/rules/general.yml
    
    groups:
    - name: node-up
      rules:
      - alert: node_up
        expr: up{job="node_exporter"} == 0
        for: 15s
        labels:
          severity: critical
          team: node
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
          description: "{{ $labels.instance }} has stopped unexpectedly, please investigate!"
    

    Parameter notes:
    name: node-up # rule group name
    expr: up{job="node_exporter"} == 0 # alert condition
    for: 15s # how long the condition must keep evaluating to true before the alert fires (it stays Pending in the meantime)
    severity: critical # alert severity
    annotations: # annotations attached to the alert
    summary: "{{ $labels.instance }} has been down for more than 15s!" # the notification text

    5. Viewing rules and alerts online

    Prometheus's built-in pages for rules and alerts:

    http://ip:9090/rules  # view the loaded rules
    http://ip:9090/alerts # view current alerts and their states
    

    AlertManager's built-in alert view:

    http://ip:9093/#/alerts 
    
    Stop node_exporter on the target server; after a short while you should receive an alert email
    [root@server src]# systemctl stop node_exporter
    
    Then start node_exporter again; shortly afterwards you should receive a resolved-notification email
    [root@server src]# systemctl start node_exporter
    
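    Alerts can also be inspected and silenced from the command line with amtool, or queried over the HTTP API; the alertname below matches the node_up rule defined earlier:

    # list currently active alerts
    [root@server src]# /usr/local/prometheus/alertmanager/amtool alert query --alertmanager.url=http://127.0.0.1:9093
    # silence the node_up alert for two hours
    [root@server src]# /usr/local/prometheus/alertmanager/amtool silence add alertname=node_up -a ops -d 2h -c "planned maintenance" --alertmanager.url=http://127.0.0.1:9093
    # query the raw v2 API
    [root@server src]# curl -s http://127.0.0.1:9093/api/v2/alerts
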

    6. Notification templates

    [root@server src]# vi /usr/local/prometheus/alertmanager/templates/email.tmpl
    
    {{ define "email.to.html" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{ range .Alerts.Firing }}
    Alert source: prometheus_alert <br>
    Severity: {{ .Labels.severity }} <br>
    Alert name: {{ .Labels.alertname }} <br>
    Host: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }}  <br>
    Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{ range .Alerts.Resolved }}
    Alert source: prometheus_alert <br>
    Severity: {{ .Labels.severity }} <br>
    Alert name: {{ .Labels.alertname }} <br>
    Host: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }} <br>
    Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    
    {{- end }}
    
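    To use this template, uncomment the html (and optionally headers) lines of the email receiver from section 3, then reload AlertManager (the unit file above defines ExecReload, so systemctl reload sends it a HUP):

    receivers:
    - name: 'email'
      email_configs:
      - to: 'admin@domain.com'
        send_resolved: true
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: "{{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }

    [root@server src]# systemctl reload alertmanager
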

    If the timestamps in the alert emails are 8 hours off (they are rendered in UTC), fix it by adding ".Add 28800e9" (8 hours in nanoseconds) or by using ".Local", for example:

    {{ .StartsAt.Format "2006-01-02 15:04:05" }} becomes {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}

    {{ .StartsAt.Format "2006-01-02 15:04:05" }} becomes {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}

    7. More alert rule examples

    7.1. Linux alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/linux_rule.yml
    
    groups:
      - name: linux_alert
        rules:
          - alert: linux_load5_over_5
            for: 5s
            expr: node_load5 > 5
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} load5 over 5, current value: {{ $value }}"
              summary: "linux load5 over 5"
    
          - alert: node_up
            for: 5s
            expr: up{job="node_exporter"}==0
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.instance }} has been down for more than 5s!"
              description: "{{ $labels.instance }} has stopped unexpectedly, please investigate!"
    
          - alert: cpu_used_percent_over_80
            for: 5s
            expr: 100 * (1 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])))  * on(instance) group_left(hostname) node_uname_info > 80
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
              summary: "cpu used percent over 80% per 1 min"
    
          - alert: memory_used_percent_over_85
            for: 5m
            expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes{instance=~"172.*"})) * 100 > 85
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
              summary: "memory used percent over 85%"
    
          - alert: eth0_input_traffic_over_10M
            for: 3m
            expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance=~"172.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
              summary: "eth0 input traffic network over 10M"
    
          - alert: eth0_output_traffic_over_10M
            for: 3m
            expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance=~"172.*|175.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
              summary: "eth0 output traffic network over 10M"
    
          - alert: disk_usage_over_80
            for: 10m
            expr: (node_filesystem_size_bytes{device=~"/dev/.+"} - node_filesystem_free_bytes{device=~"/dev/.+"} )/ node_filesystem_size_bytes{device=~"/dev/.+"} * 100 > 80
            labels:
              severity: critical
            annotations:
              description: "partition {{ $labels.mountpoint }} is over 80% full, current value: {{ $value }}"
              summary: "disk usage over 80%"
    

    7.2. ICMP alert rules

    These rules are mainly used to check whether a target is online or whether the network is flapping; the probe_success metric is produced by blackbox_exporter (see the scrape-job sketch after the rule file below).

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/check_icmp_rule.yml
    
    groups:
      - name: icmp check
        rules:
          - alert: icmp_check_failed
            for: 5s
            expr: probe_success{job="icmp_check"} == 0
            labels:
              severity: critical
            annotations:
              description: "ICMP probe of {{ $labels.hostname }} in group {{ $labels.group }} failed, current probe_success value: {{ $value }}"
              summary: "server {{ $labels.hostname }} in group {{ $labels.group }} is unreachable"
    
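    probe_success is exposed by blackbox_exporter rather than by the target itself. A minimal scrape-job sketch for prometheus.yml; the target address and extra labels are examples, and blackbox_exporter is assumed to listen on 127.0.0.1:9115:

      - job_name: 'icmp_check'
        metrics_path: /probe
        params:
          module: [icmp]  # the icmp module must be defined in blackbox.yml
        static_configs:
          - targets: ['192.168.1.10']  # host to probe (example)
            labels:
              group: 'prod'      # used by $labels.group in the rule (example)
              hostname: 'web-01' # used by $labels.hostname in the rule (example)
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115  # blackbox_exporter address (assumption)
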

    7.3. TCP port alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/port_check_rule.yml
    
    groups:
      - name: tcp port check
        rules:
          - alert: tcp_port_check_failed
            for: 5s
            expr: probe_success{job="port_status"} == 0
            labels:
              severity: critical
            annotations:
              description: "TCP probe of {{ $labels.service }} in group {{ $labels.group }} failed, current probe_success value: {{ $value }}"
              summary: "port of service {{ $labels.service }} in group {{ $labels.group }} is unreachable"
    

    7.4. URL alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/http_url_check_rule.yml
    
    groups:
      - name: httpd url check
        rules:
          - alert: http_url_check_failed
            for: 5s
            expr: probe_success{job="http_status"} == 0
            labels:
              severity: critical
            annotations:
              description: "URL probe of {{ $labels.service }} in group {{ $labels.group }} failed, current probe_success value: {{ $value }}"
              summary: "URL of service {{ $labels.service }} in group {{ $labels.group }} is unreachable"
    

    7.5. HTTP status-code alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/http_status_code_check_rule.yml
    
    groups:
      - name: http_status_code check
        rules:
          - alert: http_status_code_check_failed
            for: 1m
            expr: probe_http_status_code{job="http_status"} >= 400 and probe_success{job="http_status"} == 0
            labels:
              severity: critical
            annotations:
              summary: 'Service alert: website unreachable'
              description: 'Application {{$labels.instance}} of group {{ $labels.service }} is unreachable, please investigate; current status code: {{$value}}'
    

    7.6. MySQL alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/mysql_rule.yml
    
    groups:
      - name: MySQL-rules
        rules:
        - alert: MySQL_Status
          expr: up{job="mysql_exporter"} == 0
          for: 5s
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: MySQL has stopped!"
            description: "Checks whether the MySQL instance is running"
    
        - alert: MySQL_Slave_IO_Thread_Status
          expr: mysql_slave_status_slave_io_running == 0
          for: 5s
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: MySQL slave IO thread has stopped!"
            description: "Checks the MySQL replication IO thread"
    
        - alert: MySQL_Slave_SQL_Thread_Status
          expr: mysql_slave_status_slave_sql_running == 0
          for: 5s
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: MySQL slave SQL thread has stopped!"
            description: "Checks the MySQL replication SQL thread"
    
        - alert: MySQL_Slave_Delay_Status
          expr: mysql_slave_status_sql_delay > 30
          for: 5s
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: MySQL replication delay is more than 30s!"
            description: "Checks MySQL replication lag"
    
        - alert: Mysql_Too_Many_Connections
          expr: mysql_global_status_threads_connected > 200
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: too many connections"
            description: "{{$labels.instance}}: too many connections, please investigate (current value is: {{ $value }})"
    
        - alert: Mysql_Too_Many_slow_queries
          expr: rate(mysql_global_status_slow_queries[5m]) > 3
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: too many slow queries, please investigate"
            description: "{{$labels.instance}}: Mysql slow_queries is more than 3 per second (current value is: {{ $value }})"
    
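    These rules assume mysqld_exporter metrics scraped under the job name mysql_exporter. A minimal scrape-job sketch for prometheus.yml (9104 is mysqld_exporter's default port; the address is an example):

      - job_name: 'mysql_exporter'
        static_configs:
          - targets: ['127.0.0.1:9104']
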

    7.7. Redis alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/redis_rule.yml
    
    groups:
    - name:  Redis
      rules:
        - alert: redis_up
          expr: redis_up == 0
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Redis down (instance {{ $labels.instance }})"
            description: "Redis instance has stopped unexpectedly, please investigate!\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: MissingBackup
          expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Missing backup (instance {{ $labels.instance }})"
            description: "Redis has not been backed up for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: OutOfMemory
          expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Out of memory (instance {{ $labels.instance }})"
            description: "Redis is running out of memory (> 90%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: ReplicationBroken
          expr: delta(redis_connected_slaves[1m]) < 0
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Replication broken (instance {{ $labels.instance }})"
            description: "Redis instance lost a slave\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: TooManyConnections
          expr: redis_connected_clients > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Too many connections (instance {{ $labels.instance }})"
            description: "Redis instance has too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: NotEnoughConnections
          expr: redis_connected_clients < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Not enough connections (instance {{ $labels.instance }})"
            description: "Redis instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: RejectedConnections
          expr: increase(redis_rejected_connections_total[1m]) > 0
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Rejected connections (instance {{ $labels.instance }})"
            description: "Some connections to Redis have been rejected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    

    7.8. Elasticsearch alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/elasticsearch_rule.yml
    
    groups:
      - name: elasticsearch
        rules:
          - record: elasticsearch_filesystem_data_used_percent
            expr: 100 * (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes)
              / elasticsearch_filesystem_data_size_bytes
          - record: elasticsearch_filesystem_data_free_percent
            expr: 100 - elasticsearch_filesystem_data_used_percent
          - alert: ElasticsearchTooFewNodesRunning
            expr: elasticsearch_cluster_health_number_of_nodes < 3
            for: 5m
            labels:
              severity: critical
            annotations:
              description: There are only {{$value}} < 3 ElasticSearch nodes running
              summary: ElasticSearch running on less than 3 nodes
          - alert: ElasticsearchHeapTooHigh
            expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}
              > 0.9
            for: 15m
            labels:
              severity: critical
            annotations:
              description: The heap usage is over 90% for 15m
              summary: ElasticSearch node {{$labels.node}} heap usage is high
    

    7.9. RabbitMQ alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/rabbitmq_rule.yml
    
    groups:
    - name: rabbitmq-up
      rules:
      - alert: rabbitmq_up
        expr: up{job="rabbitmq_exporter"} == 0
        for: 15s
        annotations:
          summary: "{{ $labels.name }} has been down for more than 15s!"
          description: "RabbitMQ {{$labels.name}} has been down"
    

    7.10. Predictive alerting with predict_linear

    The predict_linear function can be used for forecasting, for example to predict when a disk will fill up:

    - name: disk_alerts
      rules:
      - alert: DiskWillFillin4Hours
        expr: predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
      summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
    
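    In practice you usually want to exclude pseudo filesystems from the prediction; a common refinement (which fstype values to exclude is an assumption about your environment):

    - alert: DiskWillFillin4Hours
      expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.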
