
    Building a Visual Monitoring Service with Prometheus + Grafana (Part 2): AlertManager Alerting

    Previous post: Building a Visual Monitoring Service with Prometheus + Grafana (Part 1): Prometheus

    1. Overview

    There are two ways to configure alerting in a Prometheus + Grafana setup: Prometheus's AlertManager, or Grafana's built-in Alert feature. This article covers the AlertManager approach.

    Official AlertManager documentation: https://prometheus.io/docs/alerting/latest/alertmanager/

    A Prometheus alert is always in one of three states: Inactive, Pending, or Firing.
    Inactive: the rule is being evaluated, but its condition is not currently met.
    Pending: the condition is met, but has not yet held for the duration configured in the rule's "for" field; once it has, the alert transitions to Firing.
    Firing: the alert is sent to AlertManager, which dispatches notifications to all configured receivers (applying grouping, inhibition and silencing as configured). Once the condition clears, the alert returns to Inactive, and the cycle repeats.
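
    For example, with a hypothetical rule like the sketch below, the alert stays Pending while up == 0 has held for less than one minute, and becomes Firing after that:

    # hypothetical example: Pending until the condition has held for 1m, then Firing
    - alert: InstanceDown
      expr: up == 0
      for: 1m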

    2. Installing AlertManager

    2.1. Download and install

    [root@server ~]# cd /usr/local/src
    [root@server src]# wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz
    [root@server src]# mkdir -p /usr/local/prometheus/
    [root@server src]# tar xvf alertmanager-0.22.2.linux-amd64.tar.gz -C /usr/local/prometheus/
    [root@server src]# mv /usr/local/prometheus/alertmanager-0.22.2.linux-amd64/ /usr/local/prometheus/alertmanager
    

    2.2. Set up AlertManager as a systemd service

    [root@server src]# vi /etc/systemd/system/alertmanager.service
    
    [Unit]
    Description=AlertManager Server
    Documentation=https://prometheus.io/docs/alerting/latest/alertmanager/
    After=network.target
    
    [Service]
    ExecStart=/usr/local/prometheus/alertmanager/alertmanager \
      --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml
    ExecReload=/bin/kill -s HUP $MAINPID
    ExecStop=/bin/kill -SIGINT $MAINPID
    Restart=on-failure
    [Install]
    WantedBy=multi-user.target
    
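    Optionally, the data directory and listen address can be set explicitly as well; the storage path below is an assumption matching the install layout used in this article (AlertManager's defaults are ./data and :9093):

    ExecStart=/usr/local/prometheus/alertmanager/alertmanager \
      --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml \
      --storage.path=/usr/local/prometheus/alertmanager/data \
      --web.listen-address=:9093
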

    2.3. Start alertmanager via systemctl

    [root@server src]# systemctl daemon-reload
    [root@server src]# systemctl start alertmanager
    [root@server src]# systemctl status alertmanager 
    [root@server src]# systemctl enable alertmanager 
    
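    Once started, you can check AlertManager's built-in health endpoint (it listens on port 9093 by default); an HTTP 200 means the service is up:

    [root@server src]# curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:9093/-/healthy
    200
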

    3. Configuring notification channels

    Official documentation: https://prometheus.io/docs/alerting/configuration/
    alertmanager.yml has three main jobs:
    (1) define the notification channels;
    (2) define the notification templates;
    (3) route alerts to the right receivers.

    [root@server src]# mkdir -p /usr/local/prometheus/alertmanager/templates/
    [root@server src]# vi /usr/local/prometheus/alertmanager/alertmanager.yml
    
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.domain.com'
      smtp_from: 'devops@domain.com'
      smtp_auth_username: 'devops@domain.com'
      smtp_auth_password: '123456'
      smtp_require_tls: false
    templates:
    - '/usr/local/prometheus/alertmanager/templates/*.tmpl'
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'email'
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://127.0.0.1:5001/'
    - name: 'email'
      email_configs:
      - to: 'admin@domain.com'
        send_resolved: true # also send a notification when the alert is resolved
        #html: '{{ template "email.to.html" . }}' # custom email template
        #headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }   # subject line
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'job', 'instance']
    

    A brief overview of the main configuration sections:
    global: global settings, such as the resolve timeout, SMTP settings, and API addresses for the various notification channels.
    route: the alert routing policy. It is a tree structure that is matched depth-first, left to right.
    receivers: the notification recipients, e.g. the commonly used email, wechat, slack and webhook channels.
    inhibit_rules: inhibition rules; while an alert matching the source matchers is firing, alerts matching the target matchers are muted.

    AlertManager's main processing flow:
    When an alert arrives, its labels determine which routes it matches (there can be multiple routes; a route contains multiple groups, and a group contains multiple alerts).
    The alert is placed into a matching group; if none exists, a new group is created.
    A new group waits for group_wait (more alerts for the same group may arrive during this time), determines from resolve_timeout whether each alert is resolved, and then sends a notification.
    An existing group waits for group_interval, again checks whether its alerts are resolved, and sends a notification when the time since the last one exceeds repeat_interval or the group has changed.

    resolve_timeout: 5m # how long to wait before declaring an alert resolved; defaults to 5 minutes
    route is the root node of the routing tree; every incoming alert enters here
    group_by: ['alertname'] # labels to group alerts by
    group_wait: 10s # how long to wait before sending the first notification for a new group
    group_interval: 10s # how long to wait before notifying about new alerts added to an existing group
    repeat_interval: 1h # how long to wait before re-sending a notification that was already sent successfully
    See https://www.kancloud.cn/huyipow/prometheus/527563 for a more detailed description.

    equal: ['alertname', 'job', 'instance'] # the source and target alerts must carry all three of these labels with identical values for the inhibition to apply

    Check the configuration file:

    [root@server src]# /usr/local/prometheus/alertmanager/amtool check-config /usr/local/prometheus/alertmanager/alertmanager.yml
    Checking 'alertmanager.yml'  SUCCESS
    Found:
    
    - global config
    - route
    - 1 inhibit rules
    - 2 receivers
    - 0 templates
    
    

    Example of routing alerts to multiple notification channels:

    # routing
    route:
      group_by: ['alertname'] # labels to group alerts by
      group_wait: 20s # initial wait before the first notification for a group
      group_interval: 20s # wait before notifying about new alerts in an existing group
      repeat_interval: 12h # how often to re-send notifications
      receiver: 'email' # default receiver
      # sub-routes
      routes:
      - receiver: 'wechat'
        match:
          severity: test  # when the severity label equals test, send the alert to the wechat receiver
    
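    The 'wechat' receiver referenced by the sub-route must itself be defined under receivers. A minimal sketch; the corp_id, agent_id, api_secret and to_user values below are placeholders you must replace with your own WeChat Work credentials:

    receivers:
    - name: 'wechat'
      wechat_configs:
      - corp_id: 'ww0000000000000000' # WeChat Work corporation ID (placeholder)
        agent_id: '1000002'           # application agent ID (placeholder)
        api_secret: 'your-api-secret' # application secret (placeholder)
        to_user: '@all'               # recipients (placeholder)
        send_resolved: true
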

    4. Configuring alert rules on the Prometheus side

    4.1. Point Prometheus at AlertManager and the rules directory

    [root@server src]# vi /usr/local/prometheus/prometheus/prometheus.yml
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
           - 127.0.0.1:9093
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
       - "rules/*.yml"
    

    Validate the configuration:

    [root@server src]# /usr/local/prometheus/prometheus/promtool check config /usr/local/prometheus/prometheus/prometheus.yml
    
    # Note: run this check every time you change the configuration or a rule file,
    # then run systemctl reload prometheus for the changes to take effect
    
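    Note that systemctl reload prometheus only works if the Prometheus unit file defines ExecReload (e.g. /bin/kill -s HUP $MAINPID, as in the alertmanager unit above). Alternatively, if Prometheus was started with --web.enable-lifecycle, it can be reloaded over HTTP:

    [root@server src]# curl -X POST http://127.0.0.1:9090/-/reload
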

    4.2. Create an alert rule

    [root@server src]# mkdir -p /usr/local/prometheus/prometheus/rules/
    [root@server src]# vi /usr/local/prometheus/prometheus/rules/general.yml
    
    groups:
    - name: node-up
      rules:
      - alert: node_up
        expr: up{job="node_exporter"} == 0
        for: 15s
        labels:
          severity: critical
          team: node
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
          description: "{{ $labels.instance }} has stopped unexpectedly, please investigate!"
    

    Parameter notes:
    name: node-up # rule group name
    expr: up{job="node_exporter"} == 0 # alert condition
    for: 15s # how long the condition must keep evaluating to true before the alert fires (it stays Pending in the meantime)
    severity: critical # alert severity
    annotations: # annotations attached to the alert
    summary: "{{ $labels.instance }} has been down for more than 15s!" # the notification text

    5. Viewing rules and alerts online

    Prometheus's built-in pages for rules and alerts:

    http://ip:9090/rules  # view the loaded rules
    http://ip:9090/alerts # view current alerts and their states
    

    AlertManager's built-in alert view:

    http://ip:9093/#/alerts 
    
    Stop node_exporter on the target server; after a short while you should receive an alert email
    [root@server src]# systemctl stop node_exporter
    
    Then start node_exporter again; shortly afterwards you should receive a resolved-notification email
    [root@server src]# systemctl start node_exporter
    
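    Alerts can also be inspected and silenced from the command line with amtool, or queried over the HTTP API; the alertname below matches the node_up rule defined earlier:

    # list currently active alerts
    [root@server src]# /usr/local/prometheus/alertmanager/amtool alert query --alertmanager.url=http://127.0.0.1:9093
    # silence the node_up alert for two hours
    [root@server src]# /usr/local/prometheus/alertmanager/amtool silence add alertname=node_up -a ops -d 2h -c "planned maintenance" --alertmanager.url=http://127.0.0.1:9093
    # query the raw v2 API
    [root@server src]# curl -s http://127.0.0.1:9093/api/v2/alerts
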

    6. Notification templates

    [root@server src]# vi /usr/local/prometheus/alertmanager/templates/email.tmpl
    
    {{ define "email.to.html" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{ range .Alerts.Firing }}
    Alert source: prometheus_alert <br>
    Severity: {{ .Labels.severity }} <br>
    Alert name: {{ .Labels.alertname }} <br>
    Host: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }}  <br>
    Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{ range .Alerts.Resolved }}
    Alert source: prometheus_alert <br>
    Severity: {{ .Labels.severity }} <br>
    Alert name: {{ .Labels.alertname }} <br>
    Host: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }} <br>
    Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    
    {{- end }}
    
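    To use this template, uncomment the html (and optionally headers) lines of the email receiver from section 3, then reload AlertManager (the unit file above defines ExecReload, so systemctl reload sends it a HUP):

    receivers:
    - name: 'email'
      email_configs:
      - to: 'admin@domain.com'
        send_resolved: true
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: "{{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }

    [root@server src]# systemctl reload alertmanager
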

    If the timestamps in the alert emails are 8 hours off (they are rendered in UTC), fix it by adding ".Add 28800e9" (8 hours in nanoseconds) or by using ".Local", for example:

    {{ .StartsAt.Format "2006-01-02 15:04:05" }} becomes {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}

    {{ .StartsAt.Format "2006-01-02 15:04:05" }} becomes {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}

    7. More alert rule examples

    7.1. Linux alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/linux_rule.yml
    
    groups:
      - name: linux_alert
        rules:
          - alert: linux_load5_over_5
            for: 5s
            expr: node_load5 > 5
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} load5 over 5, current value: {{ $value }}"
              summary: "linux load5 over 5"
    
          - alert: node_up
            for: 5s
            expr: up{job="node_exporter"}==0
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.instance }} has been down for more than 5s!"
              description: "{{ $labels.instance }} has stopped unexpectedly, please investigate!"
    
          - alert: cpu_used_percent_over_80
            for: 5s
            expr: 100 * (1 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])))  * on(instance) group_left(hostname) node_uname_info > 80
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
              summary: "cpu used percent over 80% per 1 min"
    
          - alert: memory_used_percent_over_85
            for: 5m
            expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes{instance=~"172.*"})) * 100 > 85
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
              summary: "memory used percent over 85%"
    
          - alert: eth0_input_traffic_over_10M
            for: 3m
            expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance=~"172.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
              summary: "eth0 input traffic network over 10M"
    
          - alert: eth0_output_traffic_over_10M
            for: 3m
            expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance=~"172.*|175.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
              summary: "eth0 output traffic network over 10M"
    
          - alert: disk_usage_over_80
            for: 10m
            expr: (node_filesystem_size_bytes{device=~"/dev/.+"} - node_filesystem_free_bytes{device=~"/dev/.+"} )/ node_filesystem_size_bytes{device=~"/dev/.+"} * 100 > 80
            labels:
              severity: critical
            annotations:
              description: "partition {{ $labels.mountpoint }} is over 80% full, current value: {{ $value }}"
              summary: "disk usage over 80%"
    

    7.2. ICMP alert rules

    These rules are mainly used to check whether a target is online or whether the network is flapping; the probe_success metric is produced by blackbox_exporter (see the scrape-job sketch after the rule file below).

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/check_icmp_rule.yml
    
    groups:
      - name: icmp check
        rules:
          - alert: icmp_check_failed
            for: 5s
            expr: probe_success{job="icmp_check"} == 0
            labels:
              severity: critical
            annotations:
              description: "ICMP probe of {{ $labels.hostname }} in group {{ $labels.group }} failed, current probe_success value: {{ $value }}"
              summary: "server {{ $labels.hostname }} in group {{ $labels.group }} is unreachable"
    
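    probe_success is exposed by blackbox_exporter rather than by the target itself. A minimal scrape-job sketch for prometheus.yml; the target address and extra labels are examples, and blackbox_exporter is assumed to listen on 127.0.0.1:9115:

      - job_name: 'icmp_check'
        metrics_path: /probe
        params:
          module: [icmp]  # the icmp module must be defined in blackbox.yml
        static_configs:
          - targets: ['192.168.1.10']  # host to probe (example)
            labels:
              group: 'prod'      # used by $labels.group in the rule (example)
              hostname: 'web-01' # used by $labels.hostname in the rule (example)
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115  # blackbox_exporter address (assumption)
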

    7.3. TCP port alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/port_check_rule.yml
    
    groups:
      - name: tcp port check
        rules:
          - alert: tcp_port_check_failed
            for: 5s
            expr: probe_success{job="port_status"} == 0
            labels:
              severity: critical
            annotations:
              description: "TCP probe of {{ $labels.service }} in group {{ $labels.group }} failed, current probe_success value: {{ $value }}"
              summary: "port of service {{ $labels.service }} in group {{ $labels.group }} is unreachable"
    

    7.4. URL alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/http_url_check_rule.yml
    
    groups:
      - name: httpd url check
        rules:
          - alert: http_url_check_failed
            for: 5s
            expr: probe_success{job="http_status"} == 0
            labels:
              severity: critical
            annotations:
              description: "URL probe of {{ $labels.service }} in group {{ $labels.group }} failed, current probe_success value: {{ $value }}"
              summary: "URL of service {{ $labels.service }} in group {{ $labels.group }} is unreachable"
    

    7.5. HTTP status-code alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/http_status_code_check_rule.yml
    
    groups:
      - name: http_status_code check
        rules:
          - alert: http_status_code_check_failed
            for: 1m
            expr: probe_http_status_code{job="http_status"} >= 400 and probe_success{job="http_status"} == 0
            labels:
              severity: critical
            annotations:
              summary: 'Service alert: website unreachable'
              description: 'Application {{$labels.instance}} of group {{ $labels.service }} is unreachable, please investigate; current status code: {{$value}}'
    

    7.6. MySQL alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/mysql_rule.yml
    
    groups:
      - name: MySQL-rules
        rules:
        - alert: MySQL_Status
          expr: up{job="mysql_exporter"} == 0
          for: 5s
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: MySQL has stopped!"
            description: "Checks whether the MySQL instance is running"
    
        - alert: MySQL_Slave_IO_Thread_Status
          expr: mysql_slave_status_slave_io_running == 0
          for: 5s
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: MySQL slave IO thread has stopped!"
            description: "Checks the MySQL replication IO thread"
    
        - alert: MySQL_Slave_SQL_Thread_Status
          expr: mysql_slave_status_slave_sql_running == 0
          for: 5s
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: MySQL slave SQL thread has stopped!"
            description: "Checks the MySQL replication SQL thread"
    
        - alert: MySQL_Slave_Delay_Status
          expr: mysql_slave_status_sql_delay > 30
          for: 5s
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: MySQL replication delay is more than 30s!"
            description: "Checks MySQL replication lag"
    
        - alert: Mysql_Too_Many_Connections
          expr: mysql_global_status_threads_connected > 200
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: too many connections"
            description: "{{$labels.instance}}: too many connections, please investigate (current value is: {{ $value }})"
    
        - alert: Mysql_Too_Many_slow_queries
          expr: rate(mysql_global_status_slow_queries[5m]) > 3
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: too many slow queries, please investigate"
            description: "{{$labels.instance}}: Mysql slow_queries is more than 3 per second (current value is: {{ $value }})"
    
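    These rules assume mysqld_exporter metrics scraped under the job name mysql_exporter. A minimal scrape-job sketch for prometheus.yml (9104 is mysqld_exporter's default port; the address is an example):

      - job_name: 'mysql_exporter'
        static_configs:
          - targets: ['127.0.0.1:9104']
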

    7.7. Redis alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/redis_rule.yml
    
    groups:
    - name:  Redis
      rules:
        - alert: redis_up
          expr: redis_up == 0
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Redis down (instance {{ $labels.instance }})"
            description: "Redis instance has stopped unexpectedly, please investigate!\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: MissingBackup
          expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Missing backup (instance {{ $labels.instance }})"
            description: "Redis has not been backed up for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: OutOfMemory
          expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Out of memory (instance {{ $labels.instance }})"
            description: "Redis is running out of memory (> 90%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: ReplicationBroken
          expr: delta(redis_connected_slaves[1m]) < 0
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Replication broken (instance {{ $labels.instance }})"
            description: "Redis instance lost a slave\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: TooManyConnections
          expr: redis_connected_clients > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Too many connections (instance {{ $labels.instance }})"
            description: "Redis instance has too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: NotEnoughConnections
          expr: redis_connected_clients < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Not enough connections (instance {{ $labels.instance }})"
            description: "Redis instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        - alert: RejectedConnections
          expr: increase(redis_rejected_connections_total[1m]) > 0
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Rejected connections (instance {{ $labels.instance }})"
            description: "Some connections to Redis have been rejected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    

    7.8. Elasticsearch alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/elasticsearch_rule.yml
    
    groups:
      - name: elasticsearch
        rules:
          - record: elasticsearch_filesystem_data_used_percent
            expr: 100 * (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes)
              / elasticsearch_filesystem_data_size_bytes
          - record: elasticsearch_filesystem_data_free_percent
            expr: 100 - elasticsearch_filesystem_data_used_percent
          - alert: ElasticsearchTooFewNodesRunning
            expr: elasticsearch_cluster_health_number_of_nodes < 3
            for: 5m
            labels:
              severity: critical
            annotations:
              description: There are only {{$value}} < 3 ElasticSearch nodes running
              summary: ElasticSearch running on less than 3 nodes
          - alert: ElasticsearchHeapTooHigh
            expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}
              > 0.9
            for: 15m
            labels:
              severity: critical
            annotations:
              description: The heap usage is over 90% for 15m
              summary: ElasticSearch node {{$labels.node}} heap usage is high
    

    7.9. RabbitMQ alert rules

    [root@server src]# vi /usr/local/prometheus/prometheus/rules/rabbitmq_rule.yml
    
    groups:
    - name: rabbitmq-up
      rules:
      - alert: rabbitmq_up
        expr: up{job="rabbitmq_exporter"} == 0
        for: 15s
        annotations:
          summary: "{{ $labels.name }} has been down for more than 15s!"
          description: "RabbitMQ {{$labels.name}} has been down"
    

    7.10. Predictive alerting with predict_linear

    The predict_linear function can be used for forecasting, for example to predict when a disk will fill up:

    - name: disk_alerts
      rules:
      - alert: DiskWillFillin4Hours
        expr: predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
      summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
    
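    In practice you usually want to exclude pseudo filesystems from the prediction; a common refinement (which fstype values to exclude is an assumption about your environment):

    - alert: DiskWillFillin4Hours
      expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.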
