Alerting rules
global:
  scrape_interval: 15s
  evaluation_interval: 15s        # evaluate the alerting rules every 15 seconds
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]   # where alerts are pushed; normally the Alertmanager address
rule_files:
  - "test_rules.yml"              # alerting rule file
scrape_configs:
  - job_name: 'node'              # user-defined job name for this scrape target
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'CDG-MS'
    honor_labels: true
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['localhost:8089']
    relabel_configs:
      - target_label: env
        replacement: dev
  - job_name: 'eureka'
    file_sd_configs:
      - files:
          - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json"
    relabel_configs:
      - source_labels: [__job_name__]
        regex: (.*)
        target_label: job
        replacement: ${1}
      - target_label: env
        replacement: dev
As shown above, we can point Prometheus at files containing the alerting rules. The rule file looks like this:
groups:
  - name: example                 # name of this alerting rule group
    rules:
      - alert: InstanceDown       # fires when a job's metrics endpoint cannot be scraped; the alert is sent to Alertmanager
        expr: up == 0
        for: 1m                   # duration: the condition must hold for 1 minute before the alert fires
        labels:
          serverity: page         # custom label
        annotations:
          summary: "Instance {{ $labels.instance }} down"   # custom summary
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."   # custom description
The rule above is a very common one: it checks whether an application instance is down.
After changing the configuration, it can be reloaded through this endpoint: curl -X POST http://localhost:9090/-/reload
Note that Prometheus must be started as follows, otherwise the configuration cannot be reloaded:
./prometheus --config.file=prometheus.yml --web.enable-lifecycle
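For completeness, here is a minimal sketch of triggering the same reload from code, for example after a deployment step (the class name is illustrative and not from the original article; Java 11+'s built-in HttpClient is assumed):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PrometheusReload {
    public static void main(String[] args) throws Exception {
        // POST to the lifecycle endpoint; only works when Prometheus was started with --web.enable-lifecycle
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9090/-/reload"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("reload status: " + response.statusCode());   // 200 means the config was reloaded
    }
}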
Custom alert notifications
Modify the prometheus.yml configuration file:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:17201"]   # where alert payloads are pushed
When an alert needs to be delivered, Prometheus uses the configuration above to push it to the service at localhost:17201. The push works as follows:
Endpoint: /api/v1/alerts
Sample handler:
// Handler method inside a Spring @RestController (log is a Lombok @Slf4j logger)
@RequestMapping(value = "/api/v1/alerts")
public String alert(@RequestBody String body) {
    log.info("/api/v1/alerts = {}", body);
    return "success";
}
Request body structure:
[{
"labels": {
"alertname": "InstanceDown",
"env": "dev",
"instance": "10.208.204.46:19999",
"job": "RMS-MS",
"serverity": "page"
},
"annotations": {
"description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
"summary": "Instance 10.208.204.46:19999 down"
},
"startsAt": "2018-06-19T17:07:54.140071559+08:00",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
},
{
"labels": {
"alertname": "InstanceDown",
"env": "dev",
"instance": "10.208.204.46:19999",
"job": "RMS-MS",
"serverity": "page"
},
"annotations": {
"description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
"summary": "Instance 10.208.204.46:19999 down"
},
"startsAt": "2018-06-19T17:07:54.140071559+08:00",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
},
{
"labels": {
"alertname": "InstanceDown",
"env": "dev",
"instance": "192.168.164.1:18093",
"job": "RMS-MS",
"serverity": "page"
},
"annotations": {
"description": "192.168.164.1:18093 of job RMS-MS has been down for more than 5 minutes.",
"summary": "Instance 192.168.164.1:18093 down"
},
"startsAt": "2018-06-19T17:07:54.140071559+08:00",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
}
]
If, for example, all three RMS-MS machines go down, Prometheus sends the data above to localhost:17201/api/v1/alerts,
and we can then build our own alert notifications from that data.
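As a minimal sketch of consuming that payload (the PrometheusAlert class and its parse helper are hypothetical names, not from the original article), the raw body received by the handler above can be deserialized with Jackson before deciding how to notify:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical DTO mirroring the fields shown in the payload above
public class PrometheusAlert {
    public Map<String, String> labels;
    public Map<String, String> annotations;
    public String startsAt;
    public String endsAt;
    public String generatorURL;

    // Parses the JSON array that Prometheus POSTs to /api/v1/alerts
    public static List<PrometheusAlert> parse(String body) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        return Arrays.asList(mapper.readValue(body, PrometheusAlert[].class));
    }
}

With this in place, the handler can iterate over the parsed alerts and route each instance/job to mail, IM, and so on.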
AlertManager
Alertmanager is the alerting component that ships with Prometheus. When an alert fires, Prometheus pushes the alert data to Alertmanager; once Alertmanager receives it, it applies its own routing rules and then sends out the notification.
global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m      # interval between notifications for the same group; alerts of the same group received within 5 minutes are batched into one notification
  repeat_interval: 1s     # how long to wait before repeating a notification for the same alert
  receiver: 'webhook'     # receiver to use
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://10.208.204.46:17210/test/alert2'   # webhook endpoint
The webhook receiver is used here, so Alertmanager pushes alert notifications to http://10.208.204.46:17210/test/alert2.
The payload structure is as follows:
{
"receiver": "webhook",
"status": "firing",
"alerts": [{
"status": "firing",
"labels": {},
"annotations": {
"description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
"summary": "Instance 10.208.204.46:19999 down"
},
"startsAt": "2018-06-19T17:25:54.143824172+08:00",
"endsAt": "0001-01-01T00:00:00Z",
" generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
},
{
"status": "firing",
"labels": {
"alertname": "InstanceDown",
"env": "dev",
"instance": "192.168.164.1:18093",
"job": "RMS-MS",
"serverity": "page"
},
"annotations": {
"description": "192.168.164.1 :18093 of job RMS-MS has been down for more than 5 minutes.",
"summary": "Instance 192.168.164.1:18093 down"
},
"startsAt": "2018-06-19T17:25:54.143824172+08:00",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0& g0.tab=1"
}
],
"groupLabels": {
"job": "RMS-MS"
},
"commonLabels": {
"alertname": "InstanceDown",
"env": "dev",
"job": "RMS-MS",
"serverity": "page"
},
"commonAnnotations": {},
"externalURL": "http://localhost.localdomain:9093",
"version": "4",
"groupKey": "{}:{job="RMS-MS"}"
}
If all three machines in a cluster go down, Alertmanager aggregates the information for all three into a single notification and sends it to the webhook endpoint.
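A minimal receiver sketch for this grouped payload (controller and DTO names here are my own assumptions, not from the original article; Spring Boot with Lombok is assumed, as in the earlier handler):

import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@Slf4j
@RestController
public class AlertmanagerWebhookController {

    // Hypothetical DTO covering the main fields of the notification shown above
    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class Notification {
        public String receiver;
        public String status;
        public List<Map<String, Object>> alerts;
        public Map<String, String> groupLabels;
        public Map<String, String> commonLabels;
        public String externalURL;
    }

    // Alertmanager POSTs one grouped notification per route/group to this endpoint
    @PostMapping("/test/alert2")
    public String receive(@RequestBody Notification notification) {
        log.info("group {} has {} alert(s), status={}",
                notification.groupLabels, notification.alerts.size(), notification.status);
        return "success";
    }
}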
Comparison
Feature | AlertManager | Custom alerting |
---|---|---|
Grouping | Bundles alerts that belong to the same group into one summary notification | Has to be built ourselves |
Inhibition | Once an alert has fired, notifications for other alerts caused by it are suppressed | Has to be built ourselves |
Silences | A simple mechanism to mute notifications for a given period of time | Has to be built ourselves |
Drawbacks | Not written in Java, so digging into its internals is harder for us | High development cost; quite rudimentary at first |
Advantages | Mature technology | - |
The recommendation is to use Alertmanager as the first gate for alert notifications, and then forward them to our own service via the webhook receiver.
Source
prometheus告警技术初探(一) | sharedCode https://www.shared-code.com/article/82