Prometheus AlertManager
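Author's note: the content of this blog was accumulated while the author was studying and building this setup. It draws on excellent blog posts and videos published online by various teachers, revised according to the author's own understanding. (Because it was pieced together from many sources during work and study, the original links can no longer be provided. Apologies for that.)
The author notes once more: the author is just a humble IT worker who hopes to learn from the teachers who produce original content.
Original source: 大米运维
Starting the Alertmanager service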
alert-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.15.3
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.15.3
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.15.3"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/alertmanager/config.yml
            - --storage.path=/alertmanager/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: alert-config
              mountPath: /etc/alertmanager
            - name: storage-volume
              mountPath: "/alertmanager/data"
              subPath: ""
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/alertmanager
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: alert-config
              mountPath: /etc/alertmanager
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: alert-config
          configMap:
            name: alert-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager
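configmap-reload is a simple binary that triggers a reload when Kubernetes ConfigMaps are updated. It watches the mounted volume directories and notifies the target process that the ConfigMap has changed. At present it only supports sending an HTTP request, but in the future it is expected to support sending operating system signals (for example SIGHUP) once Kubernetes supports pod PID namespaces.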
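To make the watch-and-notify idea concrete, here is a minimal sketch in Python. It is not how the sidecar is implemented: it compares directory digests on demand instead of using filesystem notifications, and the helper names dir_digest and check_and_notify are made up for this illustration. In the deployment above, the notify step would be the HTTP request to http://localhost:9093/-/reload. Here it only records that it was called.

```python
import hashlib
import os
import tempfile


def dir_digest(path):
    """Return a digest over the names and contents of the regular files in path."""
    h = hashlib.sha256()
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            h.update(name.encode())
            with open(full, "rb") as f:
                h.update(f.read())
    return h.hexdigest()


def check_and_notify(path, last_digest, notify):
    """Call notify() if the directory content changed; return the new digest."""
    current = dir_digest(path)
    if current != last_digest:
        notify()
    return current


# Demo: a temporary directory stands in for /etc/alertmanager.
calls = []
with tempfile.TemporaryDirectory() as d:
    cfg = os.path.join(d, "config.yml")
    with open(cfg, "w") as f:
        f.write("route: {receiver: default}\n")
    digest = dir_digest(d)
    # Nothing changed yet, so notify is not called.
    digest = check_and_notify(d, digest, lambda: calls.append("reload"))
    # Simulate a ConfigMap update, then check again.
    with open(cfg, "w") as f:
        f.write("route: {receiver: email}\n")
    digest = check_and_notify(d, digest, lambda: calls.append("reload"))

print(calls)  # ['reload']
```

The point of the sidecar is that Alertmanager itself does not watch its config file, so something else has to tell it to reload after the ConfigMap volume is refreshed.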
alert-conf.yaml
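apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-config
  namespace: kube-system
data:
  config.yml: |-
    global:
      # How long to wait without the alert being received again before declaring it resolved
      resolve_timeout: 5m
      # Mail delivery settings
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: ''
      smtp_auth_username: ''
      smtp_auth_password: ''  # authorization password
      smtp_hello: '163.com'
      smtp_require_tls: false
    # Root route that every incoming alert enters; it defines the dispatch policy
    route:
      # Labels by which incoming alerts are regrouped. For example, many alerts carrying cluster=A and alertname=LatencyHigh are batched into a single group
      group_by: ['alertname', 'cluster']
      # When a new alert group is created, wait at least group_wait before sending the first notification. This leaves time to collect several alerts for the same group and send them together.
      group_wait: 30s
      # After the first notification for a group has been sent, wait group_interval before notifying about new alerts added to that group.
      group_interval: 5m
      # If a notification has already been sent successfully, wait repeat_interval before sending it again.
      repeat_interval: 5m
      # Default receiver: an alert that is not matched by any child route is sent here
      receiver: default
      # All of the attributes above are inherited by every child route and can be overridden on each of them.
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    receivers:
    - name: 'default'
      email_configs:
      - to: '810553413@qq.com'
        send_resolved: true
    - name: 'email'
      email_configs:
      - to: '810553413@qq.com'
        send_resolved: true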
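The routing part of this config can be read as "first matching child route wins, otherwise use the root receiver". The sketch below illustrates that reading in Python. It is a simplification written for this post, not Alertmanager code: it handles one level of child routes and exact match comparisons only, and the function name pick_receiver is made up.

```python
def pick_receiver(labels, route):
    """Return the receiver for an alert with the given labels.

    Simplified: only one level of child routes and only exact `match`
    comparisons are handled (no match_re, no `continue`, no nesting).
    """
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(name) == value for name, value in wanted.items()):
            return child["receiver"]
    return route["receiver"]


# The routing tree from config.yml, reduced to the fields used here.
route = {
    "receiver": "default",
    "routes": [{"receiver": "email", "match": {"team": "node"}}],
}

memory_alert = {"alertname": "NodeMemoryUsage", "severity": "warning", "team": "node"}
cpu_alert = {"alertname": "NodeCPUUsage", "severity": "warning"}

print(pick_receiver(memory_alert, route))  # email
print(pick_receiver(cpu_alert, route))     # default
```

An alert only reaches the email receiver when its labels include team=node. Everything else stays with the default receiver.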
alert-pvc.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: alertmanager
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  #storageClassName: managed-nfs-storage  # storageClassName must match the one defined under volumeClaimTemplates in prometheus-statefulset.yaml
  nfs:
    server: 192.168.2.7
    path: /data/k8s
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  # Use your own dynamic PV
  #storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
alert-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
  selector:
    k8s-app: alertmanager
  type: NodePort
kubectl create -f alert-pvc.yaml
kubectl create -f alert-conf.yaml
kubectl create -f alert-deploy.yaml
kubectl create -f alert-svc.yaml
Startup errors
alert-deploy path problem
kubectl logs -f alertmanager-7d854bcbdf-7kh4k -n kube-system -c prometheus-alertmanager
Determine the cause of the failure from the error output:
level=info ts=2020-04-08T04:08:34.170520801Z caller=main.go:322 msg="Loading configuration file" file=/etc/config/alertmanager.yml
level=error ts=2020-04-08T04:08:34.170585226Z caller=main.go:325 msg="Loading configuration file failed" file=/etc/config/alertmanager.yml err="open /etc/config/alertmanager.yml: no such file or directory"
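The log shows Alertmanager looking for /etc/config/alertmanager.yml, while the manifests in this post mount the alert-config ConfigMap at /etc/alertmanager with the key config.yml. The value of --config.file has to point at a file that really exists under the mount. The sketch below expresses that check in Python. It assumes the failing run used the path printed in the log, and the helper name config_file_is_mounted is made up for this illustration.

```python
import posixpath


def config_file_is_mounted(config_file, mount_path, configmap_keys):
    """True if config_file is one of the ConfigMap keys projected into mount_path."""
    directory, filename = posixpath.split(config_file)
    same_dir = posixpath.normpath(directory) == posixpath.normpath(mount_path)
    return same_dir and filename in configmap_keys


keys = ["config.yml"]         # data keys of the alert-config ConfigMap
mount = "/etc/alertmanager"   # mountPath of the alert-config volume

# Path taken from the error log versus the path used in alert-deploy.yaml.
broken = config_file_is_mounted("/etc/config/alertmanager.yml", mount, keys)
fixed = config_file_is_mounted("/etc/alertmanager/config.yml", mount, keys)
print(broken, fixed)  # False True
```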
alert-pvc creation problem
default-scheduler pod has unbound immediate PersistentVolumeClaims (repeated 2 times)
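This event means the claim has not been bound to any volume yet. As a rough mental model, a PersistentVolume can satisfy a claim only if it is large enough, offers the requested access modes, and has the same storage class. The Python sketch below encodes just those three conditions. It is an illustration with made-up field names, not the Kubernetes binding logic, and it ignores other possible causes such as the PV not existing yet, an unreachable NFS export, or a default StorageClass being assigned to the claim automatically.

```python
def can_bind(pv, pvc):
    """Rough check of whether a PersistentVolume could satisfy a claim."""
    enough_space = pv["capacity_gi"] >= pvc["request_gi"]
    modes_ok = set(pvc["access_modes"]) <= set(pv["access_modes"])
    same_class = pv.get("storage_class", "") == pvc.get("storage_class", "")
    return enough_space and modes_ok and same_class


# Values taken from alert-pvc.yaml, where storageClassName is commented out on both sides.
pv = {"capacity_gi": 2, "access_modes": ["ReadWriteOnce"]}
pvc = {"request_gi": 2, "access_modes": ["ReadWriteOnce"]}
print(can_bind(pv, pvc))  # True

# A claim that names a storage class cannot bind to a PV without one.
pvc_with_class = dict(pvc, storage_class="managed-nfs-storage")
print(can_bind(pv, pvc_with_class))  # False
```

This is why the storageClassName lines in the PV and the PVC should be enabled or commented out together.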
alert-svc configuration problem
The Service "alertmanager" is invalid: spec.ports[1].name: Duplicate value: "http"
Configuring communication between Prometheus and Alertmanager
Edit the prometheus-configmap.yaml configuration file and add the Alertmanager target
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:80"]
[root@k8s-master prometheus]# kubectl get svc --all-namespaces
NAMESPACE     NAME           TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
kube-system   alertmanager   NodePort   10.110.206.170   <none>        80:30199/TCP     72m
kube-system   prometheus     NodePort   10.111.194.39    <none>        9090:31611/TCP   93d
Alertmanager console
Prometheus console
Check whether the configuration has taken effect
Configuring alerts
Edit prometheus.configmap.yaml and add the alerting rule settings
# Added: tell Prometheus which rule files to load
rule_files:
  - /etc/config/rules/*.rules
kubectl apply -f prometheus.configmap.yaml
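Fault: Prometheus unreachable
If Prometheus becomes unreachable, troubleshoot patiently. One approach is to first mount the rules directory into the container through the Deployment, and only then add the alerting configuration.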
Editing the alert rules
prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: usage of partition {{ $labels.mountpoint }} is too high"
          description: "{{ $labels.instance }}: usage of partition {{ $labels.mountpoint }} is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 20
        for: 1m
        labels:
          severity: warning
          team: node
        annotations:
          summary: "Instance {{ $labels.instance }}: memory usage is too high"
          description: "{{ $labels.instance }}: memory usage is above 20% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: CPU usage is too high"
          description: "{{ $labels.instance }}: CPU usage is above 60% (current value: {{ $value }})"
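To see what the NodeMemoryUsage expression computes, here is the same arithmetic in Python with made-up sample values (an 8 GiB node with 1 GiB free, 2 GiB cached and 1 GiB of buffers). The numbers are only for illustration and do not come from the cluster in this post.

```python
def memory_usage_percent(free, cached, buffers, total):
    """Same arithmetic as the NodeMemoryUsage expression: used memory in percent."""
    return 100 - (free + cached + buffers) / total * 100


GiB = 1024 ** 3
usage = memory_usage_percent(free=1 * GiB, cached=2 * GiB, buffers=1 * GiB, total=8 * GiB)

print(usage)       # 50.0
print(usage > 20)  # True, so the rule condition holds for this sample
```

Free, cached and buffer memory are all counted as available, so only the remaining share of the total is reported as used.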
[root@k8s-master prometheus]# kubectl get cm --all-namespaces
NAMESPACE     NAME                                 DATA   AGE
kube-public   cluster-info                         1      152d
kube-system   alert-config                         1      9h
kube-system   coredns                              1      152d
kube-system   extension-apiserver-authentication   6      152d
kube-system   kube-flannel-cfg                     2      152d
kube-system   kube-proxy                           2      152d
kube-system   kubeadm-config                       2      152d
kube-system   kubelet-config-1.16                  1      152d
kube-system   prometheus-blackbox-exporter         1      29d
kube-system   prometheus-config                    1      15m
kube-system   prometheus-rules                     2      6h58m
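Mounting the ConfigMap into the container's rules directory
Change the mount points in the previously deployed prometheus.deploy, which uses a dynamic PV
volumeMounts:
  # Added: name of the volume that carries the rules ConfigMap
  - name: prometheus-rules
    mountPath: /etc/config/rules
    subPath: ""
volumes:
  # Added: volume named prometheus-rules
  - name: prometheus-rules
    # Added: the configuration source
    configMap:
      # Added: name of the ConfigMap
      name: prometheus-rules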
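Each data key of the prometheus-rules ConfigMap shows up as a file under the mountPath, and the rule file pattern in the Prometheus config has to match those file names. The Python sketch below checks that relationship for the names used in this post. It is only an illustration: the helper name matched_rule_files is made up, and Python's fnmatch is used as a stand-in for the globbing Prometheus performs (unlike a real path glob, fnmatch lets * match a slash, which does not matter for these names).

```python
import fnmatch
import posixpath


def matched_rule_files(mount_path, configmap_keys, pattern):
    """Return the mounted file paths that the rule file pattern would select."""
    paths = [posixpath.join(mount_path, key) for key in configmap_keys]
    return [p for p in paths if fnmatch.fnmatchcase(p, pattern)]


files = matched_rule_files(
    "/etc/config/rules",              # mountPath of the prometheus-rules volume
    ["general.rules", "node.rules"],  # data keys of the prometheus-rules ConfigMap
    "/etc/config/rules/*.rules",      # pattern from the Prometheus config
)
print(files)

# A key with a different extension would be mounted but never loaded.
skipped = matched_rule_files("/etc/config/rules", ["node.yml"], "/etc/config/rules/*.rules")
print(skipped)  # []
```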
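After changing the config file, Prometheus must be restarted
Create the ConfigMap and update the PV
kubectl apply -f prometheus-rules.yaml
# If updating prometheus.deploy fails, delete it first and then apply it again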
kubectl delete -f prometheus.deploy.yaml
kubectl apply -f prometheus.deploy.yaml
Storage server
[root@localhost ~]# cat /etc/exports
/data/k8s 192.168.2.0/24(rw,no_root_squash,sync)
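Viewing the alert rules
Open the Alerts page of the Prometheus console
The page now shows the alert rules we just defined, together with the state of each alert. During its life cycle an alert is in one of three states:
- inactive: the alert is neither firing nor pending
- pending: the expression is true, but has not yet been true for the duration set in 'for'
- firing: the expression has been true for at least the duration set in 'for'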
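The three states can be written down as a small function of how long the rule expression has been true. This Python sketch is only a restatement of the list above, with a made-up function name and timestamps in seconds.

```python
def alert_state(condition_true_since, now, for_seconds):
    """Classify an alert from the time its expression became true.

    condition_true_since is None when the expression is currently false.
    """
    if condition_true_since is None:
        return "inactive"
    if now - condition_true_since < for_seconds:
        return "pending"
    return "firing"


FOR = 60  # corresponds to "for: 1m" in the rules of this post

print(alert_state(None, now=100, for_seconds=FOR))  # inactive
print(alert_state(70, now=100, for_seconds=FOR))    # pending (true for 30s)
print(alert_state(40, now=100, for_seconds=FOR))    # firing (true for 60s)
```

Only alerts in the firing state are sent on to Alertmanager, which is why a rule with for: 1m needs at least a minute before any mail can arrive.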
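Simulating an alert
Modify the alert rules in prometheus-rules.yaml
After the change, hot-reload with kubectl apply -f prometheus-rules.yaml
Receiving alerts
Receiving by email
Email cannot be sent
After restarting DNS and then the Pod, email delivery recovered. The root cause is unknown.
Notes
The team: node label must be identical in the route and in the rule. Otherwise the alert does not match the email route and falls through to the default receiver.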
routes:
- receiver: email
  group_wait: 10s
  match:
    team: node
- alert: NodeMemoryUsage
  expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 20
  for: 1m
  labels:
    severity: warning
    team: node
  annotations:
    summary: "Instance {{ $labels.instance }}: memory usage is too high"
    description: "{{ $labels.instance }}: memory usage is above 20% (current value: {{ $value }})"