  • prometheus.(8).AlertManager

    Prometheus AlertManager

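    Author's note: this post collects what I learned while studying and building this setup. The material is drawn from excellent blogs and videos by various teachers online, adapted according to my own understanding. (It was pieced together during work and study, so I cannot provide the original links; my apologies.)

    Author's second note: I am just a very ordinary IT worker who hopes to learn from the teachers who produce original content.

    Original source: 大米运维

    Starting the Alertmanager service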

    alert-deploy.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: alertmanager
      namespace: kube-system
      labels:
        k8s-app: alertmanager
        kubernetes.io/cluster-service: "true"
        addonmanager.kubernetes.io/mode: Reconcile
    spec:
      replicas: 1
      selector:
        matchLabels:
          k8s-app: alertmanager
          version: v0.15.3
      template:
        metadata:
          labels:
            k8s-app: alertmanager
            version: v0.15.3
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ''
        spec:
          priorityClassName: system-cluster-critical
          containers:
            - name: prometheus-alertmanager
              image: "prom/alertmanager:v0.15.3"
              imagePullPolicy: "IfNotPresent"
              args:
                - --config.file=/etc/alertmanager/config.yml
                - --storage.path=/alertmanager/data
                - --web.external-url=/
              ports:
                - containerPort: 9093
              readinessProbe:
                httpGet:
                  path: /#/status
                  port: 9093
                initialDelaySeconds: 30
                timeoutSeconds: 30
              volumeMounts:
                - name: alert-config
                  mountPath: /etc/alertmanager
                - name: storage-volume
                  mountPath: "/alertmanager/data"
                  subPath: ""
              resources:
                limits:
                  cpu: 10m
                  memory: 50Mi
                requests:
                  cpu: 10m
                  memory: 50Mi
            - name: prometheus-alertmanager-configmap-reload
              image: "jimmidyson/configmap-reload:v0.1"
              imagePullPolicy: "IfNotPresent"
              args:
                - --volume-dir=/etc/alertmanager
                - --webhook-url=http://localhost:9093/-/reload
              volumeMounts:
                - name: alert-config
                  mountPath: /etc/alertmanager
                  readOnly: true
              resources:
                limits:
                  cpu: 10m
                  memory: 10Mi
                requests:
                  cpu: 10m
                  memory: 10Mi
          volumes:
            - name: alert-config
              configMap:
                name: alert-config
            - name: storage-volume
              persistentVolumeClaim:
                claimName: alertmanager
    

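    configmap-reload is a simple binary that triggers a reload when Kubernetes ConfigMaps are updated. It watches the mounted volume directory and notifies the target process that the ConfigMap has changed. It currently only supports sending an HTTP request, but in future it is expected to support sending OS signals (for example SIGHUP) once Kubernetes supports pod PID namespaces.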
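The watch-and-notify idea described above can be sketched in a few lines. This is only an illustration, not the configmap-reload implementation: it polls the directory and compares content hashes instead of using filesystem notifications, and the helper names (`dir_fingerprint`, `changed`, `notify`, `watch`) are hypothetical. The only detail taken from the manifest is the webhook URL, `http://localhost:9093/-/reload`.

```python
import hashlib
import os
import time
import urllib.request


def dir_fingerprint(path):
    """Hash the paths and contents of every file under path."""
    digest = hashlib.sha256()
    for root, _dirs, files in sorted(os.walk(path)):
        for name in sorted(files):
            full = os.path.join(root, name)
            digest.update(full.encode("utf-8"))
            with open(full, "rb") as fh:
                digest.update(fh.read())
    return digest.hexdigest()


def changed(previous, path):
    """Return (has_changed, new_fingerprint) for the directory."""
    current = dir_fingerprint(path)
    return current != previous, current


def notify(webhook_url):
    """Send an empty POST to the reload webhook and return the HTTP status."""
    req = urllib.request.Request(webhook_url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


def watch(path, webhook_url, interval=5.0, max_checks=None):
    """Poll path and call the webhook whenever its contents change."""
    previous = dir_fingerprint(path)
    checks = 0
    while max_checks is None or checks < max_checks:
        time.sleep(interval)
        is_changed, previous = changed(previous, path)
        if is_changed:
            notify(webhook_url)
        checks += 1
```

With the values from the sidecar's arguments, a call would look like `watch("/etc/alertmanager", "http://localhost:9093/-/reload")`.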

    alert-conf.yaml

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: alert-config
      namespace: kube-system
    data:
      config.yml: |-
        global:
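          # Time after which an alert is declared resolved if no further alert is received
          resolve_timeout: 5m
          # Mail delivery settings
          smtp_smarthost: 'smtp.163.com:25'
          smtp_from: ''
          smtp_auth_username: ''
          smtp_auth_password: '' # authorization password
          smtp_hello: '163.com'
          smtp_require_tls: false
        # Root route that every incoming alert enters; it defines the dispatch policy
        route:
          # The labels listed here are used to regroup incoming alerts. For example, many alerts carrying the labels cluster=A and alertname=LatencyHigh are batched into a single group
          group_by: ['alertname', 'cluster']
          # When a new alert group is created, wait at least group_wait before sending the initial notification. This leaves time to collect several alerts for the same group and send them together.
          group_wait: 30s
    
          # After the first notification for a group has been sent, wait group_interval before sending a notification about new alerts in that group.
          group_interval: 5m
    
          # If a notification has already been sent successfully, wait repeat_interval before sending it again
          repeat_interval: 5m
    
          # Default receiver: an alert that is not matched by any child route is sent here
          receiver: default
    
          # All of the attributes above are inherited by every child route and can be overridden on each one.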
          routes:
          - receiver: email
            group_wait: 10s
            match:
              team: node
        receivers:
        - name: 'default'
          email_configs:
          - to: '810553413@qq.com'
            send_resolved: true
        
        - name: 'email'
          email_configs:
          - to: '810553413@qq.com'
            send_resolved: true
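
To make the effect of `group_by: ['alertname', 'cluster']` concrete, here is a small sketch of how alerts would fall into groups. It is a simplified model, not Alertmanager's code: the function names are hypothetical, and treating a missing label as an empty value is an assumption of this sketch.

```python
from collections import defaultdict

# Labels used for grouping, copied from the route in config.yml.
GROUP_BY = ("alertname", "cluster")


def group_key(labels, group_by=GROUP_BY):
    """Build the grouping key from the group_by labels of one alert."""
    return tuple(labels.get(name, "") for name in group_by)


def group_alerts(alerts, group_by=GROUP_BY):
    """Collect alerts (label dicts) that share the same grouping key."""
    groups = defaultdict(list)
    for labels in alerts:
        groups[group_key(labels, group_by)].append(labels)
    return dict(groups)
```

Two `LatencyHigh` alerts from different instances in cluster `A` end up in one group and therefore in one notification, while the same alert name in cluster `B` forms a separate group.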
    

    alert-pvc.yaml

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: alertmanager
    spec:
      capacity:
        storage: 2Gi
      accessModes:
      - ReadWriteOnce
      persistentVolumeReclaimPolicy: Recycle
      #storageClassName: managed-nfs-storage  # storageClassName must match the one defined under volumeClaimTemplates in prometheus-statefulset.yaml
      nfs:
        server: 192.168.2.7
        path: /data/k8s
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: alertmanager
      namespace: kube-system
      labels:
        kubernetes.io/cluster-service: "true"
        addonmanager.kubernetes.io/mode: EnsureExists
    spec:
      # Use your own dynamic PV
      #storageClassName: managed-nfs-storage 
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: "2Gi"
    

    alert-svc.yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: alertmanager
      namespace: kube-system
      labels:
        kubernetes.io/cluster-service: "true"
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/name: "Alertmanager"
    spec:
      ports:
        - name: http
          port: 80
          protocol: TCP
          targetPort: 9093
      selector:
        k8s-app: alertmanager
      type: NodePort
    
    kubectl create -f alert-pvc.yaml
    kubectl create -f alert-conf.yaml 
    kubectl create -f alert-deploy.yaml 
    kubectl create -f alert-svc.yaml 
    

    Startup errors

    alert-deploy: path problem

    kubectl logs -f alertmanager-7d854bcbdf-7kh4k -n kube-system -c prometheus-alertmanager
    

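    Determine the cause of the failure from the error message. In the log below, Alertmanager was started with --config.file=/etc/config/alertmanager.yml, but the ConfigMap is mounted at /etc/alertmanager and its key is config.yml, so the flag has to be --config.file=/etc/alertmanager/config.yml, as in alert-deploy.yaml above.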

    level=info ts=2020-04-08T04:08:34.170520801Z caller=main.go:322 msg="Loading configuration file" file=/etc/config/alertmanager.yml
    level=error ts=2020-04-08T04:08:34.170585226Z caller=main.go:325 msg="Loading configuration file failed" file=/etc/config/alertmanager.yml err="open /etc/config/alertmanager.yml: no such file or directory"
    

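    alert-pvc: creation problem

    The scheduler event below means the pod's PersistentVolumeClaim has not been bound to a PersistentVolume yet. Create alert-pvc.yaml before the deployment and check that the PV's capacity, access modes and storage class match the claim.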

    default-scheduler  pod has unbound immediate PersistentVolumeClaims (repeated 2 times)
    

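    alert-svc: configuration problem

    The error below means two entries under spec.ports used the same name; every port in a Service needs a unique name.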

    The Service "alertmanager" is invalid: spec.ports[1].name: Duplicate value: "http"
    

    Configuring Prometheus to talk to Alertmanager

    Edit the prometheus-configmap.yaml configuration file and add the Alertmanager binding

    alerting:
      alertmanagers:
        - static_configs:
          - targets: ["alertmanager:80"]
    
    [root@k8s-master prometheus]# kubectl get svc --all-namespaces
    NAMESPACE     NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
    kube-system   alertmanager                   NodePort    10.110.206.170   <none>        80:30199/TCP             72m
    kube-system   prometheus                     NodePort    10.111.194.39    <none>        9090:31611/TCP           93d
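
Besides looking at the console, one way to confirm that Prometheus has picked up the `alerting` section is to query its `/api/v1/alertmanagers` endpoint. The sketch below assumes Prometheus is reachable at a base URL you supply (for example a node address plus the NodePort 31611 shown above); the function names are hypothetical.

```python
import json
import urllib.request


def active_alertmanager_urls(payload):
    """Extract the URLs of active Alertmanagers from the API response."""
    if payload.get("status") != "success":
        return []
    data = payload.get("data", {})
    return [am["url"] for am in data.get("activeAlertmanagers", [])]


def fetch_alertmanagers(prometheus_base_url):
    """GET /api/v1/alertmanagers from Prometheus and decode the JSON body."""
    url = prometheus_base_url.rstrip("/") + "/api/v1/alertmanagers"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

An empty list from `active_alertmanager_urls(fetch_alertmanagers(base_url))` suggests the `alerting` block was not loaded or the target `alertmanager:80` could not be resolved.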
    

    Alertmanager console

    [screenshot]

    Prometheus console

    Check that the configuration has taken effect

    [screenshot]

    Configuring alerts

    Edit prometheus.configmap.yaml and add the rule file setting

        # Added: tell Prometheus where to read rule files from
        rule_files:
        - /etc/config/rules/*.rules
    
    
    kubectl apply -f prometheus.configmap.yaml
    

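    Troubleshooting: Prometheus unreachable

    If Prometheus becomes unreachable after this change, work through it patiently. One approach is to first mount the rules ConfigMap into the container's rules directory in the deployment, and only then add the rule file setting.

    [screenshot]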

    Writing the alert rules

    prometheus-rules.yaml

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-rules
      namespace: kube-system
    data:
      general.rules: |
        groups:
        - name: general.rules
          rules:
          - alert: InstanceDown
            expr: up == 0
            for: 1m
            labels:
              severity: error
            annotations:
              summary: "Instance {{ $labels.instance }} is down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      node.rules: |
        groups:
        - name: node.rules
          rules:
          - alert: NodeFilesystemUsage
            expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "Instance {{ $labels.instance }}: filesystem usage on {{ $labels.mountpoint }} is too high"
              description: "{{ $labels.instance }}: usage on {{ $labels.mountpoint }} is above 80% (current value: {{ $value }})"
    
          - alert: NodeMemoryUsage
            expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 20
            for: 1m
            labels:
              severity: warning
              team: node
            annotations:
              summary: "Instance {{ $labels.instance }} memory usage is too high"
              description: "{{ $labels.instance }} memory usage is above 20% (current value: {{ $value }})"
    
          - alert: NodeCPUUsage
            expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "Instance {{ $labels.instance }} CPU usage is too high"
              description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"
    
    [root@k8s-master prometheus]# kubectl get cm --all-namespaces
    NAMESPACE     NAME                                 DATA   AGE
    kube-public   cluster-info                         1      152d
    kube-system   alert-config                         1      9h
    kube-system   coredns                              1      152d
    kube-system   extension-apiserver-authentication   6      152d
    kube-system   kube-flannel-cfg                     2      152d
    kube-system   kube-proxy                           2      152d
    kube-system   kubeadm-config                       2      152d
    kube-system   kubelet-config-1.16                  1      152d
    kube-system   prometheus-blackbox-exporter         1      29d
    kube-system   prometheus-config                    1      15m
    kube-system   prometheus-rules                     2      6h58m
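
The NodeMemoryUsage expression is easier to sanity-check with concrete numbers. The sketch below reproduces its arithmetic in plain Python; the function names and the sample values are made up for illustration, and only the formula and the threshold of 20 come from the rule.

```python
def memory_usage_percent(mem_free, cached, buffers, mem_total):
    """Mirror of: 100 - (MemFree + Cached + Buffers) / MemTotal * 100."""
    reclaimable = mem_free + cached + buffers
    return 100 - reclaimable / mem_total * 100


def should_alert(usage_percent, threshold=20):
    """The rule fires when usage is strictly greater than the threshold."""
    return usage_percent > threshold
```

For a hypothetical node with 8 GiB of memory, 1 GiB free, 2 GiB cached and 1 GiB in buffers, half of the memory counts as used. That exceeds the rule's threshold of 20, but would not exceed a threshold of 80.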
    

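    Mounting the ConfigMap into the container's rules directory

    Adjust the mounts in the previously deployed prometheus.deploy (which uses a dynamic PV)

    volumeMounts:
      # Added: mount the rules volume into the rules directory
      - name: prometheus-rules
        mountPath: /etc/config/rules
        subPath: ""
    
    volumes:
      # Added: volume for the rules
      - name: prometheus-rules
        # Added: backed by a ConfigMap
        configMap:
          # Added: name of the ConfigMap
          name: prometheus-rules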
    

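    Changing the config file requires restarting Prometheus

    # Create the ConfigMap and update the deployment (which uses the PV)
    kubectl apply -f prometheus-rules.yaml
    # If updating prometheus.deploy fails, delete it first and then re-create it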
    kubectl delete -f prometheus.deploy.yaml
    kubectl apply -f prometheus.deploy.yaml 
    

    Storage server

    [root@localhost ~]# cat /etc/exports
    /data/k8s  192.168.2.0/24(rw,no_root_squash,sync)
    

    Viewing the alert rules

    [screenshot]

    Opening the Alerts page

    [screenshot]

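    The page now shows the alert rules we just defined, together with a state for each one. During its lifecycle an alert is in one of the following three states:

    • inactive: the alert is neither firing nor pending, that is, its expression is currently not true
    • pending: the expression is true, but has not yet stayed true for the duration set in `for`
    • firing: the expression has stayed true for at least the duration set in `for`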
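The three states can be summarised as a small decision function. This is a simplified model for illustration (the function name and arguments are hypothetical); in Prometheus the check only happens at each rule evaluation, so the real transition can lag by up to one evaluation interval.

```python
def alert_state(condition_true_for, for_duration):
    """Classify an alert.

    condition_true_for: seconds the expression has been continuously true,
                        or None if it is currently false.
    for_duration:       the rule's `for` value in seconds.
    """
    if condition_true_for is None:
        return "inactive"
    if condition_true_for < for_duration:
        return "pending"
    return "firing"
```

With `for: 1m` as in the rules above, an expression that has been true for 30 seconds is pending, and it becomes firing once it has been true for 60 seconds.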

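    Simulating an alert

    Edit the alert rules in prometheus-rules.yaml, for example by lowering a threshold so that the expression becomes true

    After editing, apply the change as a hot update: kubectl apply -f prometheus-rules.yaml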
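A different way to exercise only the notification path, without touching the rules, is to push a synthetic alert straight to Alertmanager. This is an alternative to the approach above, not what the original setup did. The sketch assumes the v1 HTTP API (`/api/v1/alerts`) of the Alertmanager version used here and a base URL you supply yourself; the function names and the label values other than `team: node` are illustrative.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(team="node", instance="test-instance"):
    """Build a one-element alert list in the shape the v1 API accepts."""
    return [{
        "labels": {
            "alertname": "NodeMemoryUsage",
            "severity": "warning",
            "team": team,
            "instance": instance,
        },
        "annotations": {"summary": "Synthetic test alert"},
        # RFC 3339 timestamp in UTC
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def post_alerts(alertmanager_base_url, alerts):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        alertmanager_base_url.rstrip("/") + "/api/v1/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Because the synthetic alert carries `team: node`, it should be routed to the `email` receiver in the same way as a real NodeMemoryUsage alert.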

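    Alert received

    [screenshot]

    Email received

    [screenshot]

    Email could not be sent

    I restarted DNS and then the pod, after which email delivery recovered. The root cause is unknown.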

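    Notes

    The team: node label must be identical in the route's match and in the rule's labels. Otherwise the alert does not match this route and falls through to the default receiver instead of the email receiver.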

      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 20
        for: 1m
        labels:
          severity: warning
          team: node
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 20% (current value: {{ $value }})"
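
The matching rule in this note can be modelled in a few lines. The sketch below is a simplified model of route selection, not Alertmanager's implementation: it only handles exact `match` entries, assumes every route names a receiver, and returns the first matching child. The `ROUTE` dictionary mirrors the route from config.yml; `pick_receiver` is a hypothetical name.

```python
# Route tree copied from config.yml (timing settings omitted).
ROUTE = {
    "receiver": "default",
    "routes": [
        {"receiver": "email", "match": {"team": "node"}},
    ],
}


def pick_receiver(labels, route=ROUTE):
    """Return the receiver of the first child route whose match fits."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            return pick_receiver(labels, child)
    return route["receiver"]
```

A label that differs in any way, for example `team: nodes` or a key spelled `Team`, does not match, so the alert is handled by the `default` receiver.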
    
  • Original URL: https://www.cnblogs.com/orange-lsc/p/12825591.html