  • Kubernetes (k8s): installing and deploying Prometheus + Grafana for monitoring and alerting

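    Host data collection

    Collecting data from each host is the foundation of cluster monitoring: external modules gather and analyze the data collected on every host, and from that they can monitor the whole cluster and raise alerts. Host data is usually collected and exposed with tools such as cAdvisor and node-exporter.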

    cAdvisor

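    Overview

    In the Kubernetes ecosystem, cAdvisor is the agent that collects container monitoring data and is deployed on every node. Its internal code structure is roughly as shown in the figure below. The code is well structured, and the collector and storage parts can largely be extended incrementally.

     
    cAdvisor.png

    cAdvisor supports custom metrics through labels set when a container is deployed: a label whose key starts with io.cadvisor.metric. and whose value points to a configuration file for the custom metrics, in a form such as: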

    {
      "endpoint" : {
        "protocol": "https",
        "port": 8000,
        "path": "/nginx_status"
      },
      "metrics_config"  : [
        { "name" : "activeConnections",
          "metric_type" : "gauge",
          "units" : "number of active connections",
          "data_type" : "int",
          "polling_frequency" : 10,
          "regex" : "Active connections: ([0-9]+)"
        },
        { "name" : "reading",
          "metric_type" : "gauge",
          "units" : "number of reading connections",
          "data_type" : "int",
          "polling_frequency" : 10,
          "regex" : "Reading: ([0-9]+) .*"
        },
        { "name" : "writing",
          "metric_type" : "gauge",
          "data_type" : "int",
          "units" : "number of writing connections",
          "polling_frequency" : 10,
          "regex" : ".*Writing: ([0-9]+).*"
        },
        { "name" : "waiting",
          "metric_type" : "gauge",
          "units" : "number of waiting connections",
          "data_type" : "int",
          "polling_frequency" : 10,
          "regex" : ".*Waiting: ([0-9]+)"
        }
      ]
    }
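
    As a rough illustration of what the regex entries in this configuration extract, the sketch below applies the same four patterns to a sample nginx status page. The sample text and the use of Python's re.search are assumptions made for illustration; cAdvisor's own matching logic is not reproduced here.

```python
import re

# Regexes copied from the metrics_config above.
METRIC_PATTERNS = {
    "activeConnections": r"Active connections: ([0-9]+)",
    "reading": r"Reading: ([0-9]+) .*",
    "writing": r".*Writing: ([0-9]+).*",
    "waiting": r".*Waiting: ([0-9]+)",
}

# Hypothetical response body from the /nginx_status endpoint.
SAMPLE_STATUS = (
    "Active connections: 3 \n"
    "server accepts handled requests\n"
    " 10 10 25 \n"
    "Reading: 0 Writing: 1 Waiting: 2 \n"
)


def parse_status(text):
    """Return {metric name: int value} for every pattern that matches."""
    values = {}
    for name, pattern in METRIC_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            values[name] = int(match.group(1))
    return values


print(parse_status(SAMPLE_STATUS))
```

    Each pattern has exactly one capture group, and the captured digits become the gauge value for that metric.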
    

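    Since Kubernetes 1.6, however, cAdvisor has been integrated into Kubernetes and can be enabled through parameters when Kubernetes is installed.

    At present cAdvisor only supports collection over HTTP, which means the monitored containerized application must expose an HTTP endpoint, so its capabilities are fairly limited. Extending the collector layer with standard monitoring modes for databases, message queues, and similar common applications would be valuable. Until then, the other option is to pair it with Prometheus as shown in the figure (its ecosystem of plugins covers most of the collection an APM needs), although this tends to make the system more complex if the application layer does not otherwise want to use Prometheus.

    In the Kubernetes monitoring ecosystem, the components are typically combined as follows:


     
    cAdvisor-promethus.png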

    Node-exporter

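    Overview

    node-exporter runs on each node, collects the host's own CPU, memory, and other usage data, and exposes that performance information to external consumers.

    Deployment

    The following is the deployment manifest for node-exporter on Kubernetes: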

    apiVersion: v1
    kind: Service
    metadata:
      annotations:
        prometheus.io/scrape: 'true'
      labels:
        app: node-exporter
        name: node-exporter
      name: node-exporter
      namespace: kube-system
    spec:
      clusterIP: None
      ports:
      - name: scrape
        port: 9100
        protocol: TCP
      selector:
        app: node-exporter
      type: ClusterIP
    ---
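    # apps/v1 replaces extensions/v1beta1, which was removed in Kubernetes 1.16.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
      namespace: kube-system
    spec:
      # apps/v1 requires an explicit selector matching the pod template labels.
      selector:
        matchLabels:
          app: node-exporter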
      template:
        metadata:
          labels:
            app: node-exporter
          name: node-exporter
        spec:
          containers:
          - image: prom/node-exporter:latest
            name: node-exporter
            ports:
            - containerPort: 9100
              hostPort: 9100
              name: scrape
          hostNetwork: true
          hostPID: true
          restartPolicy: Always
    

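    Monitoring

    Monitoring data for Kubernetes is generally collected in one of two ways: PULL or PUSH. With PULL, the monitoring platform actively fetches the collected host information from the hosts in the cluster; with PUSH, each host sends the information it has collected to the monitoring platform. Prometheus, a commonly used monitoring platform, collects host information using the PULL model.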

    Prometheus

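    Overview

    Prometheus is a system monitoring and alerting toolkit inspired by Google's Borgmon and written in Go. Its basic principle is to periodically scrape the state of monitored components over HTTP (the pull model). The benefit is that any component can be plugged into the monitoring system simply by exposing an HTTP endpoint, with no SDK or other integration work required.

    This makes it a good fit for virtualized environments such as VMs or Docker, so it is one of the few monitoring systems well suited to Docker, Mesos, and Kubernetes, and many people call it a next-generation monitoring system.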
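
    To make the "just expose an HTTP endpoint" idea concrete, here is a minimal sketch of a component that serves metrics in the Prometheus text format using only the Python standard library. The metric names, sample values, and port 8000 are hypothetical, and a real service would normally use an official client library instead of hand-rolled formatting.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical samples, keyed by (metric name, tuple of (label, value) pairs).
SAMPLES = {
    ("demo_requests_total", (("path", "/"),)): 3,
    ("demo_up", ()): 1,
}


def render_metrics(samples):
    """Render samples as one 'name{labels} value' line per sample."""
    lines = []
    for (name, labels), value in sorted(samples.items()):
        if labels:
            label_text = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics; call this from your own entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

    Calling serve() starts the endpoint; Prometheus would then scrape http://<host>:8000/metrics on its own schedule.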

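    Features

    • A custom multi-dimensional data model
    • Very efficient storage: a sample takes about 3.5 bytes on average; 3.2 million time series sampled every 30 seconds and retained for 60 days use roughly 228G of disk.
    • A powerful query language
    • Easy data visualization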
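
    The storage figures above can be checked with a back-of-the-envelope calculation. The sketch below assumes every series is present for the entire retention period and that each sample costs exactly the quoted average; both are simplifying assumptions, so treat the result as a rough upper bound.

```python
BYTES_PER_SAMPLE = 3.5        # average quoted above
SERIES = 3_200_000            # number of time series
SCRAPE_INTERVAL_SECONDS = 30
RETENTION_DAYS = 60


def estimate_storage_bytes(series, interval_seconds, retention_days, bytes_per_sample):
    """Naive estimate: series x samples per series x bytes per sample."""
    samples_per_series = retention_days * 24 * 60 * 60 // interval_seconds
    return series * samples_per_series * bytes_per_sample


total_bytes = estimate_storage_bytes(
    SERIES, SCRAPE_INTERVAL_SECONDS, RETENTION_DAYS, BYTES_PER_SAMPLE
)
print(f"{total_bytes / 1e9:.0f} GB")
```

    This naive bound comes out well above the 228G quoted above. That suggests the quoted figure was measured on a workload where not every series existed for the whole 60 days, or where samples compressed better than the average; this reading is a guess, so use the formula as a sizing aid only.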

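    Advantages

    • Very few external dependencies; very simple to install and use
    • Integrations already exist for many systems, for example Docker, HAProxy, Nginx, and JMX
    • Automatic service discovery
    • Can be integrated directly into application code
    • Designed with distributed, microservice architectures in mind

    Components

    • Prometheus server
      Responsible mainly for scraping and storing data; provides the PromQL query language;
    • Push Gateway
      An intermediary gateway that lets short-lived jobs push their metrics;
    • Exporters
      An HTTP endpoint that exposes information about a monitored component is called an exporter. Exporters are readily available for most components commonly used at internet companies, such as Varnish, HAProxy, Nginx, MySQL, and Linux system information (disk, memory, CPU, network, and so on);
    • PromDash
      A dashboard built with Rails for visualizing metric data;
    • WebUI
      Graphing functionality served on port 9090;
    • Alertmanager
      Handles alerting;
    • API clients
      Provide access to the HTTP API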
     
    Prometheus

    Deployment

    • Install RBAC
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
    rules:
    - apiGroups: [""]
      resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
      verbs: ["get", "list", "watch"]
    - apiGroups:
      - extensions
      resources:
      - ingresses
      verbs: ["get", "list", "watch"]
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: kube-system
    
    • Install the ConfigMap
    # cat configmap.yaml 
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      namespace: kube-system
    data:
      prometheus.yml: |
        global:
          scrape_interval:     15s
          evaluation_interval: 15s
        scrape_configs:
    
        - job_name: 'kubernetes-apiservers'
          kubernetes_sd_configs:
          - role: endpoints
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
    
        - job_name: 'kubernetes-nodes'
          kubernetes_sd_configs:
          - role: node
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics
    
        - job_name: 'kubernetes-cadvisor'
          kubernetes_sd_configs:
          - role: node
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    
        - job_name: 'kubernetes-service-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
    
        - job_name: 'kubernetes-services'
          kubernetes_sd_configs:
          - role: service
          metrics_path: /probe
          params:
            module: [http_2xx]
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
            action: keep
            regex: true
          - source_labels: [__address__]
            target_label: __param_target
          - target_label: __address__
            replacement: blackbox-exporter.example.com:9115
          - source_labels: [__param_target]
            target_label: instance
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            target_label: kubernetes_name
    
        - job_name: 'kubernetes-ingresses'
          kubernetes_sd_configs:
          - role: ingress
          relabel_configs:
          - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
            regex: (.+);(.+);(.+)
            replacement: ${1}://${2}${3}
            target_label: __param_target
          - target_label: __address__
            replacement: blackbox-exporter.example.com:9115
          - source_labels: [__param_target]
            target_label: instance
          - action: labelmap
            regex: __meta_kubernetes_ingress_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_ingress_name]
            target_label: kubernetes_name
    
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
          - role: pod
          relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
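
    The __address__ rewrite used in the kubernetes-service-endpoints and kubernetes-pods jobs is easier to follow with a small simulation. The sketch below mimics a replace relabel step: it joins the source labels with ";", requires the regex to match the whole joined string, and writes the expanded replacement to the target label. It is an approximation in Python for illustration, not Prometheus code, and the sample address and port are made up.

```python
import re


def to_python_template(replacement):
    """Convert $1 / ${1} group references into Python's \\g<1> form."""
    return re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)


def relabel_replace(labels, source_labels, regex, replacement, target_label,
                    separator=";"):
    """Return a copy of labels with target_label rewritten if regex matches."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # the whole joined value must match
    if match is None:
        return dict(labels)  # no match: labels are left untouched
    updated = dict(labels)
    updated[target_label] = match.expand(to_python_template(replacement))
    return updated


PORT_ANNOTATION = "__meta_kubernetes_service_annotation_prometheus_io_port"

# Hypothetical discovered target with a prometheus.io/port annotation of 9100.
target = {"__address__": "10.0.0.5:8080", PORT_ANNOTATION: "9100"}
rewritten = relabel_replace(
    target,
    ["__address__", PORT_ANNOTATION],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
    "__address__",
)
print(rewritten["__address__"])
```

    When the annotation is missing, the joined value ends in ";" with no digits after it, the regex does not match, and the discovered address is kept as is.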
    
    • Deploy Prometheus
    ---
    # apps/v1 supersedes apps/v1beta2, which was removed in Kubernetes 1.16.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        name: prometheus-deployment
      name: prometheus
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: prometheus
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          containers:
          - image: prom/prometheus:v2.0.0
            name: prometheus
            command:
            - "/bin/prometheus"
            args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--storage.tsdb.retention=24h"
            ports:
            - containerPort: 9090
              protocol: TCP
            volumeMounts:
            - mountPath: "/prometheus"
              name: data
            - mountPath: "/etc/prometheus"
              name: config-volume
            resources:
              requests:
                cpu: 100m
                memory: 100Mi
              limits:
                cpu: 500m
                memory: 2500Mi
          serviceAccountName: prometheus    
          volumes:
          - name: data
            emptyDir: {}
          - name: config-volume
            configMap:
              name: prometheus-config 
    
    
    ---
    kind: Service
    apiVersion: v1
    metadata:
      labels:
        app: prometheus
      name: prometheus
      namespace: kube-system
    spec:
      type: NodePort
      ports:
      - port: 9090
        targetPort: 9090
        nodePort: 30003
      selector:
        app: prometheus
    

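    Testing

    Open http://<prometheus-address>:30003 for the cluster, and on the Graph page
    enter:

    node_cpu
    

    Running this query shows the CPU usage of the nodes (node_exporter 0.16 and later expose this metric as node_cpu_seconds_total), which confirms that Prometheus is monitoring the nodes successfully.
    The Targets page shows the sources from which Prometheus is scraping monitoring data.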
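
    The same check can be scripted against the Prometheus HTTP API rather than the web UI. The sketch below builds an instant-query URL for the /api/v1/query endpoint and reads the result with the standard library; the host name prometheus.example is a placeholder for whatever address exposes NodePort 30003 in your cluster.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, expr):
    """Return the instant-query URL for a PromQL expression."""
    query = urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def instant_query(base_url, expr, timeout=10):
    """Run an instant query and return the list of result series."""
    with urlopen(build_query_url(base_url, expr), timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Placeholder address; replace with a reachable node address.
print(build_query_url("http://prometheus.example:30003", "node_cpu"))
```

    Calling instant_query with the same arguments returns one entry per matching series, each with its label set and latest value.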

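    Alerting

    Prometheus handles alerting together with Alertmanager. When a monitored value crosses a configured threshold, Prometheus sends the alert to Alertmanager, which is responsible for delivering it.

    Alertmanager

    Overview

    Alertmanager and Prometheus are two separate components. The Prometheus server sends alerts to Alertmanager according to its alerting rules; Alertmanager then applies silencing, inhibition, and aggregation, and sends notifications through channels such as email, PagerDuty, and HipChat.

    The main steps for setting up alerts and notifications:

    • Install and configure Alertmanager
    • Configure Prometheus to talk to Alertmanager (the -alertmanager.url flag in Prometheus 1.x; the alerting section of prometheus.yml in 2.x)
    • Create alerting rules in Prometheus
    • Set up notification rules in Alertmanager

    Alert notification rules

    Alertmanager handles alerts sent by clients such as the Prometheus server. It takes care of deduplicating and grouping them and routing them to the correct receiver, such as email or Slack. Alertmanager also supports grouping, silencing, and inhibition of alerts.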

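    Grouping

    Grouping combines alerts of the same kind into a single notification. It is especially useful when many systems fail at once and hundreds or thousands of alerts may fire at the same time.
    For example, suppose tens or hundreds of service instances are running when a network failure occurs, and half of them can no longer reach the database. If the Prometheus alerting rules are configured to send an alert for every service instance, hundreds of alerts are sent to Alertmanager.

    As a user, however, you want to see a single page while still being able to tell exactly which instances are affected. Alertmanager can therefore be configured to group the alerts and send one compact notification.

    Alert grouping, the timing of grouped notifications, and the receivers of those notifications are configured through the routing tree in the Alertmanager configuration file.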
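
    The grouping idea can be sketched in a few lines: alerts that share the same values for the group_by labels end up in the same bucket, and one notification is sent per bucket. This is a simplified illustration with made-up alerts, not Alertmanager's implementation.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((name, alert["labels"].get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts fired by three rule evaluations.
alerts = [
    {"labels": {"alertname": "NodeCPUUsage", "instance": "node-1"}},
    {"labels": {"alertname": "NodeCPUUsage", "instance": "node-2"}},
    {"labels": {"alertname": "NodeMemoryUsage", "instance": "node-1"}},
]

for key, members in group_alerts(alerts, ["alertname"]).items():
    instances = [alert["labels"]["instance"] for alert in members]
    print(dict(key), "->", instances)
```

    Grouping by alertname yields two notifications here, and each one still lists the affected instances.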

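    Inhibition

    Inhibition suppresses notifications for certain alerts while other alerts are already firing, for example connection alerts from services that are caused by the network being unreachable.

    For example, when an alert fires saying that the whole cluster network is unreachable, Alertmanager can be configured in advance to mute all other alerts that result from it. This prevents notifications for hundreds or thousands of alerts that are unrelated to the actual problem.

    Inhibition is also configured through the Alertmanager configuration file.

    Silences

    Silences are a simple way to mute alerts for a given period of time. A silence is configured with matchers, just like the routing tree. Incoming alerts are checked against the equality or regular-expression matchers of each active silence, and if they match, no notification is sent for that alert.
    Silences can be configured in the Alertmanager web interface.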
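
    A silence boils down to a time window plus a list of matchers. The sketch below shows that check in Python; the dictionary layout (starts_at, ends_at, matchers, is_regex) is a hypothetical structure chosen for the example and does not mirror Alertmanager's API.

```python
import re
from datetime import datetime, timezone


def silence_applies(silence, alert_labels, now):
    """True if the silence is active at 'now' and every matcher matches."""
    if not (silence["starts_at"] <= now < silence["ends_at"]):
        return False
    for matcher in silence["matchers"]:
        value = alert_labels.get(matcher["name"], "")
        if matcher.get("is_regex"):
            if re.fullmatch(matcher["value"], value) is None:
                return False
        elif value != matcher["value"]:
            return False
    return True


# Hypothetical two-hour maintenance window for the node team.
silence = {
    "starts_at": datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc),
    "ends_at": datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc),
    "matchers": [
        {"name": "team", "value": "node"},
        {"name": "instance", "value": "node-[0-9]+", "is_regex": True},
    ],
}
noon = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(silence_applies(silence, {"team": "node", "instance": "node-7"}, noon))
```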

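    Receivers

    A receiver defines the channels through which users are notified. After alerts have been grouped and filtered, they are sent to the receiving users through the matching notification channel.

    Deployment

    Alert trigger rules

    Defining alerting rules
    In Prometheus 1.x, alerting rules are defined in the following format (Prometheus 2.0 expresses the same fields in YAML, as in the rules.yml example further down):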

    ALERT <alert name>
      IF <expression>
      [ FOR <duration> ]
      [ LABELS <label set> ]
      [ ANNOTATIONS <label set> ]
    
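    • The optional FOR clause makes Prometheus wait for the given duration between first encountering a new element in the expression's output vector (such as an instance with a high HTTP error rate) and counting an alert as firing for that element. Elements that are active but not yet firing are in the pending state.

    • The LABELS clause specifies a set of additional labels to attach to the alert. Any existing conflicting labels are overwritten, and label values can be templated.

    • The ANNOTATIONS clause specifies labels that store longer additional information, such as an alert description or a link. Annotation values can be templated.

    • Templating: label and annotation values can be templated using console templates. The $labels variable holds the label key/value pairs of an alert instance, and $value holds the evaluated value of the alert instance.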

    # To insert a firing element's label values:
    {{ $labels.<labelname> }}
    # To insert the numeric expression value of the firing element:
    {{ $value }}
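
    The pending and firing states introduced by the FOR clause can be summarized as a small state function. This is a simplified model that uses plain seconds for timestamps; the function and argument names are invented for the example.

```python
def alert_state(active_since, now, for_seconds):
    """Classify an alert element as inactive, pending, or firing.

    active_since is the time (in seconds) at which the expression first
    returned this element, or None if it is not currently returned.
    """
    if active_since is None:
        return "inactive"
    if now - active_since >= for_seconds:
        return "firing"
    return "pending"


# With a 2-minute FOR duration (120 seconds):
print(alert_state(None, 1000, 120))   # expression does not match
print(alert_state(1000, 1060, 120))   # matching for 60 seconds
print(alert_state(1000, 1120, 120))   # matching for 120 seconds
```

    Only elements that reach the firing state are sent to Alertmanager; if the expression stops matching while an element is still pending, no alert is sent.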
    

    Example:
    prometheus.yml configures how Prometheus communicates with Alertmanager as well as the alert trigger rules.

    Configure Prometheus to communicate with Alertmanager in prometheus.yml

    alerting:
      alertmanagers:
        - static_configs:
          - targets: ["<alertmanager-host>:9093"]
    

    Specify in prometheus.yml how often alerting rules are evaluated

    # How frequently to evaluate rules.
    [ evaluation_interval: <duration> | default = 1m ]
    

    Specify the rule files in prometheus.yml (wildcards are allowed, for example rules/*.rules)

    rule_files:
        - /etc/prometheus/rules.yml
    

    rule_files specifies the alerting rule files. Here rules.yml is mounted under /etc/prometheus through a ConfigMap:

    rules.yml: |
        groups:
        - name: test-rule
          rules:
          - alert: NodeFilesystemUsage
            expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
            for: 2m
            labels:
              team: node
            annotations:
              summary: "{{$labels.instance}}: High Filesystem usage detected"
              description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }})"
          - alert: NodeMemoryUsage
            expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80
            for: 2m
            labels:
              team: node
            annotations:
              summary: "{{$labels.instance}}: High Memory usage detected"
              description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }})"
          - alert: NodeCPUUsage
            # node-exporter is scraped by the kubernetes-service-endpoints job defined in the ConfigMap above.
            expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-service-endpoints",mode="idle"}[5m])) * 100)) > 80
            for: 2m
            labels:
              team: node
            annotations:
              summary: "{{$labels.instance}}: High CPU usage detected"
              description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }})"
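
    The arithmetic behind the NodeMemoryUsage expression is worth spelling out. The sketch below reproduces it for a single node, with made-up byte counts standing in for the node_memory_* samples.

```python
def memory_usage_percent(total, free, buffers, cached):
    """Used memory as a percentage, treating buffers and cache as reclaimable."""
    return (total - (free + buffers + cached)) / total * 100


# Hypothetical readings (any consistent unit works).
usage = memory_usage_percent(total=8000, free=1000, buffers=200, cached=300)
print(usage, usage > 80)
```

    The rule fires only after this value has stayed above 80 for the full 2m given in the for field.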
    

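    Once the configuration files are in place, Prometheus must be told to reload them. There are two ways: send a SIGHUP signal to the Prometheus process, or send an HTTP POST request to its /-/reload endpoint (in Prometheus 2.x this endpoint is only available when the server is started with --web.enable-lifecycle).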

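    Alert notification rules

    Global
    The configuration file to load is specified with the -config.file flag. The file is written in YAML and follows the scheme described below. Brackets indicate that a parameter is optional; for non-list parameters, the value is set to the specified default.

    Generic placeholders are defined as follows:

    <duration> : a duration matching the regular expression [0-9]+(ms|[smhdwy])
    <labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
    <labelvalue>: a string of Unicode characters
    <filepath>: a valid file path
    <boolean>: a boolean that can take the values true or false
    <string>: a regular string
    <tmpl_string>: a string that is template-expanded before use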
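
    As an illustration of the <duration> placeholder, the sketch below parses strings that match the regular expression above into milliseconds. Treating w as 7 days and y as 365 days is the convention assumed here.

```python
import re

# Milliseconds per unit; w = 7 days and y = 365 days are assumed conventions.
UNIT_MS = {
    "ms": 1,
    "s": 1000,
    "m": 60 * 1000,
    "h": 60 * 60 * 1000,
    "d": 24 * 60 * 60 * 1000,
    "w": 7 * 24 * 60 * 60 * 1000,
    "y": 365 * 24 * 60 * 60 * 1000,
}


def parse_duration_ms(text):
    """Parse a duration such as '30s' or '3h' into milliseconds."""
    match = re.fullmatch(r"([0-9]+)(ms|[smhdwy])", text)
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return int(match.group(1)) * UNIT_MS[match.group(2)]


for sample in ("30s", "1m", "3h"):
    print(sample, parse_duration_ms(sample))
```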
    

    Parameters in the global section apply in all other configuration contexts, where they serve as defaults and can be overridden.

    global:
          resolve_timeout: 30s
          smtp_smarthost: "smtp.163.com:25"
          smtp_from: 'xiyanxiyan10@163.com'
          smtp_auth_username: "xiyanxiyan10@163.com" 
          smtp_auth_password: "xiyanxiyan10" 
          smtp_require_tls: false
    

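    Route
    A route block defines a node in the routing tree and its children. Optional configuration parameters that are not set are inherited from the parent node.

    Every alert enters the routing tree at the configured top-level route, which must match all alerts (that is, it has no matchers configured). The alert then traverses the child nodes. If continue is set to false, traversal stops after the first matching child; if continue is true, the alert goes on to be matched against subsequent siblings. If an alert does not match any child of a node (there are no matching children, or none exist), the alert is handled according to the configuration of the current node.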

        route:
          receiver: mailhook
          group_wait: 30s
          group_interval: 1m
          repeat_interval: 1m
          # group_by takes label names; the alert name is carried in the alertname label.
          group_by: [alertname]
    
          routes:
          - receiver: mailhook
            group_wait: 10s
            match:
              team: node
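
    The traversal described above (depth-first, stopping at the first matching child unless continue is set) can be sketched as a short recursive function. The route tree below uses hypothetical receiver names and only equality matchers; it illustrates the rule and is not a port of Alertmanager's code.

```python
def match_route(route, labels):
    """Return the list of routes that should handle an alert with these labels."""
    matchers = route.get("match", {})
    if not all(labels.get(name) == value for name, value in matchers.items()):
        return []
    matched = []
    for child in route.get("routes", []):
        child_matches = match_route(child, labels)
        matched.extend(child_matches)
        # Stop at the first matching child unless it asks to continue.
        if child_matches and not child.get("continue", False):
            break
    # No child matched: the current node handles the alert.
    return matched or [route]


# Hypothetical tree: a catch-all root with two team-specific children.
tree = {
    "receiver": "default",
    "routes": [
        {"receiver": "node-team", "match": {"team": "node"}},
        {"receiver": "db-team", "match": {"team": "db"}},
    ],
}

for labels in ({"team": "node"}, {"team": "web"}):
    print(labels, [r["receiver"] for r in match_route(tree, labels)])
```

    An alert labelled team=node is handled by node-team, while an alert that matches no child falls back to the root's default receiver.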
    

    Set up route and receivers in alertmanager.yml

    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      group_by: ['alertname']
    
      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first 
      # notification.
      group_wait: 5s
    
      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 1m
    
      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h 
    
      # A default receiver
      receiver: mengyuan
    
    receivers:
    - name: 'mengyuan'
      webhook_configs:
      - url: http://192.168.0.53:8080
      email_configs:
      - to: 'xiyanxiyan10@hotmail.com'
    

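    Route configuration format

    # The receiver for alerts that match this route.
    [ receiver: <string> ]
    
    # The labels by which incoming alerts are grouped together.
    [ group_by: '[' <labelname>, ... ']' ]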
    
    # Whether an alert should continue matching subsequent sibling nodes.
    [ continue: <boolean> | default = false ]
    
    # A set of equality matchers an alert has to fulfill to match the node.
    match:
      [ <labelname>: <labelvalue>, ... ]
    
    # A set of regex-matchers an alert has to fulfill to match the node.
    match_re:
      [ <labelname>: <regex>, ... ]
    
    # How long to initially wait to send a notification for a group
    # of alerts. Allows to wait for an inhibiting alert to arrive or collect
    # more initial alerts for the same group. (Usually ~0s to few minutes.)
    [ group_wait: <duration> ]
    
    # How long to wait before sending a notification about new alerts that
    # are added to a group of alerts for which an initial notification
    # has already been sent. (Usually ~5min or more.)
    [ group_interval: <duration> ]
    
    # How long to wait before sending a notification again if it has already
    # been sent successfully for an alert. (Usually ~3h or more).
    [ repeat_interval: <duration> ]
    
    # Zero or more child routes.
    routes:
      [ - <route> ... ]
    

    Example:

    // Match does a depth-first left-to-right search through the route tree
    // and returns the matching routing nodes.
    func (r *Route) Match(lset model.LabelSet) []*Route {

    Alert

    An Alert is an alert received by Alertmanager. Its type is as follows.
    
    // Alert is a generic representation of an alert in the Prometheus eco-system.
    type Alert struct {
        // Label value pairs for purpose of aggregation, matching, and disposition
        // dispatching. This must minimally include an "alertname" label.
        Labels LabelSet `json:"labels"`
    
        // Extra key/value information which does not define alert identity.
        Annotations LabelSet `json:"annotations"`
    
        // The known time range for this alert. Both ends are optional.
        StartsAt     time.Time `json:"startsAt,omitempty"`
        EndsAt       time.Time `json:"endsAt,omitempty"`
        GeneratorURL string    `json:"generatorURL"`
    }
    

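    Only alerts with identical labels (the same keys and the same values) are considered the same alert. A single rule in a Prometheus rules file can therefore produce several distinct alerts.

    Inhibition rules (inhibit_rule)

    An inhibition rule mutes alerts that match one set of matchers (the target) while an alert matching another set of matchers (the source) exists. Both alerts must have the same values for the labels listed in equal.

    Inhibition configuration format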

    # Matchers that have to be fulfilled in the alerts to be muted.
    target_match:
      [ <labelname>: <labelvalue>, ... ]
    target_match_re:
      [ <labelname>: <regex>, ... ]
    
    # Matchers for which one or more alerts have to exist for the
    # inhibition to take effect.
    source_match:
      [ <labelname>: <labelvalue>, ... ]
    source_match_re:
      [ <labelname>: <regex>, ... ]
    
    # Labels that must have an equal value in the source and target
    # alert for the inhibition to take effect.
    [ equal: '[' <labelname>, ... ']' ]
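
    How the three parts of an inhibition rule fit together can be sketched as follows. The function takes the labels of the alert to be checked, the label sets of the alerts that are currently firing, and a rule with source_match, target_match, and equal; regex matchers are left out for brevity, and the severity and cluster labels are invented for the example.

```python
def matches(labels, matchers):
    """True if every matcher label has exactly the given value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target_labels, firing_alerts, rule):
    """True if a firing source alert mutes the target under this rule."""
    if not matches(target_labels, rule.get("target_match", {})):
        return False
    for source_labels in firing_alerts:
        if not matches(source_labels, rule.get("source_match", {})):
            continue
        if all(source_labels.get(name, "") == target_labels.get(name, "")
               for name in rule.get("equal", [])):
            return True
    return False


# Hypothetical rule: a critical alert mutes warnings in the same cluster.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster"],
}
firing = [{"alertname": "NetworkDown", "severity": "critical", "cluster": "a"}]

print(is_inhibited({"alertname": "DbConn", "severity": "warning", "cluster": "a"}, firing, rule))
print(is_inhibited({"alertname": "DbConn", "severity": "warning", "cluster": "b"}, firing, rule))
```

    The warning in cluster a is muted because a critical alert with the same cluster label is firing; the warning in cluster b is still delivered.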
    

    Receiver

    As the name suggests, a receiver configures where alerts are delivered.
    General configuration format

    # The unique name of the receiver.
    name: <string>
    
    # Configurations for several notification integrations.
    email_configs:
      [ - <email_config>, ... ]
    pagerduty_configs:
      [ - <pagerduty_config>, ... ]
    slack_configs:
      [ - <slack_config>, ... ]
    opsgenie_configs:
      [ - <opsgenie_config>, ... ]
    webhook_configs:
      [ - <webhook_config>, ... ]
    
    Email receiver email_config
    
    # Whether or not to notify about resolved alerts.
    [ send_resolved: <boolean> | default = false ]
    
    # The email address to send notifications to.
    to: <tmpl_string>
    # The sender address.
    [ from: <tmpl_string> | default = global.smtp_from ]
    # The SMTP host through which emails are sent.
    [ smarthost: <string> | default = global.smtp_smarthost ]
    
    # The HTML body of the email notification.
    [ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ] 
    
    # Further headers email header key/value pairs. Overrides any headers
    # previously set by the notification implementation.
    [ headers: { <string>: <tmpl_string>, ... } ]
    
    

    Slack receiver slack_config

    # Whether or not to notify about resolved alerts.
    [ send_resolved: <boolean> | default = true ]
    
    # The Slack webhook URL.
    [ api_url: <string> | default = global.slack_api_url ]
    
    # The channel or user to send notifications to.
    channel: <tmpl_string>
    
    # API request data as defined by the Slack webhook API.
    [ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
    [ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}' ]
    [ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
    [ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
    [ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
    [ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
    [ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]
    

    Webhook receiver webhook_config

    # Whether or not to notify about resolved alerts.
    [ send_resolved: <boolean> | default = true ]
    
    # The endpoint to send HTTP POST requests to.
    url: <string>
    

    Alertmanager sends HTTP POST requests to the configured endpoint in the following format:

    {
      "version": "3",
      "groupKey": <number>,    // key identifying the group of alerts (e.g. to deduplicate)
      "status": "<resolved|firing>",
      "receiver": <string>,
      "groupLabels": <object>,
      "commonLabels": <object>,
      "commonAnnotations": <object>,
      "externalURL": <string>,  // backlink to the Alertmanager.
      "alerts": [
        {
          "labels": <object>,
          "annotations": <object>,
          "startsAt": "<rfc3339>",
          "endsAt": "<rfc3339>"
        },
        ...
      ]
    }
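
    A webhook endpoint only has to accept this JSON body. The sketch below, using the Python standard library, turns a notification into one summary line per alert and wraps that in a minimal HTTP handler. The handler, the port, and the sample payload are assumptions for illustration; the field names come from the format above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_notification(body):
    """Return one '[status] alertname: summary' line per alert in the payload."""
    payload = json.loads(body)
    status = payload.get("status", "unknown")
    lines = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "<unnamed>")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{status}] {name}: {summary}")
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize_notification(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Block and accept notifications; call this from your own entry point."""
    HTTPServer(("", port), WebhookHandler).serve_forever()


# Hypothetical notification body for a single firing alert.
SAMPLE_BODY = json.dumps({
    "version": "3",
    "status": "firing",
    "receiver": "mengyuan",
    "alerts": [{
        "labels": {"alertname": "NodeCPUUsage", "instance": "node-1"},
        "annotations": {"summary": "node-1: High CPU usage detected"},
    }],
})
print(summarize_notification(SAMPLE_BODY))
```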
    

    Example:

     receivers:
        - name: mailhook
          email_configs:
          - to: "xiyanxiyan10@hotmail.com"
            html: '{{ template "alert.html" . }}'
            headers: { Subject: "[WARN] Warn Email" }
    

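    Displaying cluster information

    The overall state of the cluster is now collected and aggregated in Prometheus. Prometheus, however, mainly exposes interfaces for retrieving the data and does not itself provide full-featured graphical display, so a dashboard tool has to be connected to Prometheus to visualize the cluster information.

    Grafana

    Overview

    Grafana is an open-source application written in Go, used mainly to visualize metric data at scale. It was released under the business-friendly Apache License 2.0 (releases from 8.0 onward use AGPLv3). Grafana has pluggable panels and extensible data sources, and already supports most commonly used time-series databases.

    Supported data sources

    As mentioned above, Grafana supports many data sources. The main ones are: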

    1. Graphite
    2. Elasticsearch
    3. CloudWatch
    4. InfluxDB
    5. OpenTSDB
    6. Prometheus

    Deployment

    Deploying the service

    # cat grafana-deploy.yaml 
    
    ---
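    # apps/v1 replaces extensions/v1beta1, which no longer serves Deployments as of Kubernetes 1.16.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana-core
      namespace: kube-system
      labels:
        app: grafana
        component: core
    spec:
      replicas: 1
      # apps/v1 requires an explicit selector matching the pod template labels.
      selector:
        matchLabels:
          app: grafana
          component: core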
      template:
        metadata:
          labels:
            app: grafana
            component: core
        spec:
          containers:
          - image: grafana/grafana:4.2.0
            name: grafana-core
            imagePullPolicy: IfNotPresent
            # env:
            resources:
              # keep request = limit to keep this container in guaranteed class
              limits:
                cpu: 100m
                memory: 100Mi
              requests:
                cpu: 100m
                memory: 100Mi
            env:
              # The following env variables set up basic auth with the default admin user and admin password.
              - name: GF_AUTH_BASIC_ENABLED
                value: "true"
              - name: GF_AUTH_ANONYMOUS_ENABLED
                value: "false"
              # - name: GF_AUTH_ANONYMOUS_ORG_ROLE
              #   value: Admin
              # does not really work, because of template variables in exported dashboards:
              # - name: GF_DASHBOARDS_JSON_ENABLED
              #   value: "true"
            readinessProbe:
              httpGet:
                path: /login
                port: 3000
              # initialDelaySeconds: 30
              # timeoutSeconds: 1
            volumeMounts:
            - name: grafana-persistent-storage
              mountPath: /var
          volumes:
          - name: grafana-persistent-storage
            emptyDir: {}
    --- 
    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: kube-system
      labels:
        app: grafana
        component: core
    spec:
      type: NodePort
      ports:
        - port: 3000
          targetPort: 3000
          nodePort: 30009
      selector:
        app: grafana
    

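    Configuring Grafana

    Opening the port on which Grafana is exposed shows the following page

     
     

    The default username and password are both admin

     
     

    Configure Prometheus as the data source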
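
    The data source can also be created through Grafana's HTTP API instead of the web form. The sketch below builds a POST request for the /api/datasources endpoint with basic auth. The Grafana address and the in-cluster Prometheus URL are assumptions derived from the Services defined earlier, and the payload fields should be checked against the Grafana version in use.

```python
import base64
import json
from urllib.request import Request, urlopen


def datasource_payload(name, prometheus_url):
    """Body for creating a Prometheus data source that Grafana proxies to."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",
        "isDefault": True,
    }


def build_create_request(grafana_url, payload, user="admin", password="admin"):
    """Build (but do not send) the POST request for /api/datasources."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


# Assumed addresses: Grafana on NodePort 30009, Prometheus via its in-cluster Service.
request = build_create_request(
    "http://grafana.example:30009",
    datasource_payload("prometheus", "http://prometheus.kube-system:9090"),
)
print(request.get_method(), request.full_url)
```

    Passing the request to urlopen(request) sends it; change the default admin password before exposing Grafana beyond a test cluster.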

     
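     Import a dashboard: either enter template ID 315 to import it online, or download the corresponding JSON template file and import it locally. The dashboard template is available at https://grafana.com/dashboards/315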
     
     

    Once the dashboard is imported, the corresponding monitoring data is displayed.


     
     

    Sample configuration

    Configuration source files

  • Original article: https://www.cnblogs.com/sunsky303/p/11276349.html