  • Prometheus + node_exporter + Alertmanager + prometheus-webhook-dingtalk + Grafana (non-container setup): notes on building a simple monitoring and alerting platform

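    I. Purpose

    To get to know a currently popular monitoring system by setting one up.

    II. Environment

    A virtual machine.

    III. Setup, configuration, and debugging

    1. Download page for the Prometheus packages: https://prometheus.io/download/

    2. Download page for Grafana: https://grafana.com/grafana/download

    3. Installation

    (1) Download, unpack, and install Prometheus (tutorials are easy to find online, so these notes skip that part); then configure and start Prometheus.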

      prometheus.yml

    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
    # Alertmanager configuration (not connected yet)
    # alerting:
    #   alertmanagers:
    #     - static_configs:
    #       - targets:
    #         - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    # rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
        - targets: ['localhost:9090']

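    *Note: targets uses localhost rather than the server's IP because a hard-coded IP stops working as soon as the server's IP changes.*

    To start Prometheus, go to the installation directory and run: ./prometheus --config.file=prometheus.yml &

    netstat -tpln shows that port 9090 is now being listened on; Prometheus can be reached at ip:9090.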
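
    The port check can also be scripted. The following is a minimal sketch, assuming the usual column layout of `netstat -tpln` or `ss -tln` output; the function name `port_is_listening` is hypothetical and not part of any of the tools used here.

```shell
# Reads `netstat -tpln` or `ss -tln` style output on stdin and succeeds
# if some socket is listening on the given TCP port.
port_is_listening() {
    local suffix=":$1"
    awk -v p="$suffix" '
        /LISTEN/ {
            # A listening address looks like "0.0.0.0:9090" or ":::9090",
            # so look for a field that ends in ":<port>".
            for (i = 1; i <= NF; i++)
                if (substr($i, length($i) - length(p) + 1) == p) found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Usage (not run here):
#   netstat -tpln | port_is_listening 9090 && echo "Prometheus is listening"
```

    The same function works for the other ports in these notes (9100, 8060, 9093, 9091).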

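    (2) Install and start node_exporter (tutorials are easy to find online, so these notes skip that part), then connect it to Prometheus.

    To start node_exporter, go to the installation directory and run: ./node_exporter &

    netstat -tpln shows that port 9100 is now being listened on.

    Modify the Prometheus configuration, restart Prometheus, and check at ip:9090 that the node_exporter target has been added and is up.

    prometheus.yml: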

    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration (not connected yet)
    # alerting:
    #   alertmanagers:
    #     - static_configs:
    #       - targets:
    #         - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    # rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
        - targets: ['localhost:9090']
      - job_name: 'node_self'
        scheme: http
        #tls_config:
          #ca_file: node_exporter.crt
        static_configs:
        - targets: ['localhost:9100']
    

    After restarting Prometheus, seeing the following at ip:9090 means everything is normal.

    image

    image
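
    The target status can also be checked from the command line through the Prometheus HTTP API (`/api/v1/query`). This is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090; the function name `prom_query` and the `CURL` override variable are hypothetical conveniences, not part of Prometheus.

```shell
# Run an instant query against the Prometheus HTTP API and print the raw JSON.
# $1: PromQL expression, $2: optional base URL (assumed default: localhost:9090).
# CURL can be overridden, for example to dry-run the call without a network.
prom_query() {
    local expr="$1" base="${2:-http://localhost:9090}"
    "${CURL:-curl}" -s --get --data-urlencode "query=${expr}" "${base}/api/v1/query"
}

# Usage (not run here):
#   prom_query up
```

    In the JSON that comes back, each scraped target appears once with its `job` and `instance` labels; a value of "1" for `up` means the last scrape succeeded and "0" means it failed.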

    (3) Install and configure Alertmanager and prometheus-webhook-dingtalk to collect alerts and push alert messages to DingTalk; then modify the Prometheus configuration to connect to Alertmanager and add the alert rules file rules.yml.

    Install and start prometheus-webhook-dingtalk.

    To start it, go to the installation directory and run:

    nohup ./prometheus-webhook-dingtalk --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx" --ding.profile="dev_dingding=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxx2" >dingding.log 2>&1 &

    Notes: the two URLs, https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx and https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxx2, are the webhook endpoints of robots you create yourself in DingTalk. Several robots can be specified when the webhook starts; each --ding.profile must have a distinct name (here ops_dingding and dev_dingding).

    After startup it listens on port 8060 by default.

    (4) Configure alertmanager.yml and start the Alertmanager service.

    alertmanager.yml

    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
      - receiver: 'test.yaya'
        match:
          priority: P0
        continue: true
      - receiver: 'web.hook'
        match:
          priority: P0
        continue: true
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://127.0.0.1:8060/dingtalk/ops_dingding/send'
    - name: 'test.yaya'
      webhook_configs:
      - url: 'http://127.0.0.1:8060/dingtalk/dev_dingding/send'
    #inhibit_rules:
    #  - source_match:
    #      severity: 'critical'
    #    target_match:
    #      severity: 'warning'
    #    equal: ['alertname','dev', 'instance']

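    *Note: in routes, without continue: true on the test.yaya and web.hook routes, matching stops at the first route that matches and the later routes are never evaluated. Each url is the prometheus-webhook-dingtalk endpoint that receives the alert, one per robot: ops_dingding and dev_dingding.*

    Go to the installation directory and run ./alertmanager --config.file=alertmanager.yml &; if port 9093 is being listened on, the service has started normally.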
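
    Before any Prometheus rule fires, the routing can be exercised by posting an alert to Alertmanager by hand. This is a minimal sketch, assuming an Alertmanager version that serves the v2 API at http://localhost:9093/api/v2/alerts; the function names, the alert name `ManualTest`, and the `CURL` and `ALERTMANAGER_URL` variables are hypothetical.

```shell
# Build a one-element alert list in the Alertmanager v2 API format.
# $1: alert name (hypothetical default), $2: value of the "priority" label.
test_alert_json() {
    local name="${1:-ManualTest}" priority="${2:-P0}"
    printf '[{"labels":{"alertname":"%s","priority":"%s"},"annotations":{"summary":"manual test alert"}}]\n' \
        "$name" "$priority"
}

# Post the test alert to Alertmanager (assumed default: localhost:9093).
send_test_alert() {
    test_alert_json "$@" |
        "${CURL:-curl}" -s -X POST -H 'Content-Type: application/json' \
            --data-binary @- "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
}

# Usage (not run here):
#   send_test_alert ManualTest P0
```

    With the routes above, an alert labelled priority=P0 should reach both receivers once group_wait has elapsed, because each matching route sets continue: true.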

    Configure prometheus.yml to connect to Alertmanager:

    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
          - targets: ['localhost:9093']
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "rules.yml"
      # - "first_rules.yml"
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
        - targets: ['localhost:9090']
      - job_name: 'node_self'
        scheme: http
        #tls_config:
          #ca_file: node_exporter.crt
        static_configs:
        - targets: ['localhost:9100']

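    *Note: rule_files specifies the alert rule files; relative paths are resolved against the directory of the configuration file, here the Prometheus installation directory.*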
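
    Both files can be validated before restarting Prometheus with promtool, which ships in the same tarball as the prometheus binary. This is a minimal sketch, assuming it is run from the installation directory; the function name `check_prometheus_files` and the `PROMTOOL` override variable are hypothetical.

```shell
# Validate the main configuration, then the rule file, with promtool.
# $1: config file (default prometheus.yml), $2: rule file (default rules.yml).
# The rule file is only checked if the configuration check passes.
check_prometheus_files() {
    local promtool="${PROMTOOL:-./promtool}"
    "$promtool" check config "${1:-prometheus.yml}" &&
        "$promtool" check rules "${2:-rules.yml}"
}

# Usage (not run here):
#   check_prometheus_files && echo "configuration looks valid"
```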

    rules.yml

    groups:
    - name: "service alert test"
      rules:
      - alert: "memory usage alert"
        expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 40
        for: 1m
        labels:
          #token: {{ .Values.prometheus.prometheusSpec.externalLabels.env }}-bigdata
          priority: P0
          status: alerting
        annotations:
          description: "Big data alert: IP address {{$labels.instance}}, memory usage above 40% (current: {{$value}}%)"
          summary: "Big data alert: memory usage above 40% (current: {{$value}}%)"

    *Note: node_memory_MemAvailable_bytes and the other metrics in rules.yml are collected by node_exporter; search online for more details.*
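
    To see what value the rule expression should produce on a given host, the same arithmetic can be done locally on /proc/meminfo, which is where node_exporter reads these values on Linux. This is a minimal sketch; the function name `mem_used_percent` is hypothetical.

```shell
# Read /proc/meminfo style text on stdin and print the memory usage in
# percent, using the same formula as the alert rule:
#   100 - (MemAvailable * 100 / MemTotal)
mem_used_percent() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total > 0) { printf "%.2f\n", 100 - (avail * 100) / total }
            else           { exit 1 }
        }'
}

# Usage (not run here):
#   mem_used_percent < /proc/meminfo
```

    If the printed value stays above 40 for longer than the rule's for: 1m, the alert is expected to move from pending to firing.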

    Restart Prometheus.

    Open ip:9090 in a browser.

    image

    When the alert moves from pending to firing and the alert message arrives in DingTalk, everything is working.


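    (5) Install Grafana and graph selected node_exporter and Pushgateway metrics.

    Install Grafana (search online for instructions); start it with systemctl start grafana.

    The initial login user name and password are admin/admin.

    After installation, configure Prometheus as the data source, then download a basic node_exporter monitoring dashboard JSON file and import it into Grafana; this produces monitoring graphs from the node_exporter data.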
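
    The data source can also be created without the web UI, through Grafana's HTTP API (`POST /api/datasources`). This is a minimal sketch, assuming Grafana listens on its default port 3000, the admin/admin login is still valid, and Prometheus is at http://localhost:9090; the function names and the `CURL` and `GRAFANA_URL` variables are hypothetical.

```shell
# Build the JSON body describing a Prometheus data source.
# $1: Prometheus URL (assumed default: http://localhost:9090).
prometheus_datasource_json() {
    local url="${1:-http://localhost:9090}"
    printf '{"name":"Prometheus","type":"prometheus","url":"%s","access":"proxy"}\n' "$url"
}

# Create the data source through the Grafana HTTP API using basic auth.
add_prometheus_datasource() {
    prometheus_datasource_json "$1" |
        "${CURL:-curl}" -s -u admin:admin -H 'Content-Type: application/json' \
            -X POST --data-binary @- "${GRAFANA_URL:-http://localhost:3000}/api/datasources"
}

# Usage (not run here):
#   add_prometheus_datasource http://localhost:9090
```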


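    Connect Pushgateway data for custom monitoring graphs.

    1. Install Pushgateway and start the service; it listens on port 9091.

    Collect custom monitoring data and write it to Pushgateway: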

    #!/bin/bash
    # test.sh: push the memory usage percentage to Pushgateway.
    # The last column of the "Mem" line in `free -m` is the available memory.
    avl=$(free -m | awk '/^Mem/ {print $NF}')
    total=$(free -m | awk '/^Mem/ {print $2}')
    # Used share = 100 - available share, so the value matches the metric name.
    res=$(echo "scale=3; 100 - ${avl} * 100 / ${total}" | bc)
    echo "Mem_use_persent ${res}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/wx_job

    #!/bin/bash
    # /wuxiao/jb/jk_disk.sh: push the usage percentage of the root filesystem.
    res=$(df -h | grep -E "/$" | awk '{print $5}' | awk -F"%" '{print $1}')
    echo "disk_jk_use_persent ${res}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/jk_disk_use
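
    The two pushes follow the same pattern, so they can share one helper. This is a minimal sketch, assuming Pushgateway at http://127.0.0.1:9091; the function names `push_gauge` and `delete_group` and the `CURL` and `PUSHGATEWAY_URL` variables are hypothetical. Pushgateway keeps a pushed value until it is overwritten or deleted, so the sketch also shows how to remove a group.

```shell
# Push a single gauge sample to Pushgateway under the given job name.
# $1: job, $2: metric name, $3: value.
push_gauge() {
    local job="$1" name="$2" value="$3"
    printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$value" |
        "${CURL:-curl}" -s --data-binary @- \
            "${PUSHGATEWAY_URL:-http://127.0.0.1:9091}/metrics/job/${job}"
}

# Delete every metric that was pushed under the given job name.
delete_group() {
    "${CURL:-curl}" -s -X DELETE \
        "${PUSHGATEWAY_URL:-http://127.0.0.1:9091}/metrics/job/$1"
}

# Usage (not run here):
#   push_gauge wx_job Mem_use_persent 42.5
#   delete_group wx_job
```

    Run from cron, such a script refreshes the value on a schedule; Prometheus then picks it up on its next scrape of Pushgateway.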

    Add Pushgateway as a scrape target in prometheus.yml and restart Prometheus:

    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
          - targets: ['localhost:9093']
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "rules.yml"
      # - "first_rules.yml"
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
        - targets: ['localhost:9090']
      - job_name: 'node_self'  
        scheme: http
        #tls_config:
          #ca_file: node_exporter.crt
        static_configs: 
        - targets: ['localhost:9100']
      - job_name: 'pushgateway' 
        static_configs:
          - targets: ['localhost:9091']
            labels:
              instance: pushgateway

    Log in to Grafana and configure it.

    image

    image

    image

    *Note: the data source and the metric being monitored must be filled in correctly.*

    image

    image

    Save once the configuration is done.

  • Original article: https://www.cnblogs.com/wx2276/p/prometheus.html