  • Prometheus in practice:

    I. Installation:

    https://prometheus.io/download/ ### download the release archive and unpack it

    https://grafana.com/grafana/dashboards ### search for dashboards whose data source is Prometheus

    Downloaded here: prometheus, node_exporter, alertmanager, pushgateway

    The machine also needs Docker installed:

    yum install docker -y
    systemctl start docker.service

    Install Grafana:

    wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.3.4-1.x86_64.rpm
    yum localinstall grafana-5.3.4-1.x86_64.rpm 
    systemctl start grafana-server

    II. Configuration:

    1. Prometheus configuration:

    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      scrape_timeout: 10s     
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
           - 127.0.0.1:9093
    rule_files:
      - "./rules/rule_*.yml"
      # - "second_rules.yml"
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
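      - job_name: 'prometheus'  # required; every series scraped by this job automatically gets the label `job=prometheus`
        # metrics_path defaults to '/metrics'; this is the path scraped on each target, change it if your service exposes metrics elsewhere
        # scheme defaults to 'http'.
        static_configs:   # static configuration: list the IP and port of each scrape target directly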
        - targets: ['localhost:9090']
      - job_name: gateway
        static_configs: 
        - targets: ['127.0.0.1:9091']
          labels:   ## attach a label; instance is set to 'gateway'
            instance: gateway
      - job_name: node_export
        file_sd_configs: 
          #refresh_interval: 1m # interval at which the discovery files are re-read
          - files:
            - /data/prometheus-2.12.0.linux-amd64/node_discovery.json
      - job_name: mysql_discovery
        file_sd_configs: 
          #refresh_interval: 1m # interval at which the discovery files are re-read
          - files:
            - /data/prometheus-2.12.0.linux-amd64/mysql_discovery.json
      - job_name: redis_discovery
        file_sd_configs: 
          #refresh_interval: 1m # interval at which the discovery files are re-read
          - files:
            - /data/prometheus-2.12.0.linux-amd64/redis_discovery.json

    Template for each discovery file:

    [{"targets": ["127.0.0.1:9100"],"labels": {"instance": "test","idc": "beijing"}},{"targets": ["127.0.0.1:9101"],"labels": {"instance": "test2","idc": "beijing"}}]
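Discovery files in this shape are easy to generate from an inventory instead of editing them by hand. The following is a minimal Python sketch under stated assumptions: the function name `build_file_sd` and the `HOSTS` list are hypothetical and only mirror the template above.

```python
import json


def build_file_sd(hosts):
    """Turn (address, instance, idc) tuples into Prometheus file_sd entries."""
    return [
        {"targets": [address], "labels": {"instance": instance, "idc": idc}}
        for address, instance, idc in hosts
    ]


# Hypothetical inventory; replace it with your own source of truth.
HOSTS = [
    ("127.0.0.1:9100", "test", "beijing"),
    ("127.0.0.1:9101", "test2", "beijing"),
]

# Redirect this output into the file named under file_sd_configs.
print(json.dumps(build_file_sd(HOSTS), indent=2))
```

Prometheus re-reads file_sd files when they change, so regenerating the JSON should be enough to add or remove targets.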


    2. Alertmanager configuration:

    global:
      resolve_timeout: 5m
    #templates:
    #  - 'demo.tmpl'
    route:
      receiver: webhook
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      group_by: ['alertname']
      routes:
      - receiver: webhook
        group_wait: 10s
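        match_re:   # `match` compares values literally; a regex alternation needs `match_re`
          job: mysql|kubernetes   # Prometheus attaches the label `job`, not `job_name`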
      - receiver: 'webhook-kafka'
        group_by: [instance, alertname]
        match_re:
          instance: ^kafka-(.*)
    receivers:
    - name: webhook
      webhook_configs:
      - url: http://localhost:8060/dingtalk/ops_dingding/send 
        send_resolved: true
    - name: webhook-kafka
      webhook_configs:
      - url: http://localhost:8062/dingtalk/ops_59/send
        send_resolved: true
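One way to check that the routing tree behaves as intended is to post a hand-made alert to Alertmanager and see which receiver fires. The sketch below assumes Alertmanager listens on 127.0.0.1:9093 and exposes the v2 API; the helper names and the `RouteCheck` alert are hypothetical.

```python
import json
import urllib.request


def build_test_alert(alertname, instance, summary):
    """Build the JSON body accepted by Alertmanager's POST /api/v2/alerts."""
    return [
        {
            "labels": {"alertname": alertname, "instance": instance},
            "annotations": {"summary": summary},
        }
    ]


def send_test_alert(alerts, base_url="http://127.0.0.1:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# An instance label starting with "kafka-" should be routed to webhook-kafka.
payload = build_test_alert("RouteCheck", "kafka-01", "routing test")
```

Calling `send_test_alert(payload)` against a running Alertmanager should deliver the alert to the `webhook-kafka` receiver once `group_wait` has elapsed.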


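    #### Note: prefer file-based discovery where possible so the main config file does not grow too long

    - job_name: mysql_discovery
      file_sd_configs:
        #refresh_interval: 1m # interval at which the discovery files are re-read
        - files:
          - /data/prometheus-2.12.0.linux-amd64/mysql_discovery.json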

    III. Starting the services:

    1. Starting the Prometheus server:

    nohup sh start.sh > prometheus.log 2>&1 &

    -- contents of start.sh:

    ./prometheus --storage.tsdb.path=./data --storage.tsdb.retention.time=168h --web.enable-lifecycle --storage.tsdb.no-lockfile
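The `--web.enable-lifecycle` flag in the command above turns on the HTTP lifecycle endpoints, so a changed prometheus.yml or rule file can be reloaded without restarting the server. A minimal Python sketch, assuming Prometheus listens on 127.0.0.1:9090; the function names are hypothetical.

```python
import urllib.request


def build_reload_request(base_url="http://127.0.0.1:9090"):
    """Build the POST request for the Prometheus /-/reload lifecycle endpoint."""
    return urllib.request.Request(base_url + "/-/reload", data=b"", method="POST")


def reload_prometheus(base_url="http://127.0.0.1:9090"):
    """Ask a running Prometheus to re-read its configuration; return the HTTP status."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=10) as response:
        return response.status
```

Calling `reload_prometheus()` after editing the configuration should return 200 when the new files parse cleanly; a configuration error is reported as an HTTP error instead.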


    2. Starting Alertmanager:

    nohup sh start.sh > alertmanager.log 2>&1 &

    -- contents of start.sh:

    ./alertmanager --config.file="alertmanager.yml" 

    Starting the remaining services:

    nohup ./node_exporter &
    nohup ./pushgateway &
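Once pushgateway is running, short-lived jobs can push their metrics to it, and Prometheus scrapes them through the `gateway` job. The sketch below assumes the gateway listens on 127.0.0.1:9091; the metric name, job name, and helper functions are hypothetical.

```python
import urllib.request


def format_metric(name, value, help_text):
    """Render one untyped sample in the Prometheus text exposition format."""
    return f"# HELP {name} {help_text}\n{name} {value}\n"


def push_metric(body, job, base_url="http://127.0.0.1:9091"):
    """PUT the metric body to the Pushgateway under the given job name."""
    request = urllib.request.Request(
        f"{base_url}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Hypothetical metric, for illustration only.
body = format_metric("backup_last_success_timestamp_seconds", 1600000000, "Last successful backup.")
```

`push_metric(body, "nightly_backup")` would then replace all metrics stored for the `nightly_backup` group; use the POST method instead of PUT to replace only metrics with the same name.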
    --------------------------
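    Start the exporters for the cloud databases:
    docker run -d \
      -p 9104:9104 \
      -e DATA_SOURCE_NAME="user:password@(url)/" \
      prom/mysqld-exporter

    docker run -d \
      -p 9121:9121 \
      -e REDIS_ADDR="redis://url:port" \
      -e REDIS_PASSWORD="password" \
      oliver006/redis_exporter
    For multiple instances, just change the published host port and the connection details for each container.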


    ----- Alert rule configuration:

    rule_node.yml

    groups:
        - name: Host status monitoring alerts
          rules:
          - alert: Host status
            expr: up == 0
            for: 1m
            labels:
              status: critical
            annotations:
              summary: "{{$labels.instance}}: server is down"
              description: "{{$labels.instance}}: target has been unreachable for more than 1 minute"
          
          - alert: CPU usage
            expr: 100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
            for: 1m
            labels:
              status: warning
            annotations:
              summary: "{{$labels.instance}} CPU usage is too high!"
              description: "{{$labels.instance}} CPU usage is above 60% (current: {{$value}}%)"
      
          - alert: Memory usage
            expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
            for: 1m
            labels:
              status: severe
            annotations:
              summary: "{{$labels.instance}} memory usage is too high!"
              description: "{{$labels.instance}} memory usage is above 80% (current: {{$value}}%)"
          - alert: IO performance
            expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100 > 60
            for: 1m
            labels:
              status: severe
            annotations:
              summary: "{{$labels.instance}} disk IO utilization is too high!"
              description: "{{$labels.instance}} disk IO utilization is above 60% (current: {{$value}}%)"
     
          - alert: Network receive
            expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
            for: 1m
            labels:
              status: severe
            annotations:
              summary: "{{$labels.instance}} inbound network bandwidth is too high!"
              description: "{{$labels.instance}} inbound traffic has stayed above the rule threshold (10,240,000 bytes/s) for over 1 minute. Current RX value (bytes/s divided by 100): {{$value}}"
     
          - alert: Network transmit
            expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
            for: 1m
            labels:
              status: severe
            annotations:
              summary: "{{$labels.instance}} outbound network bandwidth is too high!"
              description: "{{$labels.instance}} outbound traffic has stayed above the rule threshold (10,240,000 bytes/s) for over 1 minute. Current TX value (bytes/s divided by 100): {{$value}}"
          
          - alert: TCP sessions
            expr: node_netstat_Tcp_CurrEstab > 1000
            for: 1m
            labels:
              status: severe
            annotations:
              summary: "{{$labels.instance}} has too many ESTABLISHED TCP connections!"
              description: "{{$labels.instance}} has more than 1000 ESTABLISHED TCP connections (current: {{$value}})"
     
          - alert: Disk capacity
            expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80
            for: 1m
            labels:
              status: severe
            annotations:
              summary: "{{$labels.mountpoint}} disk partition usage is too high!"
              description: "{{$labels.mountpoint }} disk partition usage is above 80% (current: {{$value}}%)"
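Prometheus anchors label regexes at both ends, so a `device!~'...'` filter in the network rules only excludes a device when the whole name matches. A pattern such as `virbr*` therefore means "virb followed by any number of r" and does not cover `virbr0`. The Python sketch below uses `re.fullmatch` as a stand-in for that anchoring; the helper name and the two pattern constants are hypothetical.

```python
import re


def is_excluded(device, pattern):
    """Return True if the whole device name matches the exclusion pattern."""
    return re.fullmatch(pattern, device) is not None


# `virbr*` means "virb" plus zero or more "r"; `lo*` means "l" plus zero or more "o".
LOOSE = r"tap.*|veth.*|br.*|docker.*|virbr*|lo*"
# `virbr.*` means "virbr" plus anything; `lo` matches the loopback device only.
STRICT = r"tap.*|veth.*|br.*|docker.*|virbr.*|lo"

for device in ["eth0", "lo", "docker0", "virbr0"]:
    print(device, is_excluded(device, LOOSE), is_excluded(device, STRICT))
```

With the loose pattern, `virbr0` is not excluded and its traffic is counted in the per-instance sum; the strict pattern excludes it.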


    rule_mysql.yml

    groups:
    - name: MySQLStatsAlert
      rules:
      - alert: MySQL is down
        expr: up{job="mysql_discovery"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} MySQL is down"
          description: "MySQL database is down. This requires immediate action!"
      - alert: Read buffer size is bigger than max. allowed packet size
        expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet 
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
          description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet). This can break your replication."
      - alert: Sort buffer possibly misconfigured
        expr: mysql_global_variables_innodb_sort_buffer_size <256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024 
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
          description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
      - alert: Thread stack size is too small
        expr: mysql_global_variables_thread_stack <196608
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Thread stack size is too small"
          description: "Thread stack size is too small. This can cause problems, for example when you use stored-program constructs. A typical value for thread_stack is 256k."
      - alert: Used more than 80% of max connections limit
        expr: mysql_global_status_threads_connected > mysql_global_variables_max_connections * 0.8
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limit"
          description: "Used more than 80% of max connections limit"
      - alert: InnoDB Force Recovery is enabled
        expr: mysql_global_variables_innodb_force_recovery != 0 
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
          description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
      - alert: InnoDB Log File size is too small
        expr: mysql_global_variables_innodb_log_file_size < 16777216 
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
          description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
      - alert: Table definition cache too small
        expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Table definition cache too small"
          description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
      - alert: Thread stack size is possibly too small
        expr: mysql_global_variables_thread_stack < 262144
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
          description: "Thread stack size is possibly too small. This can cause problems, for example when you use stored-program constructs. A typical value for thread_stack is 256k."
      - alert: InnoDB Plugin is enabled
        expr: mysql_global_variables_ignore_builtin_innodb == 1
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
          description: "InnoDB Plugin is enabled"
      - alert: Binary Log is disabled
        expr: mysql_global_variables_log_bin != 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Binary Log is disabled"
          description: "Binary Log is disabled. This prevents Point-in-Time Recovery (PiTR)."
      - alert: IO thread stopped
        expr: mysql_slave_status_slave_io_running != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} IO thread stopped"
          description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
      - alert: SQL thread stopped 
        expr: mysql_slave_status_slave_sql_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} SQL thread stopped"
          description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
      - alert: SQL thread stopped
        expr: mysql_slave_status_slave_sql_running != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} SQL thread stopped"
          description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
      - alert: Slave lagging behind Master
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 1m
        labels:
          severity: warning 
        annotations:
          summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
          description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"


      - alert: Instance has slow logs
        expr: irate(mysql_global_status_slow_queries[5m]) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} has slow log"
          description: "slow log"

    rule_redis.yml:

    groups:
    - name: RedisStatsAlert
      rules:
      - alert: redis is down
        expr: up{job="redis_discovery"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} Redis is down"
          description: "Redis is down. This requires immediate action!"
      - alert: redis memory alert
        expr: 100 * (redis_memory_used_bytes{instance !~ "pro-sas|pro-redis-dun"}  / redis_config_maxmemory{instance !~ "pro-sas|pro-redis-dun"} ) > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} memory used over 90%"
          description: "redis memory short"

    Closing notes:

    Monitoring k8s with Prometheus:

    http://note.youdao.com/noteshare?id=dbbc868d32835cab5eac5a455df243ed

    Prometheus alerts to DingTalk:

    https://github.com/timonwong/prometheus-webhook-dingtalk


  • Original article: https://www.cnblogs.com/wenyule/p/13650887.html