  • Host monitoring rules

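    1. In the Prometheus installation directory, create a rules directory, then create a host.yml file inside it with the content shown further down.
    The rule set is long; adjust it to match your environment.
    Rule reference: https://awesome-prometheus-alerts.grep.to/rules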
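
    As one illustration of such an adjustment: the disk-space rule in this post filters on mountpoint="/rootfs", which typically only appears when node_exporter runs in a container with the host root mounted at /rootfs. The sketch below assumes node_exporter runs directly on the host and reports the root filesystem as mountpoint="/". That is an assumption about your deployment, so check the mountpoint label that node_filesystem_size_bytes actually returns before adopting it.

    ```yaml
          # Sketch only: mountpoint="/" is an assumption, not taken from the original rule set.
          - alert: HostOutOfDiskSpace
            expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Host out of disk space (instance {{ $labels.instance }})"
              description: "Disk is almost full (< 10% left) VALUE = {{ $value }} LABELS: {{ $labels }}"
    ```

    The same substitution would apply to the HostOutOfInodes rule, which uses the same mountpoint filter.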

    Some rules from the reference site need to be modified before use, for example:

      - alert: HostNetworkReceiveErrors
        expr: increase(node_network_receive_errs_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host Network Receive Errors (instance {{ $labels.instance }})"
          description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes. VALUE = {{ $value }} LABELS: {{ $labels }}"
    
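    Before using this rule, change the outermost double quotes of description to single quotes. The inner `"%.0f"` already uses double quotes, so the YAML string would otherwise end at the first inner quote and the file would fail to parse. The corrected value: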
    

    '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes. VALUE = {{ $value }} LABELS: {{ $labels }}'
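
    Single quotes are not the only way to avoid the clash. As a rough sketch, the fragment below shows three YAML quoting styles that should all produce the same string. The annotation names description_single, description_escaped and description_folded are hypothetical, used here only so the three variants can sit side by side; in a real rule you would keep a single description key.

    ```yaml
    annotations:
      # Style 1 (used in this post): single quotes outside, double quotes inside.
      description_single: '{{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors'
      # Style 2: keep the outer double quotes and backslash-escape the inner ones.
      description_escaped: "{{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors"
      # Style 3: folded block scalar; no outer quotes, line breaks fold into spaces.
      description_folded: >-
        {{ $labels.device }} has encountered
        {{ printf "%.0f" $value }} receive errors
    ```

    Style 3 tends to be the easiest to read for long descriptions, because no quote characters need escaping at all.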

    
    

    Note: make sure the directory and its files are owned by the user that runs Prometheus (prometheus here): chown -R prometheus:prometheus rules

    ```yaml
    groups:
    - name: Host and hardware
      rules:
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 10% left) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostMemoryUnderMemoryPressure
        expr: rate(node_vmstat_pgmajfault[1m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host memory under memory pressure (instance {{ $labels.instance }})"
          description: "The node is under heavy memory pressure. High rate of major page faults VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostUnusualNetworkThroughputIn
        expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host unusual network throughput in (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably receiving too much data (> 100 MB/s) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostUnusualNetworkThroughputOut
        expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host unusual network throughput out (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably sending too much data (> 100 MB/s) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostUnusualDiskReadRate
        expr: sum by (instance) (irate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host unusual disk read rate (instance {{ $labels.instance }})"
          description: "Disk is probably reading too much data (> 50 MB/s) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostUnusualDiskWriteRate
        expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host unusual disk write rate (instance {{ $labels.instance }})"
          description: "Disk is probably writing too much data (> 50 MB/s) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/rootfs"}  * 100) / node_filesystem_size_bytes{mountpoint="/rootfs"} < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 10% left) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostDiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host disk will fill in 4 hours (instance {{ $labels.instance }})"
          description: "Disk will fill in 4 hours at current write rate VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostOutOfInodes
        expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint ="/rootfs"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of inodes (instance {{ $labels.instance }})"
          description: "Disk is almost running out of available inodes (< 10% left) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostUnusualDiskReadLatency
        # The ratio is in seconds per read operation, so 100 ms is 0.1
        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host unusual disk read latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (read operations > 100ms) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostUnusualDiskWriteLatency
        # The ratio is in seconds per write operation, so 100 ms is 0.1
        expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host unusual disk write latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (write operations > 100ms) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host high CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80% VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostContextSwitching
        expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host context switching (instance {{ $labels.instance }})"
          description: "Context switching is growing on node (> 1000 / s) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostSwapIsFillingUp
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host swap is filling up (instance {{ $labels.instance }})"
          description: "Swap is filling up (>80%) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostSystemdServiceCrashed
        expr: node_systemd_unit_state{state="failed"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host SystemD service crashed (instance {{ $labels.instance }})"
          description: "SystemD service crashed VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostPhysicalComponentTooHot
        expr: node_hwmon_temp_celsius > 75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host physical component too hot (instance {{ $labels.instance }})"
          description: "Physical hardware component too hot VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostNodeOvertemperatureAlarm
        expr: node_hwmon_temp_alarm == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host node overtemperature alarm (instance {{ $labels.instance }})"
          description: "Physical node temperature alarm triggered VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostRaidArrayGotInactive
        expr: node_md_state{state="inactive"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host RAID array got inactive (instance {{ $labels.instance }})"
          description: "RAID array {{ $labels.device }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically. VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostRaidDiskFailure
        expr: node_md_disks{state="fail"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host RAID disk failure (instance {{ $labels.instance }})"
          description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostKernelVersionDeviations
        expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host kernel version deviations (instance {{ $labels.instance }})"
          description: "Different kernel versions are running VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostOomKillDetected
        expr: increase(node_vmstat_oom_kill[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host OOM kill detected (instance {{ $labels.instance }})"
          description: "OOM kill detected VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: HostEdacCorrectableErrorsDetected
        expr: increase(node_edac_correctable_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Host EDAC Correctable Errors detected (instance {{ $labels.instance }})"
          description: '{{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes. VALUE = {{ $value }} LABELS: {{ $labels }}'
      - alert: HostEdacUncorrectableErrorsDetected
        expr: node_edac_uncorrectable_errors_total > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})"
          description: '{{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes. VALUE = {{ $value }} LABELS: {{ $labels }}'
      - alert: HostNetworkReceiveErrors
        expr: increase(node_network_receive_errs_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host Network Receive Errors (instance {{ $labels.instance }})"
          description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes. VALUE = {{ $value }} LABELS: {{ $labels }}'
      - alert: HostNetworkTransmitErrors
        expr: increase(node_network_transmit_errs_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host Network Transmit Errors (instance {{ $labels.instance }})"
          description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes. VALUE = {{ $value }} LABELS: {{ $labels }}'
      - alert: JvmMemoryFillingUp
        expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM memory filling up (instance {{ $labels.instance }})"
          description: "JVM memory is filling up (> 80%) VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: SpeedtestSlowInternetDownload
        expr: avg_over_time(speedtest_download[30m]) < 75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SpeedTest Slow Internet Download (instance {{ $labels.instance }})"
          description: "Internet download speed is currently {{humanize $value}} Mbps. VALUE = {{ $value }} LABELS: {{ $labels }}"
      - alert: SpeedtestSlowInternetUpload
        expr: avg_over_time(speedtest_upload[30m]) < 20 
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SpeedTest Slow Internet Upload (instance {{ $labels.instance }})"
          description: "Internet upload speed is currently {{humanize $value}} Mbps. VALUE = {{ $value }} LABELS: {{ $labels }}"
    ```

    2. Edit prometheus.yml in the Prometheus installation directory so that it references the alert rules above:

    rule_files:
      - "rules/*.yml"
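
    For context, here is a minimal sketch of where rule_files might sit in a complete prometheus.yml. The evaluation_interval value and the Alertmanager address localhost:9093 are assumptions for illustration, not settings from the original post. Without an alerting section, Prometheus still evaluates the rules and shows firing alerts in its web UI, but it does not forward them anywhere.

    ```yaml
    # prometheus.yml (sketch; values marked "assumed" are not from the original post)
    global:
      evaluation_interval: 1m        # assumed: how often the rules are evaluated

    rule_files:
      - "rules/*.yml"                # relative paths resolve against the directory of prometheus.yml

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - "localhost:9093"   # assumed Alertmanager address
    ```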
    

    3. Restart Prometheus so that the new rules are loaded.

  • Original source: https://www.cnblogs.com/sanduzxcvbnm/p/13589848.html