  • Monitoring GPUs with Prometheus and building a custom Grafana dashboard template

    ###Monitoring the GPU

    Reference: https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm
    What I actually ran:
    docker run --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter
    
    The following steps are needed before Docker can start this container:
    # Add the package repositories
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | 
      sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | 
      sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update
    
    # Install nvidia-docker2 and reload the Docker daemon configuration
    sudo apt-get install -y nvidia-docker2
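    sudo systemctl daemon-reload      # re-read the systemd unit files
    sudo systemctl restart docker     # restart the Docker service
    sudo pkill -SIGHUP dockerd        # signal dockerd to reload its configuration
    Commands to run as a preliminary check: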
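
After installing nvidia-docker2 and restarting Docker, it can help to confirm that the `nvidia` runtime is registered before using `--runtime=nvidia`. The following is a minimal sketch, assuming `docker info` prints a `Runtimes:` line listing the registered runtimes; `has_nvidia_runtime` is a hypothetical helper name, not part of any tool.

```shell
# has_nvidia_runtime (hypothetical helper): reads `docker info` text on stdin
# and succeeds only if the "Runtimes:" line mentions the nvidia runtime.
has_nvidia_runtime() {
  grep -i 'Runtimes:' | grep -qw 'nvidia'
}

# Usage on a real host (not executed here):
#   docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
```

If the check fails, the `docker run --runtime=nvidia ...` commands in this post are likely to fail as well.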
    $ docker run --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter
    # The output of dcgmi discovery and nvidia-smi should be same.
    $ docker exec nvidia-dcgm-exporter dcgmi discovery -i a -v | grep -c 'GPU ID:'
    $ nvidia-smi -L | wc -l
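
The two counts above are meant to be compared by eye. As a rough sketch, the comparison could be scripted as follows; `compare_gpu_counts` and `check_live_gpu_counts` are hypothetical helper names, and the container name `nvidia-dcgm-exporter` is taken from the commands above.

```shell
# compare_gpu_counts (hypothetical helper): takes the dcgmi count and the
# nvidia-smi count, prints a verdict, and returns non-zero on a mismatch.
compare_gpu_counts() {
  local dcgm_count="$1" smi_count="$2"
  if [ "$dcgm_count" -eq "$smi_count" ]; then
    echo "OK: both tools report $dcgm_count GPU(s)"
  else
    echo "MISMATCH: dcgmi reports $dcgm_count, nvidia-smi reports $smi_count"
    return 1
  fi
}

# check_live_gpu_counts (hypothetical helper): gathers both counts from a
# running exporter container and the host, then compares them.
check_live_gpu_counts() {
  local dcgm_count smi_count
  dcgm_count=$(docker exec nvidia-dcgm-exporter dcgmi discovery -i a -v | grep -c 'GPU ID:')
  smi_count=$(nvidia-smi -L | wc -l)
  compare_gpu_counts "$dcgm_count" "$smi_count"
}
```

Calling `check_live_gpu_counts` only makes sense on a host where the exporter container is already running.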
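    # The repository below describes the ways to inspect the GPU data
    url:https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm
    # Here I mapped the data from the container to the local host
    mkdir -p /usr/local/prometheus   # create a directory on the host to hold the collected GPU values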
    

    ###Using the local node_exporter on port 9100

    docker tag nvidia/dcgm-exporter nvidia-dcgm-exporter
    docker run -d --runtime=nvidia --rm --name=nvidia-dcgm-exporter -v /run/prometheus:/run/prometheus nvidia-dcgm-exporter
    Or:
    docker run -d --rm --cap-add=sys_admin --runtime=nvidia --name=nvidia-dcgm-exporter -v /run/prometheus:/run/prometheus nvidia-dcgm-exporter -p
    docker run -d --rm --net="host" --pid="host"  quay.io/prometheus/node-exporter --collector.textfile.directory="/run/prometheus"
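    Check whether the Prometheus instance on port 9090 now exposes the dcgm metrics.
    If it does, GPU monitoring is working. The next step is to find a Grafana GPU template.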
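
One way to do this check from the command line is to ask Prometheus for its list of metric names and keep only those starting with `dcgm_`. This is a minimal sketch, assuming Prometheus is reachable at `http://localhost:9090` and serves the standard HTTP API endpoint `/api/v1/label/__name__/values`; both function names are hypothetical.

```shell
# extract_dcgm_metrics (hypothetical helper): reads the JSON response on stdin
# and prints every metric name that starts with dcgm_, one per line, sorted.
extract_dcgm_metrics() {
  grep -o '"dcgm_[A-Za-z0-9_]*"' | tr -d '"' | sort -u
}

# list_dcgm_metrics (hypothetical helper): queries a Prometheus server for all
# metric names. The default URL is an assumption; pass another one as $1.
list_dcgm_metrics() {
  local prom_url="${1:-http://localhost:9090}"
  curl -s "${prom_url}/api/v1/label/__name__/values" | extract_dcgm_metrics
}
```

Running `list_dcgm_metrics` with no output suggests that Prometheus is not scraping node_exporter, or that the exporter is not writing into the textfile directory.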
    

    ###Custom Grafana template

    The exporter provides the following metrics, which can be used to build the panels:
    dcgm_board_limit_violation
    dcgm_dec_utilization
    dcgm_enc_utilization
    dcgm_fb_free
    dcgm_fb_used
    dcgm_gpu_temp
    dcgm_gpu_utilization
    dcgm_low_util_violation
    dcgm_mem_copy_utilization
    dcgm_memory_clock
    dcgm_pcie_replay_counter
    dcgm_pcie_rx_throughput
    dcgm_pcie_tx_throughput
    dcgm_power_usage
    dcgm_power_violation
    dcgm_reliability_violation
    dcgm_sm_clock
    dcgm_sync_boost_violation
    dcgm_thermal_violation
    dcgm_total_energy_consumption
    dcgm_xid_errors
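
As a starting point for a custom template, a few of these metrics can be wrapped into a small dashboard JSON document and imported into Grafana. This is only a sketch under stated assumptions: the data source is assumed to be named `Prometheus`, the metrics are assumed to carry a `gpu` label (used in the legend), and layout fields such as `gridPos` are left out. `panel_json` and `dashboard_json` are hypothetical helper names.

```shell
# panel_json (hypothetical helper): prints one Grafana graph panel as JSON.
# Arguments: panel id, panel title, PromQL expression.
panel_json() {
  local id="$1" title="$2" expr="$3"
  # "Prometheus" as datasource name and the {{gpu}} label are assumptions.
  cat <<EOF
{"id": ${id}, "type": "graph", "title": "${title}", "datasource": "Prometheus",
 "targets": [{"expr": "${expr}", "legendFormat": "GPU {{gpu}}"}]}
EOF
}

# dashboard_json (hypothetical helper): wraps three panels into a minimal
# dashboard document using metric names from the list above.
dashboard_json() {
  cat <<EOF
{"title": "GPU overview (sketch)",
 "panels": [
$(panel_json 1 "GPU temperature" "dcgm_gpu_temp"),
$(panel_json 2 "GPU utilization" "dcgm_gpu_utilization"),
$(panel_json 3 "Framebuffer memory used" "dcgm_fb_used")
 ]}
EOF
}

# Usage (not executed here): dashboard_json > gpu-dashboard.json
```

The generated file can then be adjusted in the Grafana editor, where panel sizes, units, and thresholds are easier to tune than by hand.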

  • Original article: https://www.cnblogs.com/sxgaofeng/p/12036229.html