  • Horovod in Docker

    https://horovod.readthedocs.io/en/stable/docker.html

    Step 1: Build the image

    GPU

    $ mkdir horovod-docker-gpu
    $ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
    $ docker build -t horovod:latest horovod-docker-gpu
    

    CPU

    $ mkdir horovod-docker-cpu
    $ wget -O horovod-docker-cpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
    $ docker build -t horovod:latest horovod-docker-cpu
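
The GPU and CPU builds differ only in the Dockerfile suffix and the directory name, so both can be driven from one helper. A minimal sketch, assuming a hypothetical function name `horovod_build_cmds`; it only prints the commands (a dry run), and the URL pattern is copied from the commands above rather than re-verified here.

```shell
# Hypothetical helper (not part of Horovod): print the build steps for one variant.
# Usage: horovod_build_cmds gpu|cpu
horovod_build_cmds() {
    variant="$1"
    case "$variant" in
        gpu|cpu) ;;
        *) echo "unknown variant: $variant (expected gpu or cpu)" >&2; return 1 ;;
    esac
    dir="horovod-docker-$variant"
    # URL pattern taken from the commands in this post.
    url="https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.$variant"
    echo "mkdir -p $dir"
    echo "wget -O $dir/Dockerfile $url"
    echo "docker build -t horovod:latest $dir"
}

# Dry run: show the three steps for the CPU image.
horovod_build_cmds cpu
```

To actually run the steps, the output could be piped into `sh`. Both variants are tagged `horovod:latest`, as in the original commands, so building one replaces the other.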
    

    Running on a single machine

    On a machine with GPUs, you can use nvidia-docker.

    $ nvidia-docker run -it horovod:latest
    root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py
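
In the command above, `-np 4` and `-H localhost:4` both say "four processes on this host", which usually means one process per GPU. A small sketch of deriving that number instead of hard-coding it; the function names `count_gpus` and `single_host_cmd` are hypothetical, and the command is only printed, not executed.

```shell
# Count visible NVIDIA GPUs; prints 0 when nvidia-smi is missing or lists none.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # `nvidia-smi -L` prints one "GPU <n>: ..." line per device.
        nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
    else
        echo 0
    fi
}

# Print a single-host horovodrun command with matching -np and slot count.
single_host_cmd() {
    n="$1"
    script="$2"
    echo "horovodrun -np $n -H localhost:$n python $script"
}

# Reproduces the command from the post for a 4-GPU machine.
single_host_cmd 4 keras_mnist_advanced.py
```

Inside the container this could be called as `single_host_cmd "$(count_gpus)" keras_mnist_advanced.py`, after checking that the count is greater than zero.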
    

    Running on multiple machines

    (1) Prerequisite for multi-machine runs: passwordless SSH login

    http://www.linuxproblem.org/art_9.html

    1. First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:
    a@A:~> ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/a/.ssh/id_rsa): 
    Created directory '/home/a/.ssh'.
    Enter passphrase (empty for no passphrase): 
    Enter same passphrase again: 
    Your identification has been saved in /home/a/.ssh/id_rsa.
    Your public key has been saved in /home/a/.ssh/id_rsa.pub.
    The key fingerprint is:
    3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A
    
    2. Now use ssh to create a directory ~/.ssh as user b on B. (The directory may already exist, which is fine):
    a@A:~> ssh b@B mkdir -p .ssh
    b@B's password: 
    
    3. Finally append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last time:
    a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
    b@B's password: 
    
    4. From now on you can log into B as b from A as a without password:
    a@A:~> ssh b@B
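
Two details of the steps above are easy to get wrong: `cat >>` appends a duplicate entry every time it is repeated, and with OpenSSH's default `StrictModes` setting sshd ignores an `authorized_keys` file whose permissions are too open. The sketch below shows the receiving side as an idempotent function; the name `install_pubkey` and the placeholder key are assumptions for illustration, and the demo writes only into a temporary directory.

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# only if it is not already present, and tighten permissions.
# Usage: install_pubkey <pubkey_file> <authorized_keys_file>
install_pubkey() {
    pubkey_file="$1"
    auth_file="$2"
    auth_dir="$(dirname "$auth_file")"
    mkdir -p "$auth_dir" && chmod 700 "$auth_dir"
    touch "$auth_file" && chmod 600 "$auth_file"
    key="$(cat "$pubkey_file")"
    # -x: whole-line match, -F: fixed string, so the key is compared literally.
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}

# Demo in a scratch directory with a placeholder (not a real) key.
demo_dir="$(mktemp -d)"
printf 'ssh-rsa AAAAB3placeholder a@A\n' > "$demo_dir/id_rsa.pub"
install_pubkey "$demo_dir/id_rsa.pub" "$demo_dir/ssh/authorized_keys"
install_pubkey "$demo_dir/id_rsa.pub" "$demo_dir/ssh/authorized_keys"  # second call adds nothing
cat "$demo_dir/ssh/authorized_keys"
rm -rf "$demo_dir"
```

Where it is installed, `ssh-copy-id b@B` performs steps 2 and 3 in one command.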
    

    (2) Primary worker

    host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
    root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
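
In the command above, `-np 16` is the sum of the slots in the `-H` list (4 hosts with 4 slots each), and the two are easy to let drift apart when hosts are added. A minimal sketch that derives both from one host list, assuming every host has the same number of slots; `build_host_list` and `total_procs` are hypothetical helper names.

```shell
# Build the -H argument: "host1:4,host2:4,..." for a fixed slot count per host.
# Usage: build_host_list <slots_per_host> <host>...
build_host_list() {
    slots="$1"; shift
    list=""
    for h in "$@"; do
        # ${list:+$list,} adds a comma separator only after the first entry.
        list="${list:+$list,}$h:$slots"
    done
    printf '%s\n' "$list"
}

# Total process count for -np: slots per host times number of hosts.
# Usage: total_procs <slots_per_host> <host>...
total_procs() {
    slots="$1"; shift
    echo $((slots * $#))
}

# Prints the horovodrun command from the post (dry run only).
echo "horovodrun -np $(total_procs 4 host1 host2 host3 host4)" \
     "-H $(build_host_list 4 host1 host2 host3 host4)" \
     "-p 12345 python keras_mnist_advanced.py"
```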
    

    (3) Secondary workers:

    host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
    
    host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
    
    host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
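
Each secondary container only runs sshd on port 12345 and waits, so the primary worker must be able to reach all of them over SSH before `horovodrun` is started. The following is a sketch of such a pre-flight check, not something the original post or Horovod provides: `check_workers` and the `SSH_CMD` override are hypothetical names, while `-p`, `BatchMode`, and `ConnectTimeout` are standard OpenSSH client options.

```shell
# Hypothetical pre-flight check: try a non-interactive SSH login to each worker.
# Usage: check_workers <port> <host>...
# Set SSH_CMD to substitute another client (for example a stub in tests).
check_workers() {
    port="$1"; shift
    failed=0
    for h in "$@"; do
        # BatchMode=yes fails instead of prompting for a password;
        # ConnectTimeout=5 avoids hanging on an unreachable host.
        if ${SSH_CMD:-ssh} -p "$port" -o BatchMode=yes -o ConnectTimeout=5 "$h" true; then
            echo "ok   $h"
        else
            echo "FAIL $h"
            failed=1
        fi
    done
    return "$failed"
}
```

From inside the primary container this would be run as `check_workers 12345 host2 host3 host4`; a non-zero exit status means at least one worker is not reachable yet.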
    

    Adding RDMA (remote direct memory access) support

    $ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
    root@c278c88dd552:/examples# ...
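
The two extra flags above only make sense on a host that actually exposes InfiniBand devices. A minimal sketch that adds them conditionally; `rdma_flags` is a hypothetical helper, and the optional path argument exists only so the function can be exercised on machines without the hardware.

```shell
# Hypothetical helper: print the extra docker flags for RDMA if the device
# directory exists, otherwise print nothing.
# Usage: rdma_flags [device_dir]   (default: /dev/infiniband)
rdma_flags() {
    dev="${1:-/dev/infiniband}"
    if [ -e "$dev" ]; then
        # IPC_LOCK lets the container lock memory; --device exposes the devices.
        echo "--cap-add=IPC_LOCK --device=$dev"
    fi
}

# Dry run: print the run command with or without the RDMA flags.
echo "nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh $(rdma_flags) horovod:latest"
```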
    
  • Original article: https://www.cnblogs.com/shix0909/p/13391019.html