  • Horovod in Docker

    https://horovod.readthedocs.io/en/stable/docker.html

    Step 1: Build the image

    GPU

    $ mkdir horovod-docker-gpu
    $ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
    $ docker build -t horovod:latest horovod-docker-gpu
    

    CPU

    $ mkdir horovod-docker-cpu
    $ wget -O horovod-docker-cpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
    $ docker build -t horovod:latest horovod-docker-cpu
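
The GPU and CPU steps differ only in the variant name, so a single parametrized helper avoids copy-paste slips between them. The following is a minimal sketch, not part of Horovod: the function name `horovod_build_cmds` is hypothetical, and it assumes the Dockerfile URLs shown above are still reachable. It only prints the commands, so they can be reviewed before anything is run.

```shell
# Hypothetical helper (not part of Horovod): print the three build commands
# for one image variant, either "gpu" or "cpu".
horovod_build_cmds() {
    variant=$1
    case "$variant" in
        gpu|cpu) ;;
        *) echo "usage: horovod_build_cmds gpu|cpu" >&2; return 1 ;;
    esac
    # Directory and Dockerfile name both follow the variant.
    dir="horovod-docker-$variant"
    url="https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.$variant"
    echo "mkdir -p $dir"
    echo "wget -O $dir/Dockerfile $url"
    echo "docker build -t horovod:latest $dir"
}
```

To actually build, the printed commands can be piped to a shell, for example `horovod_build_cmds cpu | sh`.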
    

    Running on a single machine

    On a machine with GPUs, you can use nvidia-docker.

    $ nvidia-docker run -it horovod:latest
    root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py
    

    Running on multiple machines

    (1) Prerequisite for multi-machine runs: passwordless SSH login

    http://www.linuxproblem.org/art_9.html

    1. First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:
    a@A:~> ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/a/.ssh/id_rsa): 
    Created directory '/home/a/.ssh'.
    Enter passphrase (empty for no passphrase): 
    Enter same passphrase again: 
    Your identification has been saved in /home/a/.ssh/id_rsa.
    Your public key has been saved in /home/a/.ssh/id_rsa.pub.
    The key fingerprint is:
    3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A
    
    2. Now use ssh to create a directory ~/.ssh as user b on B. (The directory may already exist, which is fine):
    a@A:~> ssh b@B mkdir -p .ssh
    b@B's password: 
    
    3. Finally append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last time:
    a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
    b@B's password: 
    
    4. From now on you can log into B as b from A as a without password:
    a@A:~> ssh b@B
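
Before launching a multi-machine job, it helps to confirm that the key-based login really works without a prompt, because a password prompt would stall a non-interactive launch. The following is a minimal sketch: the function name `check_passwordless_ssh` is hypothetical, while `BatchMode` and `ConnectTimeout` are standard OpenSSH client options. With `BatchMode=yes`, ssh fails instead of asking for a password.

```shell
# Hypothetical helper: succeed only if key-based login to the target works.
# BatchMode=yes makes ssh fail rather than prompt for a password.
check_passwordless_ssh() {
    target=$1          # for example b@B
    port=${2:-22}      # sshd port; defaults to 22
    if ssh -o BatchMode=yes -o ConnectTimeout=5 -p "$port" "$target" true; then
        echo "OK: $target accepts key-based login"
    else
        echo "FAIL: $target still requires a password or is unreachable" >&2
        return 1
    fi
}
```

For example, `check_passwordless_ssh b@B` tests the login configured in the steps above, and a second argument selects a non-default port.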
    

    (2) Primary worker

    host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
    root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
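
In the command above, `-np 16` is the number of hosts (4) times the slots per host (4), and `-H` lists each host with its slot count. When the host list changes, both values have to change together. The following sketch derives them from one list; the helper names `build_host_arg` and `total_procs` are hypothetical and not part of Horovod, and it assumes every host offers the same number of slots.

```shell
# Hypothetical helpers: derive horovodrun's -H and -np values from a host list.
# Usage: build_host_arg SLOTS HOST...
build_host_arg() {
    slots=$1; shift
    arg=""
    for h in "$@"; do
        # Join entries with commas: host1:4,host2:4,...
        arg="${arg:+$arg,}$h:$slots"
    done
    echo "$arg"
}

# Total process count = slots per host * number of hosts.
# Usage: total_procs SLOTS HOST...
total_procs() {
    slots=$1; shift
    echo $(( $slots * $# ))
}

# Example (run inside the primary worker container):
#   hosts="host1 host2 host3 host4"
#   horovodrun -np "$(total_procs 4 $hosts)" -H "$(build_host_arg 4 $hosts)" \
#       -p 12345 python keras_mnist_advanced.py
```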
    

    (3) Secondary workers:

    host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
    
    host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
    
    host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
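
The primary worker can only reach the secondary workers once their sshd is listening on port 12345, so it can be useful to wait for that port before running `horovodrun`. The following is a minimal sketch: the function name `wait_for_sshd` is hypothetical, and it assumes `nc` (netcat) with the `-z` option is installed wherever the check is run, which may not be the case inside the Horovod image.

```shell
# Hypothetical helper: poll until HOST:PORT accepts TCP connections.
# Assumes a netcat that supports -z (connect and close without sending data).
# Usage: wait_for_sshd HOST PORT [TRIES]
wait_for_sshd() {
    host=$1; port=$2; tries=${3:-30}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if nc -z "$host" "$port" 2>/dev/null; then
            echo "$host:$port is accepting connections"
            return 0
        fi
        i=$((i + 1))
        sleep 1   # wait one second between attempts
    done
    echo "$host:$port did not open after $tries attempts" >&2
    return 1
}
```

For example, `for h in host2 host3 host4; do wait_for_sshd "$h" 12345 || break; done` checks all three secondary workers in turn.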
    

    Adding RDMA (Remote Direct Memory Access) support

    $ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
    root@c278c88dd552:/examples# ...
    
  • Original article: https://www.cnblogs.com/shix0909/p/13391019.html