  • Horovod-Usage

    Usage

    A Horovod training script needs the following six steps:

    1. Initialize Horovod by calling hvd.init()
    hvd.init()
    
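    2. Pin each GPU to a single process to avoid resource contention.
      Each TensorFlow process gets exactly one GPU, selected by its local rank: the first process on a host is assigned the first GPU, the second process the second GPU, and so on.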
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    
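    3. Scale the learning rate by the number of workers, because the effective batch size in synchronous training grows with hvd.size()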
    loss = ...
    opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
    
    4. Wrap each regular TensorFlow optimizer in the Horovod optimizer, which averages gradients across workers using ring-allreduce
    opt = hvd.DistributedOptimizer(opt)
    
    5. Broadcast the variables from the first process (rank 0) to all other processes so that every worker starts from the same initial values
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]
    
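    6. Save checkpoints only on worker 0, so that other workers cannot corrupt them
    checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
    with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                           config=config,
                                           hooks=hooks) as mon_sess:
      while not mon_sess.should_stop():
        mon_sess.run(train_op)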
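
The terms used in the steps above (size, rank, local rank) can be illustrated without Horovod installed. The following is a minimal pure-Python sketch, not Horovod code: the `assign_ranks` helper, the host names, and the slot counts are hypothetical, and it assumes processes are numbered host by host, which is a common layout but is not guaranteed by the steps above.

```python
def assign_ranks(slots_per_host):
    """Return (rank, host, local_rank) for every process.

    Assumption for illustration: ranks are handed out host by host,
    and local_rank restarts at 0 on each host.
    """
    placements = []
    rank = 0
    for host, slots in slots_per_host:
        for local_rank in range(slots):
            placements.append((rank, host, local_rank))
            rank += 1
    return placements


# Hypothetical cluster: two hosts with two GPUs (slots) each.
placements = assign_ranks([("host-a", 2), ("host-b", 2)])
size = len(placements)            # plays the role of hvd.size()

base_lr = 0.01
scaled_lr = base_lr * size        # step 3: learning rate scaled by worker count

for rank, host, local_rank in placements:
    visible_device = str(local_rank)    # step 2: GPU index visible to this process
    saves_checkpoints = (rank == 0)     # step 6: only rank 0 writes checkpoints
    print(f"rank={rank} host={host} gpu={visible_device} "
          f"lr={scaled_lr} checkpoints={saves_checkpoints}")
```

Note that the GPU is chosen by the local rank rather than the global rank: a process with global rank 2 on the second host still has to use that host's GPU 0.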
    
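    Complete example (TensorFlow 1.x API; under TensorFlow 2.x the same names are available in tf.compat.v1):

    import tensorflow as tf
    import horovod.tensorflow as hvd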
    
    
    # Initialize Horovod
    hvd.init()
    
    # Pin GPU to be used to process local rank (one GPU per process)
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    
    # Build model...
    loss = ...
    opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
    
    # Add Horovod Distributed Optimizer
    opt = hvd.DistributedOptimizer(opt)
    
    # Add hook to broadcast variables from rank 0 to all other processes during
    # initialization.
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]
    
    # Make training operation
    train_op = opt.minimize(loss)
    
    # Save checkpoints only on worker 0 to prevent other workers from corrupting them.
    checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
    
    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                           config=config,
                                           hooks=hooks) as mon_sess:
      while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
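
To make the gradient averaging done by the distributed optimizer more concrete, here is a small single-process simulation of ring-allreduce in plain Python. It is a sketch of the general algorithm only, not Horovod's implementation (Horovod delegates the communication to libraries such as MPI or NCCL); the function name `ring_allreduce_average` and the sample gradients are made up for illustration.

```python
def ring_allreduce_average(worker_grads):
    """Simulate ring-allreduce averaging over equal-length gradient lists.

    worker_grads[i] is the gradient vector held by worker i. Every worker
    ends up with the element-wise average of all vectors.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split each vector into n contiguous chunks (some may be empty).
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]
    bufs = [list(g) for g in worker_grads]

    # Phase 1, reduce-scatter: after n-1 steps worker i holds the full
    # sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Collect all outgoing messages first so sends happen "simultaneously".
        msgs = []
        for i in range(n):
            c = (i - step) % n
            lo, hi = bounds[c]
            msgs.append((c, bufs[i][lo:hi]))
        for i in range(n):
            dst = (i + 1) % n          # each worker sends to its right neighbour
            c, data = msgs[i]
            lo, _ = bounds[c]
            for k, v in enumerate(data):
                bufs[dst][lo + k] += v

    # Phase 2, allgather: pass the completed chunks around the ring.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            c = (i + 1 - step) % n
            lo, hi = bounds[c]
            msgs.append((c, bufs[i][lo:hi]))
        for i in range(n):
            dst = (i + 1) % n
            c, data = msgs[i]
            lo, hi = bounds[c]
            bufs[dst][lo:hi] = data

    # Turn the sums into averages.
    return [[v / n for v in buf] for buf in bufs]


# Three simulated workers, each with its own local gradient.
grads = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0, 8.0],
]
averaged = ring_allreduce_average(grads)
for rank, g in enumerate(averaged):
    print(f"worker {rank}: {g}")
```

In each step every worker sends only one chunk to its neighbour, so the amount of data each worker transmits stays roughly constant as workers are added, which is the usual motivation for the ring layout.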
    
  • Original article: https://www.cnblogs.com/shix0909/p/13391003.html