[Paper Reading][Distributed DNN Training Systems] FireCaffe

    Forrest N. Iandola et al., "FireCaffe: near-linear acceleration of deep neural network training on compute clusters", 2016.1

    Problem statements from data scientists

    4 key pain points summarized by Jeff Dean from Google:

    1. DNN researchers and users want results of experiments quickly.

    2. There is a “patience threshold”: no one wants to wait more than a few days or a week for results.

    3. This significantly affects the scale of problems that can be tackled.

    4. We sometimes optimize for experiment turnaround time, rather than for the absolute minimal system resources needed to perform the experiments.


    Problem analysis

    The speed and scalability of a distributed algorithm are almost always limited by the overhead of communication between servers; DNN training is no exception to this rule.
    So the design focuses on reducing communication overhead, including:

    1. Upgrade to high-throughput interconnects, e.g. InfiniBand.
    2. Decrease the volume of data transmitted during training, which includes:
      a) Carefully balancing between data parallelism and model parallelism.
      b) Increasing the batch size to reduce communication frequency, and identifying hyperparameters suitable for large batch sizes.
      c) Balancing communication volume among nodes to avoid a single point of dependency.

    Key take-aways

    Parallelism Scheme: Model Parallelism or Data Parallelism



    Model parallelism

    Each worker gets a subset of the model parameters, and the workers communicate by exchanging data gradients and exchanging activations. The data quantity per batch is proportional to the activation size:

    $\text{data}_{\text{model parallel}} \propto |\text{activations}| \times \text{batch size}$

    Data parallelism

    Each worker gets a subset of the batch, and then the workers communicate by exchanging weight gradient updates $\nabla W$, where $W$ denotes the model weights. The data quantity per batch is proportional to the model size:

    $\text{data}_{\text{data parallel}} \propto |W|$

    Convolution layers and fully connected layers have very different data/weight ratios, so they can use different parallelism schemes.

    [Figure: data/weight ratios of convolution layers vs. fully connected layers]
    So a basic conclusion is: convolution layers fit data parallelism, and fc layers fit model parallelism.
    Furthermore, for more advanced CNNs such as GoogLeNet and ResNet, which are dominated by convolution layers, we can directly use data parallelism, as this paper does.
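
    To make the data/weight ratio argument concrete, here is a minimal sketch comparing the per-batch communication volume under the two schemes. The layer shapes are illustrative, roughly AlexNet-like approximations, not numbers from the paper:

    ```python
    # Rough per-batch communication model (illustrative numbers, not from the paper).
    # Data parallelism  : workers exchange weight gradients -> volume ~ #weights
    # Model parallelism : workers exchange activations      -> volume ~ #activations * batch_size

    batch_size = 256

    # (name, #weights, #activations per image) -- AlexNet-like orders of magnitude
    layers = [
        ("conv1", 35_000,     290_000),
        ("conv5", 443_000,    43_000),
        ("fc6",   37_750_000, 4_096),
        ("fc7",   16_780_000, 4_096),
    ]

    for name, n_weights, n_acts in layers:
        data_parallel  = n_weights             # gradient exchange
        model_parallel = n_acts * batch_size   # activation exchange
        better = "data" if data_parallel < model_parallel else "model"
        print(f"{name:6s} data-parallel={data_parallel:>12,d}  "
              f"model-parallel={model_parallel:>12,d}  -> prefer {better} parallelism")
    ```

    With these rough sizes, the conv layers (few weights, many activations) come out cheaper under data parallelism, while the fc layers (many weights, few activations) come out cheaper under model parallelism, which is exactly the conclusion above.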

    Gradient Aggregation Scheme: Parameter Server or Reduction Tree

    One picture shows how the parameter server and the reduction tree work in data parallelism:

    [Figure: gradient aggregation schemes, parameter server vs. reduction tree]

    Parameter Server

    Parameter communication time as a function of the number of workers in the parameter server scheme:

    The communication time scales linearly as we increase the number of workers, so a single parameter server becomes the scalability bottleneck.
    Microsoft Adam and Google DistBelief relieve this issue by defining a pool of nodes that collectively behaves as a parameter server. The bigger the parameter server hierarchy gets, the more it looks like a reduction tree.

     

    Reduction Tree

    The idea is the same as allreduce in the message passing model. Parameter communication time as a function of the number of workers in the reduction tree scheme:

    It scales logarithmically with the number of workers.
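
    As a rough illustration of the linear vs. logarithmic scaling, here is a simplified cost model. It assumes every hop transfers the full gradient buffer at a fixed per-link bandwidth; the buffer size and bandwidth are made-up values, not measurements from the paper:

    ```python
    import math

    def param_server_time(n_workers, grad_bytes, link_bw):
        # A single parameter server exchanges the full gradient buffer with each worker
        # over its one link, so communication time grows linearly with the worker count.
        return n_workers * grad_bytes / link_bw

    def reduction_tree_time(n_workers, grad_bytes, link_bw):
        # A binary reduction tree needs log2(n) levels to reduce and log2(n) to broadcast,
        # so communication time grows logarithmically with the worker count.
        return 2 * math.log2(n_workers) * grad_bytes / link_bw

    GRAD_BYTES = 60_000_000 * 4   # ~60M fp32 parameters (AlexNet-sized model, illustrative)
    LINK_BW    = 5e9              # ~5 GB/s effective per-link bandwidth (illustrative)

    for n in (4, 16, 64, 256):
        ps = param_server_time(n, GRAD_BYTES, LINK_BW)
        rt = reduction_tree_time(n, GRAD_BYTES, LINK_BW)
        print(f"workers={n:4d}  param-server={ps:6.2f}s  reduction-tree={rt:6.2f}s")
    ```

    Even in this crude model, the gap between linear and logarithmic scaling becomes large once the cluster grows past a few dozen workers.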

     

    [Figure: parameter communication time vs. number of workers for parameter server and reduction tree]

    Batch size selection

    A larger batch size leads to less frequent communication and therefore enables more scalability in a distributed setting. But with a larger batch size, we need to identify a suitable hyperparameter setting to maintain the speed and accuracy of DNN training.
    The hyperparameters include:

      1. Initial learning rate

      2. Learning rate update scheme

      3. Weight decay

      4. Momentum

    Weight update rule used (here $t$ is the iteration index):

    $\Delta w_{t+1} = \mu \, \Delta w_t - \alpha_t \left( \nabla L(w_t) + \lambda \, w_t \right), \qquad w_{t+1} = w_t + \Delta w_{t+1}$

    where $\mu$ is the momentum, $\alpha_t$ the learning rate at iteration $t$, and $\lambda$ the weight decay.
    Learning rate update rule (polynomial decay, as used by Caffe's GoogLeNet solver):

    $\alpha_t = \alpha_0 \left( 1 - \frac{t}{t_{\max}} \right)^{p}$

    where $\alpha_0$ is the initial learning rate, $t_{\max}$ the total number of iterations, and $p$ the decay power.

    I will write another article on how to choose these hyperparameters according to the batch size.
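
    For concreteness, here is a minimal NumPy sketch of the two update rules above. The hyperparameter values and the toy quadratic loss are placeholders, not the paper's settings:

    ```python
    import numpy as np

    def poly_lr(alpha0, t, t_max, power=0.5):
        """Polynomial learning rate decay: alpha_t = alpha0 * (1 - t/t_max)^power."""
        return alpha0 * (1.0 - t / t_max) ** power

    def sgd_step(w, dw_prev, grad, alpha, momentum=0.9, weight_decay=0.0002):
        """One momentum-SGD step with L2 weight decay (Caffe-style update)."""
        dw = momentum * dw_prev - alpha * (grad + weight_decay * w)
        return w + dw, dw

    # Toy usage on the quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is w.
    rng = np.random.default_rng(0)
    w  = rng.standard_normal(10)
    dw = np.zeros_like(w)
    t_max, alpha0 = 100, 0.01

    for t in range(t_max):
        grad = w                          # gradient of the toy loss
        alpha = poly_lr(alpha0, t, t_max)
        w, dw = sgd_step(w, dw, grad, alpha)

    print("final ||w|| =", np.linalg.norm(w))
    ```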

    Results

    Final results on a GPU cluster with GoogLeNet:


    More thinking

      1. The schemes above are basically lossless. To reduce communication overhead further, people have started to try lossy schemes that trade off training speed against accuracy. Typical examples:

         1) Reduce parameter size using 16-bit floating point - Google
         2) Use 16-bit weights and 8-bit activations
         3) 1-bit gradient backpropagation - Microsoft
         4) Discard gradients whose numerical values fall below a certain threshold - Amazon (a minimal sketch of this idea follows the list)
         5) Compress (e.g. using PCA) weights before transmitting
         6) Network pruning/encoding/quantization - Intel, DeePhi

      2. Use new low-level technologies to reduce communication overhead - Matrix
         1) RDMA rather than traditional TCP/IP?
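
    As a minimal sketch of idea 4) above, threshold-based gradient dropping: entries below the threshold are kept locally as a residual and folded into later steps. The residual accumulation is a common variant of this idea, not necessarily Amazon's exact scheme, and the threshold value is illustrative:

    ```python
    import numpy as np

    def drop_small_gradients(grad, residual, threshold=1e-3):
        """Send only gradient entries whose magnitude (plus accumulated residual)
        exceeds the threshold; keep the rest locally as residual for later steps."""
        acc = grad + residual                    # fold in what was not sent before
        mask = np.abs(acc) >= threshold
        to_send = np.where(mask, acc, 0.0)       # sparse part that gets communicated
        new_residual = np.where(mask, 0.0, acc)  # small values stay local
        return to_send, new_residual

    # Toy usage with illustrative values.
    rng = np.random.default_rng(0)
    grad = rng.standard_normal(1_000) * 1e-3
    residual = np.zeros_like(grad)
    to_send, residual = drop_small_gradients(grad, residual)
    print(f"sent {np.count_nonzero(to_send)} of {grad.size} gradient entries")
    ```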
    Original post: https://www.cnblogs.com/Matrix_Yao/p/5975386.html