  TensorFlow Usage Notes (6): Optimizers

    0. tf.train.Optimizer

    TensorFlow provides a rich set of optimizers, all of which inherit from the Optimizer class. The Optimizer class exposes a few methods, briefly introduced below:

    0.1. minimize

    minimize(
        loss,
        global_step=None,
        var_list=None,
        gate_gradients=GATE_OP,
        aggregation_method=None,
        colocate_gradients_with_ops=False,
        name=None,
        grad_loss=None
    )
    • loss: A Tensor containing the value to minimize.
    • global_step: Optional Variable to increment by one after the variables have been updated.
    • var_list: Optional list or tuple of Variable objects to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
    • gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
    • aggregation_method: Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
    • colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
    • name: Optional name for the returned operation.
    • grad_loss: Optional. A Tensor holding the gradient computed for loss.
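
    For example, the usual call pattern looks like this (a minimal sketch; loss, learning_rate, and global_step are assumed to be defined elsewhere in the graph):

    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss, global_step=global_step)
    # minimize() is equivalent to compute_gradients() followed by apply_gradients()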

    0.2. compute_gradients

    compute_gradients(
        loss,
        var_list=None,
        gate_gradients=GATE_OP,
        aggregation_method=None,
        colocate_gradients_with_ops=False,
        grad_loss=None
    )

    This is the first of the two steps performed by minimize(): it computes the gradients and returns a list of (gradient, variable) pairs.

    0.3. apply_gradients

    apply_gradients(
        grads_and_vars,
        global_step=None,
        name=None
    )

    This is the second step of minimize(): it returns an op that applies the gradient updates to the variables.

    These two functions are used in TensorFlow Usage Notes (8): Gradient Clipping; a minimal sketch of that pattern follows.
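
    The sketch below combines the two steps with tf.clip_by_value; the clipping threshold of 1.0 is purely illustrative, and loss, learning_rate, and global_step are assumed to be defined elsewhere:

    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    grads_and_vars = optimizer.compute_gradients(loss)         # step 1: list of (gradient, variable) pairs
    clipped = [(tf.clip_by_value(g, -1.0, 1.0), v)             # modify the gradients in between
               for g, v in grads_and_vars if g is not None]
    training_op = optimizer.apply_gradients(clipped, global_step=global_step)  # step 2: apply the updates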

    1. tf.train.GradientDescentOptimizer

    __init__(
        learning_rate,
        use_locking=False,
        name='GradientDescent'
    )

    \begin{equation}
    \label{a}
    \theta \gets \theta - \eta \nabla_{\theta}J(\theta)
    \end{equation}

    The standard gradient descent optimizer.

    Recall that Gradient Descent simply updates the weights $\theta$ by directly subtracting the gradient of the cost function $J(\theta)$ with regards to the weights ($\nabla_{\theta}J(\theta)$) multiplied by the learning rate $\eta$. It does not care about what the earlier gradients were. If the local gradient is tiny, it goes very slowly.
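
    Usage (a minimal sketch, following the same pattern as the other optimizers below; learning_rate is assumed to be defined elsewhere):

    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)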

    2. tf.train.MomentumOptimizer

    __init__(
        learning_rate,
        momentum,
        use_locking=False,
        name='Momentum',
        use_nesterov=False
    )

    Momentum optimization cares a great deal about what previous gradients were: at each iteration, it adds the local gradient to the momentum vector $\mathbf{m}$ (multiplied by the learning rate $\eta$), and it updates the weights by simply subtracting this momentum vector.

    \begin{equation}
    \label{b}
    \begin{split}
    & \mathbf{m} \gets \beta \mathbf{m} + \eta \nabla_{\theta}J(\theta) \\
    & \theta \gets \theta - \mathbf{m}
    \end{split}
    \end{equation}

    Usage:

    optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)

    Besides the standard MomentumOptimizer, there is also a variant, Nesterov Accelerated Gradient:

    The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum.

    \begin{equation}
    \label{c}
    \begin{split}
    & \mathbf{m} \gets \beta \mathbf{m} + \eta \nabla_{\theta}J(\theta + \beta \mathbf{m}) \\
    & \theta \gets \theta - \mathbf{m}
    \end{split}
    \end{equation}

    Usage:

    optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9, use_nesterov=True)
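
    To make the two update rules concrete, here is a toy NumPy sketch (not TensorFlow code; the quadratic cost J(theta) = 0.5 * theta**2 and its gradient are illustrative assumptions):

    import numpy as np

    def grad_J(theta):                           # gradient of the toy cost J(theta) = 0.5 * theta**2
        return theta

    eta, beta = 0.1, 0.9                         # learning rate and momentum coefficient
    for use_nesterov in (False, True):
        theta, m = np.array([1.0]), np.zeros(1)
        for _ in range(200):
            lookahead = theta + beta * m if use_nesterov else theta
            m = beta * m + eta * grad_J(lookahead)   # accumulate the (look-ahead) gradient
            theta = theta - m                        # step along the momentum vector
        print(use_nesterov, theta)               # both end up close to the minimum at 0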

    3. tf.train.AdagradOptimizer

    __init__(
        learning_rate,
        initial_accumulator_value=0.1,
        use_locking=False,
        name='Adagrad'
    )

    \begin{equation}
    \label{d}
    \begin{split}
    & \mathbf{s} \gets \mathbf{s} + \nabla_{\theta}J(\theta) \otimes \nabla_{\theta}J(\theta) \\
    & \theta \gets \theta - \eta \nabla_{\theta}J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon}
    \end{split}
    \end{equation}

    The first step accumulates the square of the gradients into the vector $\mathbf{s}$ (the $\otimes$ symbol represents element-wise multiplication). This vectorized form is equivalent to computing $s_i \gets s_i + (\partial J(\theta) / \partial \theta_i)^2$ for each element $s_i$ of the vector $\mathbf{s}$; in other words, each $s_i$ accumulates the squares of the partial derivative of the cost function with regards to parameter $\theta_i$. If the cost function is steep along the ith dimension, then $s_i$ will get larger and larger at each iteration.

    The second step is almost identical to Gradient Descent, but with one big difference: the gradient vector is scaled down by a factor of $\sqrt{\mathbf{s} + \epsilon}$ (the $\oslash$ symbol represents element-wise division, and $\epsilon$ is a smoothing term to avoid division by zero, typically set to $10^{-10}$). This vectorized form is equivalent to computing $\theta_i \gets \theta_i - \eta \, \partial J(\theta) / \partial \theta_i \, / \sqrt{s_i + \epsilon}$ for all parameters $\theta_i$ (simultaneously).

    In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum. One additional benefit is that it requires much less tuning of the learning rate hyperparameter $\eta$.
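
    A toy NumPy sketch of these two steps on a cost that is much steeper along the first dimension than the second (the cost function is an illustrative assumption):

    import numpy as np

    def grad_J(theta):                           # gradient of J(theta) = 5*theta[0]**2 + 0.5*theta[1]**2
        return np.array([10.0, 1.0]) * theta

    eta, eps = 0.1, 1e-10
    theta, s = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(100):
        g = grad_J(theta)
        s = s + g * g                            # accumulate squared gradients, element-wise
        theta = theta - eta * g / np.sqrt(s + eps)   # a larger s_i means a smaller effective step
    print(theta, s)                              # the steep dimension has accumulated a much larger s[0]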

     

    Usage:

    optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)

    Not recommended:

    AdaGrad often performs well for simple quadratic problems, but unfortunately it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though TensorFlow has an AdagradOptimizer, you should not use it to train deep neural networks (it may be efficient for simpler tasks such as Linear Regression, though).

    4. tf.train.RMSPropOptimizer

    __init__(
        learning_rate,
        decay=0.9,
        momentum=0.0,
        epsilon=1e-10,
        use_locking=False,
        centered=False,
        name='RMSProp'
    )

    Although AdaGrad slows down a bit too fast and ends up never converging to the global optimum, the RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step.

    \begin{equation}
    \label{e}
    \begin{split}
    & \mathbf{s} \gets \beta \mathbf{s} + (1 - \beta) \nabla_{\theta}J(\theta) \otimes \nabla_{\theta}J(\theta) \\
    & \theta \gets \theta - \eta \nabla_{\theta}J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon}
    \end{split}
    \end{equation}

    The decay rate $\beta$ is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well, so you may not need to tune it at all.

    Usage:

    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                          momentum=0.9, decay=0.9, epsilon=1e-10)

    Except on very simple problems, this optimizer almost always performs much better than AdaGrad. It also generally performs better than Momentum optimization and Nesterov Accelerated Gradients. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.
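
    A toy NumPy sketch of the update, differing from the AdaGrad sketch above only in the exponentially decaying accumulator (beta is the decay rate; the cost function is again an illustrative assumption):

    import numpy as np

    def grad_J(theta):                           # same toy gradient as in the AdaGrad sketch
        return np.array([10.0, 1.0]) * theta

    eta, beta, eps = 0.01, 0.9, 1e-10
    theta, s = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(500):
        g = grad_J(theta)
        s = beta * s + (1 - beta) * g * g        # only recent gradients are remembered
        theta = theta - eta * g / np.sqrt(s + eps)
    print(theta)                                 # keeps making progress instead of stalling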

    5. tf.train.AdamOptimizer

    __init__(
        learning_rate=0.001,
        beta1=0.9,
        beta2=0.999,
        epsilon=1e-08,
        use_locking=False,
        name='Adam'
    )

    Adam, which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients.

    \begin{equation}
    \label{f}
    \begin{split}
    & \mathbf{m} \gets \beta_1 \mathbf{m} + (1 - \beta_1) \nabla_{\theta}J(\theta) \\
    & \mathbf{s} \gets \beta_2 \mathbf{s} + (1 - \beta_2) \nabla_{\theta}J(\theta) \otimes \nabla_{\theta}J(\theta) \\
    & \mathbf{m} \gets \frac{\mathbf{m}}{1 - \beta_1^t} \\
    & \mathbf{s} \gets \frac{\mathbf{s}}{1 - \beta_2^t} \\
    & \theta \gets \theta - \eta \, \mathbf{m} \oslash \sqrt{\mathbf{s} + \epsilon}
    \end{split}
    \end{equation}

    $t$ is the time step (iteration number, starting at 1). The momentum decay hyperparameter $\beta_1$ is typically initialized to 0.9, while the scaling decay hyperparameter $\beta_2$ is often initialized to 0.999. As earlier, the smoothing term $\epsilon$ is usually initialized to a tiny number such as $10^{-8}$.

    Usage:

    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
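
    To make the role of the bias-correction terms $1 - \beta_1^t$ and $1 - \beta_2^t$ concrete, here is a toy NumPy sketch of the update (the cost function is an illustrative assumption; separate m_hat/s_hat variables are used so the running averages themselves are not overwritten):

    import numpy as np

    def grad_J(theta):                           # gradient of the toy cost J(theta) = 0.5 * theta**2
        return theta

    eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
    theta, m, s = np.array([1.0]), np.zeros(1), np.zeros(1)
    for t in range(1, 1001):                     # t starts at 1 so that beta**t < 1
        g = grad_J(theta)
        m = beta1 * m + (1 - beta1) * g          # decaying average of past gradients
        s = beta2 * s + (1 - beta2) * g * g      # decaying average of past squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction: m and s start at zero,
        s_hat = s / (1 - beta2 ** t)             #   so early averages are scaled back up
        theta = theta - eta * m_hat / np.sqrt(s_hat + eps)
    print(theta)                                 # close to the minimizer at 0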

    6. tf.train.FtrlOptimizer

    __init__(
        learning_rate,
        learning_rate_power=-0.5,
        initial_accumulator_value=0.1,
        l1_regularization_strength=0.0,
        l2_regularization_strength=0.0,
        use_locking=False,
        name='Ftrl',
        accum_name=None,
        linear_name=None,
        l2_shrinkage_regularization_strength=0.0
    )

    See the FTRL-Proximal paper (McMahan et al., "Ad Click Prediction: a View from the Trenches", KDD 2013). This version has support for both online L2 (the L2 penalty given in the paper above) and shrinkage-type L2 (which is the addition of an L2 penalty to the loss function).
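
    Usage (a minimal sketch, by analogy with the other optimizers above; only learning_rate is required, and the regularization strength shown is purely illustrative):

    optimizer = tf.train.FtrlOptimizer(learning_rate=learning_rate,
                                       l1_regularization_strength=0.001)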
