Code-level Performance Optimization Tricks for PPO

    Intro

    This blog post is my summary after reading Engstrom et al.'s paper "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO".

    reward clipping

    • clip the rewards to a preset range (usually [-5, 5] or [-10, 10])
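    A minimal sketch of this trick (the function name and default bounds are placeholders, not from the paper):

```python
import numpy as np

def clip_reward(reward, low=-10.0, high=10.0):
    """Clip a raw environment reward into a preset range, e.g. [-10, 10]."""
    return float(np.clip(reward, low, high))
```

    Each reward would pass through this before being stored in the rollout buffer.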

    observation clipping

    • The states are first normalized to mean-zero, variance-one vectors and then clipped to a preset range (usually [-10, 10])
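    A sketch of one common way to implement this, keeping a running (Welford-style) estimate of the per-dimension mean and variance; the class name and the pseudo-count initialization are my assumptions:

```python
import numpy as np

class RunningObsNormalizer:
    """Tracks a running mean/variance of observations, normalizes each
    observation to mean zero / variance one, then clips it."""
    def __init__(self, shape, clip=10.0):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1   # pseudo-count so the variance never collapses to 0
        self.clip = clip

    def __call__(self, obs):
        # Welford-style incremental update of the running mean and variance.
        self.count += 1
        delta = obs - self.mean
        self.mean = self.mean + delta / self.count
        self.var = self.var + (delta * (obs - self.mean) - self.var) / self.count
        std = np.sqrt(self.var) + 1e-8
        return np.clip((obs - self.mean) / std, -self.clip, self.clip)
```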

    value function clipping

    The value loss \(L^{V} = (V_{\theta_t} - V_{targ})^{2}\) is replaced with \(L^{V} = \min\left[ (V_{\theta_t} - V_{targ})^{2},\; \left(\mathrm{clip}(V_{\theta_t},\, V_{\theta_{t-1}} - \epsilon,\, V_{\theta_{t-1}} + \epsilon) - V_{targ}\right)^{2} \right]\)
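    A PyTorch sketch of this loss. One hedge: the widely used OpenAI Baselines PPO implementation takes the elementwise max of the two squared errors (a pessimistic variant) rather than the min written above:

```python
import torch

def clipped_value_loss(v_new, v_old, v_targ, eps=0.2):
    """Value loss that keeps the new value prediction within eps of the
    previous prediction before computing the squared error."""
    unclipped = (v_new - v_targ) ** 2
    # Equivalent to clip(v_new, v_old - eps, v_old + eps).
    v_clipped = v_old + torch.clamp(v_new - v_old, -eps, eps)
    clipped = (v_clipped - v_targ) ** 2
    return torch.min(unclipped, clipped).mean()
```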

    orthogonal initialization and layer scaling

    use orthogonal initialization for the weight matrices, with a scaling factor (gain) that varies from layer to layer
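    A sketch of a per-layer initializer in PyTorch (the helper name is mine). Common PPO implementations use a gain of sqrt(2) for hidden layers, 0.01 for the policy output layer, and 1.0 for the value output layer:

```python
import torch.nn as nn

def ortho_init(layer, gain=1.0):
    """Orthogonally initialize a linear layer's weights with the given gain;
    biases start at zero."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer
```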

    adam learning rate annealing

    anneal the learning rate of Adam over the course of training (commonly a linear decay to zero)
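    A sketch using PyTorch's LambdaLR scheduler to decay the learning rate linearly to zero; the initial rate of 3e-4 and the update count are placeholder values:

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)     # stand-in for the policy/value network
total_updates = 1000      # assumed total number of PPO updates

optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)
# Multiply the base learning rate by a factor that decays linearly to 0.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda update: 1.0 - update / total_updates)

for update in range(total_updates):
    # ... compute the PPO loss, loss.backward(), optimizer.step() ...
    scheduler.step()
```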

    hyperbolic tan activations

    use hyperbolic tangent (tanh) activations when constructing the policy network and the value network
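    For example, a small policy network in this style (layer sizes and dimensions are placeholders), reusing the ortho_init helper sketched above:

```python
import torch.nn as nn

obs_dim, act_dim = 8, 2   # placeholder dimensions

policy = nn.Sequential(
    ortho_init(nn.Linear(obs_dim, 64), gain=2 ** 0.5), nn.Tanh(),
    ortho_init(nn.Linear(64, 64), gain=2 ** 0.5), nn.Tanh(),
    ortho_init(nn.Linear(64, act_dim), gain=0.01),
)
```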

    global gradient clipping

    clip the gradients so that the global L2 norm of all parameter gradients, taken together, does not exceed 0.5
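    In PyTorch this is a single call between the backward pass and the optimizer step (the network and loss here are stand-ins for illustration):

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)                 # stand-in network
loss = net(torch.randn(1, 4)).sum()   # dummy loss
loss.backward()
# Rescale all gradients together so their combined L2 norm is at most 0.5.
nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.5)
```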

    reward scaling

    • rather than feeding raw rewards into the update, divide each reward by the standard deviation of a rolling discounted sum of the rewards (without subtracting the mean)
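    A sketch of this scaler (the class name is mine; the variance bookkeeping follows the same Welford-style update as the observation normalizer above):

```python
import numpy as np

class RewardScaler:
    """Divides each reward by the std of a rolling discounted sum of rewards;
    the mean is deliberately not subtracted."""
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.ret = 0.0   # rolling discounted sum of rewards
        self.count = 1   # pseudo-count so the variance never collapses to 0
        self.mean = 0.0
        self.var = 1.0

    def __call__(self, reward):
        self.ret = self.gamma * self.ret + reward
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.var += (delta * (self.ret - self.mean) - self.var) / self.count
        return reward / (np.sqrt(self.var) + 1e-8)
```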

Original post: https://www.cnblogs.com/dynmi/p/14031724.html