  • Gradient degeneration (the "degenerate" phenomenon) during training of a REINFORCE algorithm based on a shallow, fully-connected neural network

    First, a link to the code:

    https://gitee.com/devilmaycry812839668/CartPole-PolicyNetwork

    A policy-network algorithm for reinforcement learning: the policy-network example from the reinforcement-learning chapter of《TensorFlow实战》(TensorFlow in Action), with gym's CartPole as the simulation environment. This project partially refactors the book's original code, adds some Chinese comments, and reports the results of 30 experiment runs.

    =======================================

    As you can see, the code above is a fairly simple REINFORCE algorithm. The policy function is a shallow three-layer fully-connected neural network with ReLU activations. 30 experiment runs were performed, each trained for 10,000 episodes, and surprisingly, runs 5 and 21 of those 30 showed severe gradient degeneration during training (a minimal sketch of this setup is given below).
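
    Below is a minimal sketch of that setup: REINFORCE on CartPole with one ReLU hidden layer and standardized discounted returns. It is written in plain NumPy rather than the repo's TensorFlow code, and the hyperparameters (hidden size 50, discount 0.99, learning rate 0.01, 1000 episodes) are illustrative assumptions, not the repo's exact values.

    import gym
    import numpy as np

    env = gym.make("CartPole-v0")            # classic gym API assumed: reset() returns obs
    H, lr, gamma = 50, 1e-2, 0.99
    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (H, 4)); b1 = np.zeros(H)
    W2 = rng.normal(0, 0.1, H);      b2 = 0.0

    def policy(s):
        h = np.maximum(0.0, W1 @ s + b1)            # ReLU hidden layer
        p = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))    # P(action = 1)
        return p, h

    for episode in range(1000):
        s = env.reset()                             # newer gym/gymnasium returns (obs, info) instead
        states, hiddens, actions, probs, rewards = [], [], [], [], []
        done = False
        while not done:
            p, h = policy(s)
            a = int(rng.random() < p)
            states.append(s); hiddens.append(h); actions.append(a); probs.append(p)
            s, r, done, _ = env.step(a)             # newer gym returns a 5-tuple here
            rewards.append(r)

        # Discounted, standardized returns (the same trick the book's code uses).
        G = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            G[t] = running
        G = (G - G.mean()) / (G.std() + 1e-8)

        # Manual policy-gradient ascent; for a Bernoulli policy, d log pi / d logit = a - p.
        gW1 = np.zeros_like(W1); gb1 = np.zeros_like(b1)
        gW2 = np.zeros_like(W2); gb2 = 0.0
        for s_t, h_t, a_t, p_t, g_t in zip(states, hiddens, actions, probs, G):
            dlogit = (a_t - p_t) * g_t
            gW2 += dlogit * h_t; gb2 += dlogit
            dh = dlogit * W2
            dh[h_t <= 0] = 0.0                      # ReLU gradient mask
            gW1 += np.outer(dh, s_t); gb1 += dh
        W1 += lr * gW1; b1 += lr * gb1; W2 += lr * gW2; b2 += lr * gb2

        if episode % 25 == 0:
            print("Reward for episode %d : %f" % (episode, sum(rewards)))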

    Part of the training log from a run where the degeneration occurred:

    Average reward for episode 1375 : 200.000000.
    Average reward for episode 1400 : 200.000000.
    Average reward for episode 1425 : 200.000000.
    Average reward for episode 1450 : 200.000000.
    Average reward for episode 1475 : 200.000000.
    Average reward for episode 1500 : 200.000000.
    Average reward for episode 1525 : 200.000000.
    Average reward for episode 1550 : 192.480000.
    Average reward for episode 1575 : 140.440000.
    Average reward for episode 1600 : 104.240000.
    Average reward for episode 1625 : 20.080000.
    Average reward for episode 1650 : 12.560000.
    Average reward for episode 1675 : 10.720000.
    Average reward for episode 1700 : 11.080000.
    Average reward for episode 1725 : 12.000000.
    Average reward for episode 1750 : 10.560000.
    Average reward for episode 1775 : 11.040000.
    Average reward for episode 1800 : 10.360000.
    Average reward for episode 1825 : 10.080000.
    Average reward for episode 1850 : 10.640000.
    Average reward for episode 1875 : 10.360000.
    Average reward for episode 1900 : 10.360000.
    Average reward for episode 1925 : 10.480000.
    Average reward for episode 1950 : 10.360000.
    Average reward for episode 1975 : 9.680000.
    Average reward for episode 2000 : 10.000000.
    Average reward for episode 2025 : 10.720000.
    Average reward for episode 2050 : 10.000000.
    Average reward for episode 2075 : 10.000000.
    Average reward for episode 2100 : 10.520000.
    Average reward for episode 2125 : 10.640000.
    Average reward for episode 2150 : 9.760000.
    Average reward for episode 2175 : 11.040000.

    As the log shows, in runs 5 and 21 the training results collapsed to an extremely poor level after a certain number of episodes, far below what a random policy achieves (a random policy should score around 26; a quick way to measure that baseline is sketched below). So the training had clearly run into the degeneration ("degenerate") problem at that point. I had always assumed that degeneration only occurs in deep networks, and did not expect to see it in a shallow network as well.
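
    The random-policy baseline is easy to check with a small sketch like the one below; the exact number depends on the gym version and environment id, so treat the ~26 above as the post's figure rather than something this sketch guarantees.

    import gym
    import numpy as np

    env = gym.make("CartPole-v0")                    # classic gym API assumed
    returns = []
    for _ in range(1000):
        env.reset()
        done, total = False, 0.0
        while not done:
            _, r, done, _ = env.step(env.action_space.sample())   # uniformly random actions
            total += r
        returns.append(total)
    print("Random-policy average return:", np.mean(returns))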

    Consulting the paper "Skip connections eliminate singularities", I found that shallow networks can indeed exhibit this degeneration as well, which answered my question; a sketch of the skip-connection idea the paper advocates is given below.
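
    As an illustration of what that remedy looks like, here is a hypothetical variant of the shallow policy network with a skip connection, where the raw state bypasses the hidden layer and feeds the output directly. This is only a sketch of the idea; it is not part of the repo and was not tested in these experiments.

    import numpy as np

    H = 50
    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (H, 4)); b1 = np.zeros(H)
    W2 = rng.normal(0, 0.1, H + 4);  b2 = 0.0         # the output now also sees the raw state

    def policy_with_skip(s):
        h = np.maximum(0.0, W1 @ s + b1)              # ReLU hidden layer, as before
        z = np.concatenate([h, s])                    # skip connection: input bypasses the hidden layer
        return 1.0 / (1.0 + np.exp(-(W2 @ z + b2)))   # P(action = 1)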

  • Original post: https://www.cnblogs.com/devilmaycry812839668/p/14097322.html