Policy Gradient Methods

These notes are based on a reading of the paper Policy Gradient Methods for Reinforcement Learning with Function Approximation.

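First, fix the optimization objective $\rho(\pi)$, where the policy $\pi$ is an unknown function with parameters $\theta$. The objective usually takes one of two forms.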

The first is the expected reward per step:

\[ \rho(\pi) = \lim_{n \rightarrow \infty} \frac{1}{n} E\left\{ r_1 + r_2 + \cdots + r_n \mid \pi \right\} = \sum_{s} d^{\pi}(s) \sum_{a} \pi(s, a) \mathcal{R}_{s}^{a} \]

The second is the cumulative discounted reward starting from a given state:

\[ \rho(\pi) = E\left\{ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_0, \pi \right\} \]
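As a rough illustration of the two objectives, the sketch below computes finite-sample versions of both from a list of observed rewards. The function names are hypothetical, and truncating to a finite reward sequence is an assumption: the definitions above use a limit and an infinite sum.

```python
def average_reward(rewards):
    """Finite-n version of (1/n) * (r_1 + ... + r_n)."""
    return sum(rewards) / len(rewards)


def discounted_return(rewards, gamma):
    """Truncated version of sum over t >= 1 of gamma^(t-1) * r_t."""
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * r
    return total
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` adds 1 + 0.5 + 0.25.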

To optimize the policy function, Sutton et al. prove in the paper Policy Gradient Methods for Reinforcement Learning with Function Approximation that, for either form of the objective above,

\[ \frac{\partial \rho}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a) \]
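The theorem can be sanity-checked numerically in the simplest possible setting. The sketch below assumes a one-state problem with a softmax policy (both are illustrative assumptions, not taken from the paper), so that $d^{\pi}(s) = 1$ and $\rho(\pi) = \sum_a \pi(a) \mathcal{R}^a$. In that case $Q^{\pi}$ differs from $\mathcal{R}^a$ only by a constant, which drops out because $\sum_a \partial \pi(a) / \partial \theta = 0$. All function names are hypothetical.

```python
import math


def softmax(theta):
    """Softmax policy over actions, one preference per action."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def rho(theta, rewards):
    """Objective in the one-state case: sum_a pi(a) * R^a."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def theorem_gradient(theta, rewards):
    """sum_a (d pi(a) / d theta_k) * R^a, using the softmax derivative."""
    pi = softmax(theta)
    n = len(theta)
    grad = []
    for k in range(n):
        # For softmax: d pi(a) / d theta_k = pi(a) * (1[a == k] - pi(k))
        g = sum(pi[a] * ((1.0 if a == k else 0.0) - pi[k]) * rewards[a]
                for a in range(n))
        grad.append(g)
    return grad


def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference estimate of d rho / d theta_k."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((rho(up, rewards) - rho(down, rewards)) / (2 * eps))
    return grad
```

The two gradient functions should agree up to finite-difference error for any `theta` and `rewards` of equal length.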

That is, the gradient does not involve the derivative of the stationary distribution $d^{\pi}(s)$. Writing $\rho = J(\boldsymbol{\theta})$, we have:

\[
\begin{aligned}
\nabla J(\boldsymbol{\theta}) & = \mathbb{E}_{\pi} \left[ \sum_{a} \pi\left(a | S_t, \boldsymbol{\theta}\right) q_{\pi}\left(S_t, a\right) \frac{\nabla_{\boldsymbol{\theta}} \pi\left(a | S_t, \boldsymbol{\theta}\right)}{\pi\left(a | S_t, \boldsymbol{\theta}\right)} \right] \\
& = \mathbb{E}_{\pi} \left[ q_{\pi}\left(S_t, A_t\right) \frac{\nabla_{\boldsymbol{\theta}} \pi\left(A_t | S_t, \boldsymbol{\theta}\right)}{\pi\left(A_t | S_t, \boldsymbol{\theta}\right)} \right] \\
& = \mathbb{E}_{\pi} \left[ G_t \frac{\nabla_{\boldsymbol{\theta}} \pi\left(A_t | S_t, \boldsymbol{\theta}\right)}{\pi\left(A_t | S_t, \boldsymbol{\theta}\right)} \right]
\end{aligned}
\]

The second equality replaces the sum over actions $a$ with a sample $A_t \sim \pi$, and the last equality holds because $\mathbb{E}_{\pi}\left[ G_t | S_t, A_t \right] = q_{\pi}\left(S_t, A_t\right)$.
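The ratio of the policy gradient to the policy in this derivation is commonly written as a log-derivative. The restatement below uses only the chain rule for the logarithm; the $\ln$ notation is an addition here and does not appear in the derivation itself.

\[ \frac{\nabla_{\boldsymbol{\theta}} \pi\left(A_t | S_t, \boldsymbol{\theta}\right)}{\pi\left(A_t | S_t, \boldsymbol{\theta}\right)} = \nabla_{\boldsymbol{\theta}} \ln \pi\left(A_t | S_t, \boldsymbol{\theta}\right), \qquad \nabla J(\boldsymbol{\theta}) = \mathbb{E}_{\pi} \left[ G_t \nabla_{\boldsymbol{\theta}} \ln \pi\left(A_t | S_t, \boldsymbol{\theta}\right) \right] \]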

The policy gradient update rule is therefore:

\[ \boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha G_t \frac{\nabla_{\boldsymbol{\theta}} \pi\left(A_t | S_t, \boldsymbol{\theta}_t\right)}{\pi\left(A_t | S_t, \boldsymbol{\theta}_t\right)} \]
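A minimal sketch of this update, under assumptions that are not part of the original notes: a single state, a softmax policy with one preference per action, and one-step episodes so that the return $G_t$ equals the immediate reward. For softmax, the ratio $\nabla_{\boldsymbol{\theta}} \pi / \pi$ has component $k$ equal to $1[A_t = k] - \pi(k)$. The names `reinforce_update` and `train` are hypothetical.

```python
import math
import random


def softmax(theta):
    """Softmax policy over actions, one preference per action."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_update(theta, action, G, alpha):
    """One step of theta <- theta + alpha * G * grad(pi) / pi."""
    pi = softmax(theta)
    # For softmax, component k of grad(pi(action)) / pi(action)
    # is 1[action == k] - pi(k).
    return [t + alpha * G * ((1.0 if k == action else 0.0) - pi[k])
            for k, t in enumerate(theta)]


def train(rewards, steps=200, alpha=0.1, seed=0):
    """Run the update on a one-state problem with fixed rewards per action."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        pi = softmax(theta)
        action = rng.choices(range(len(theta)), weights=pi)[0]
        G = rewards[action]  # one-step episode: the return is the reward
        theta = reinforce_update(theta, action, G, alpha)
    return theta
```

Calling `softmax(train([1.0, 0.0]))` should shift probability toward the first action, since only that action yields a nonzero return.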

Original article: https://www.cnblogs.com/statruidong/p/10755683.html
