  • Reinforcement Learning (9): Policy Gradient

    Policy Gradient Methods

    Almost all the reinforcement learning methods we have studied so far are 'action-value' methods: they first learn the value of each action in each state and then, in each state, choose among actions according to those estimated values. This can be seen as an 'indirect' approach, because the real goal of reinforcement learning is to decide how to act, and these methods use action values only as an aid to that decision. It is an intuitive and easy way of thinking. There is also a more 'direct' approach: learn the policy itself, without auxiliary quantities such as the value functions used before. It is more direct because what learning produces in the end is exactly how to act, but it is less intuitive. With a value function as a helper there is essentially no interpretability problem: pick whichever action has the larger value. Learning a policy directly is different; there is no such reference to lean on, so it is harder to see why a particular action gets chosen, and this is the main reason these methods feel harder to understand than the earlier action-value methods. In fact, just as in deep learning, we succeed as long as we find a function that fits the decision process well. The decision process can be regarded as a decision function, and we only need to apply reinforcement learning to approximate that function as closely as possible. Since we are learning a function, it is natural to learn its parameters, so the policy becomes a parameterized policy. A parameterized policy selects actions without consulting a value function; the value function plays no part in action selection, though it may still be used when learning the policy parameters.

    With the rise of deep learning, gradient-based algorithms have come into wide use, and gradient-based policy learning has become the mainstream. Gradient-based learning needs an objective, here a performance measure of the policy, denoted \(J(\theta)\). The rest is simple: we maximize this performance measure, updating the parameters by

    \[\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)} \]

    This is the general form of a policy gradient algorithm, where \(\widehat{\nabla J(\theta_t)}\) is a stochastic estimate whose expectation approximates the gradient of the performance measure.

    Policy Approximation and its Advantages

    If the action space is discrete and not too large, we can describe each state-action pair with a preference function \(h(s,a,\theta)\): in each state, the action with the highest preference has the highest probability of being selected. The most common way to turn preferences into probabilities is the soft-max:

    \[\pi(a|s,\theta) \doteq \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}} \]

    The preference function \(h(s,a,\theta)\) can be computed by an ANN or simply as a linear combination of features; a minimal sketch of the linear case is given below.
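
    Below is a minimal NumPy sketch of this linear soft-max parameterization (the feature matrix x, the helper names, and all numbers are made up for illustration, not part of the original text): each row of x is the feature vector \(x(s,a)\) of one action, and grad_log_pi returns \(\nabla \ln\pi(a|s,\theta)\), the quantity used by all the algorithms later in this post.

    import numpy as np

    def softmax_policy(theta, x):
        # Action probabilities pi(.|s, theta) from linear preferences h(s, a, theta) = theta . x(s, a).
        # x: array of shape (n_actions, n_features), one feature vector x(s, a) per action.
        h = x @ theta
        h = h - h.max()                          # shift preferences for numerical stability
        e = np.exp(h)
        return e / e.sum()

    def grad_log_pi(theta, x, a):
        # For the linear soft-max, grad ln pi(a|s, theta) = x(s, a) - sum_b pi(b|s, theta) x(s, b).
        p = softmax_policy(theta, x)
        return x[a] - p @ x

    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 4))                  # 3 actions, 4 made-up features
    theta = np.zeros(4)
    print(softmax_policy(theta, x))              # uniform probabilities when theta = 0
    print(grad_log_pi(theta, x, a=1))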

    Parameterized policies have several advantages. First, the approximate policy can approach a deterministic policy.

    Second, with a soft-max over action preferences, actions can be selected with arbitrary probabilities.

    Third, compared with action-value methods, the policy may be a simpler function to approximate.

    Finally, policy parameterization provides a convenient way to incorporate prior knowledge.

    The Policy Gradient Theorem

    The advantages above are practical reasons for preferring policy parameterization over action-value methods. There is also an important theoretical advantage: the policy gradient theorem, which provides an analytic form for the gradient of the performance measure:

    \[\nabla J(\theta) \propto \sum_s \mu(s)\sum_a q_{\pi}(s,a)\,\nabla \pi(a|s,\theta) \]

    where \(\mu(s)\) is the on-policy distribution. To prove this theorem, we first need to consider \(\nabla v_{\pi}(s)\).

    \[\begin{aligned} \nabla v_{\pi}(s) &= \nabla\Big[\sum_a \pi(a|s)\,q_{\pi}(s,a)\Big] \qquad \forall s \in S\\ &= \sum_a \Big[\nabla \pi(a|s)\,q_{\pi}(s,a) + \pi(a|s)\,\nabla q_{\pi}(s,a)\Big]\\ &= \sum_a \Big[\nabla \pi(a|s)\,q_{\pi}(s,a) + \pi(a|s)\,\nabla \sum_{s',r} p(s',r|s,a)\big(r + v_{\pi}(s')\big)\Big]\\ &= \sum_a \Big[\nabla \pi(a|s)\,q_{\pi}(s,a) + \pi(a|s)\sum_{s'} p(s'|s,a)\,\nabla v_{\pi}(s')\Big] \end{aligned} \]

    This gives a recursion from s to s', which is an important result.

    Next we need one more observation: viewed at the level of states, an MDP is a process of moving from one state s to another state s'. There are many ways to get from s to s'; the probability of getting there in exactly one step can be written down directly:

    \[p_{\pi}(s \rightarrow s', n=1) = \sum_a \pi(a|s)\,p(s'|s,a) \]

    where n is the number of steps. What about n = k? We cannot write it down directly, and it is clearly complicated, but it can be obtained recursively. Suppose the probability for n = k, \(p_{\pi}(s \rightarrow s', n=k)\), is known. Then the probability for n = k + 1 is:

    \[\begin{aligned} p_{\pi}(s \rightarrow s', n=k+1) &= \sum_{s''} p_{\pi}(s'|s'')\,p_{\pi}(s \rightarrow s'', n=k)\\ &= \sum_{s''} \sum_a \pi(a|s'')\,p(s'|s'',a)\,p_{\pi}(s \rightarrow s'', n=k)\\ &= \sum_{s''} p_{\pi}(s \rightarrow s'', n=k)\,p_{\pi}(s'' \rightarrow s', n=1) \end{aligned} \]
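
    In matrix form this recursion is just matrix multiplication: the k-step state-transition matrix under \(\pi\) is the k-th power of the one-step matrix. A small self-contained numeric check, with a made-up two-state, two-action MDP (all numbers are hypothetical):

    import numpy as np

    # Hypothetical 2-state, 2-action MDP: p[a, s, t] = p(s'=t | s, a), pi[s, a] = pi(a|s).
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],      # action 0
                  [[0.3, 0.7], [0.6, 0.4]]])     # action 1
    pi = np.array([[0.5, 0.5], [0.1, 0.9]])

    # One-step probabilities (the n = 1 formula above): P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
    P_pi = np.einsum('sa,ast->st', pi, p)

    # The recursion p_pi(s -> s', k+1) = sum_{s''} p_pi(s -> s'', k) p_pi(s'' -> s', 1)
    # is exactly matrix multiplication, so k-step probabilities are matrix powers of P_pi.
    P3 = np.linalg.matrix_power(P_pi, 3)
    print(P3[0, 1])                              # p_pi(s=0 -> s'=1, n=3)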

    With this, we can continue the derivation of \(\nabla v_{\pi}(s)\):

    \[\begin{aligned} \nabla v_{\pi}(s) &= \sum_a\Big[\nabla \pi(a|s)\,q_{\pi}(s,a) + \pi(a|s)\sum_{s'} p(s'|s,a)\,\nabla v_{\pi}(s')\Big]\\ &= \sum_a \nabla \pi(a|s)\,q_{\pi}(s,a) + \sum_{s'}\sum_a \pi(a|s)\,p(s'|s,a)\,\nabla v_{\pi}(s')\\ &\qquad \big(\text{for simplicity, define } \phi(s) = \sum_a \nabla \pi(a|s)\,q_{\pi}(s,a)\big)\\ &= \phi(s) + \sum_{s'} p_{\pi}(s \rightarrow s',1)\,\nabla v_{\pi}(s')\\ &= \phi(s) + \sum_{s'} p_{\pi}(s \rightarrow s',1)\Big(\phi(s') + \sum_{s''} p_{\pi}(s' \rightarrow s'',1)\,\nabla v_{\pi}(s'')\Big)\\ &= \phi(s) + \sum_{s'} p_{\pi}(s \rightarrow s',1)\,\phi(s') + \sum_{s'} p_{\pi}(s \rightarrow s',1)\sum_{s''} p_{\pi}(s' \rightarrow s'',1)\,\nabla v_{\pi}(s'')\\ &= \phi(s) + \sum_{s'} p_{\pi}(s \rightarrow s',1)\,\phi(s') + \sum_{s''} p_{\pi}(s \rightarrow s'',2)\,\nabla v_{\pi}(s'')\\ &= \dots\\ &= \sum_x\sum_{k=0}^{\infty} p_{\pi}(s \rightarrow x,k)\,\phi(x) \end{aligned} \]

    As mentioned above, \(J(\theta)\) is the performance measure; one common choice is:

    \[J(\theta) \doteq v_{\pi}(s_0) \]

    Therefore:

    \[\begin{aligned} \nabla J(\theta) &= \nabla v_{\pi}(s_0)\\ &= \sum_s\sum_{k=0}^{\infty} p_{\pi}(s_0 \rightarrow s,k)\,\phi(s)\\ &= \sum_s \eta(s)\,\phi(s)\\ &= \sum_{s'}\eta(s')\sum_s \frac{\eta(s)}{\sum_{s'}\eta(s')}\,\phi(s)\\ &\propto \sum_s \frac{\eta(s)}{\sum_{s'}\eta(s')}\,\phi(s) \qquad\qquad \big(\text{as } \textstyle\sum_{s'}\eta(s') \text{ is a constant}\big)\\ &= \sum_s \mu(s)\sum_a \nabla \pi(a|s)\,q_{\pi}(s,a) \qquad \big(\text{define } \mu(s) = \tfrac{\eta(s)}{\sum_{s'}\eta(s')}\text{; this proves the theorem}\big)\\ &= \sum_s \mu(s)\sum_a \pi(a|s)\,q_{\pi}(s,a)\,\frac{\nabla \pi(a|s)}{\pi(a|s)}\\ &= E_{\pi}\big[q_{\pi}(S_t,A_t)\,\nabla \ln \pi(A_t|S_t)\big] \qquad\qquad \big(E_{\pi} \text{ refers to } E_{s\sim\mu(s),\,a\sim\pi_{\theta}}\big)\\ &= E_{\pi}\big[G_t\,\nabla \ln \pi(A_t|S_t)\big] \qquad\qquad \big(\text{as } E_{\pi}[G_t|S_t,A_t] = q_{\pi}(S_t,A_t)\big) \end{aligned} \]
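
    The last line is the score-function (log-derivative) form that the algorithms below sample from. As a small sanity check, here is a self-contained numerical comparison for a one-state (bandit) problem with a soft-max policy and made-up action values q: the Monte Carlo estimate of \(E_{\pi}[q(A)\,\nabla \ln\pi(A)]\) should match the analytic gradient of \(J(\theta)=\sum_a \pi(a)q(a)\) up to sampling noise.

    import numpy as np

    rng = np.random.default_rng(1)
    q = np.array([1.0, 2.0, 0.5])                # hypothetical action values q(a)
    theta = np.array([0.1, -0.3, 0.2])           # one preference per action

    def softmax(theta):
        e = np.exp(theta - theta.max())
        return e / e.sum()

    p = softmax(theta)
    analytic = p * (q - p @ q)                   # exact gradient of J(theta) = sum_a pi(a) q(a)

    # Score-function estimate: average of q(A) * grad ln pi(A) over samples A ~ pi,
    # where grad ln pi(a) = e_a - pi for this soft-max parameterization.
    samples = rng.choice(len(q), size=200_000, p=p)
    grad_log = np.eye(len(q))[samples] - p
    mc = (q[samples, None] * grad_log).mean(axis=0)

    print(analytic)
    print(mc)                                    # agrees with the analytic gradient up to Monte Carlo noise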

    REINFORCE: Monte Carlo Policy Gradient

    \[\begin{aligned} \nabla J(\theta) &\propto \sum_s \mu(s)\sum_a q_{\pi}(s,a)\,\nabla \pi(a|s,\theta)\\ &= E_{\pi}\left[\sum_a \pi(a|S_t,\theta)\,q_{\pi}(S_t,a)\,\frac{\nabla \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\right]\\ &= E_{\pi}\left[q_{\pi}(S_t,A_t)\,\frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right] \qquad\qquad (\text{replacing } a \text{ by the sample } A_t \sim \pi)\\ &= E_{\pi}\left[G_t\,\frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right] \qquad\qquad\qquad (\text{because } E_{\pi}[G_t \mid S_t,A_t] = q_{\pi}(S_t,A_t)) \end{aligned} \]

    where \(G_t\) is the return. From this we obtain the parameter update rule:

    \[\theta_{t+1} = \theta_t + \alpha\, G_t\, \frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \]

    # REINFORCE: Monte-Carlo Policy-Gradient Control (episodic) for pi*
    Algorithm parameter: step size alpha > 0
    Initialize policy parameter theta (a vector)
    Loop forever (for each episode):
        Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T, following pi(.|., theta)
        Loop for each step of the episode t = 0, 1, ..., T-1:
            G = sum_{k=t+1}^T gamma^{k-t-1} R_k
            theta = theta + alpha gamma^t G grad(ln pi(A_t|S_t, theta))
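
    Below is a minimal NumPy sketch of the update in the box above, assuming the linear soft-max policy from earlier; the helper names and the toy two-step episode are made up purely to show the calling convention, not a full agent-environment loop.

    import numpy as np

    def grad_log_pi(theta, x, a):
        # grad ln pi(a|s, theta) for a linear soft-max policy; x holds one feature row per action.
        h = x @ theta
        p = np.exp(h - h.max()); p = p / p.sum()
        return x[a] - p @ x

    def reinforce_update(theta, episode, alpha=0.01, gamma=1.0):
        # One pass of the REINFORCE update over a finished episode.
        # episode: list of (x_t, a_t, r_{t+1}), with x_t the per-action feature matrix of S_t.
        T = len(episode)
        rewards = np.array([r for _, _, r in episode])
        for t, (x, a, _) in enumerate(episode):
            G = float((gamma ** np.arange(T - t)) @ rewards[t:])   # G_t = sum_{k=t+1}^T gamma^{k-t-1} R_k
            theta = theta + alpha * (gamma ** t) * G * grad_log_pi(theta, x, a)
        return theta

    rng = np.random.default_rng(0)
    theta = np.zeros(3)
    episode = [(rng.normal(size=(2, 3)), 0, 1.0),    # (features of S_t, A_t, R_{t+1})
               (rng.normal(size=(2, 3)), 1, -0.5)]
    print(reinforce_update(theta, episode))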
    

    REINFORCE with Baseline

    As a Monte Carlo method, REINFORCE suffers from fairly high variance, which makes learning inefficient. Introducing a baseline can reduce the variance:

    \[\nabla J(\theta) \propto \sum_s \mu(s)\sum_a \big(q_{\pi}(s,a) - b(s)\big)\,\nabla \pi(a|s,\theta) \]

    Hence:

    \[\theta_{t+1} = \theta_t + \alpha \big(G_t - b(S_t)\big)\,\frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \]

    A natural choice for the baseline is an estimate of the state-value function, \(\hat v(S_t,\mathbf w)\).

    # REINFORCE with Baseline (episodic), for estimating pi_theta = pi*
    Input: a differentiable policy parameterization pi(a|s, theta)
    Input: a differentiable state-value function parameterization v(s, w)
    Algorithm parameters: step sizes alpha_theta > 0, alpha_w > 0
    Initialize policy parameter theta and state-value weights w

    Loop forever (for each episode):
        Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T, following pi(.|., theta)
        Loop for each step of the episode t = 0, 1, ..., T-1:
            G = sum_{k=t+1}^T gamma^{k-t-1} R_k
            delta = G - v(S_t, w)
            w = w + alpha_w gamma^t delta grad(v(S_t, w))
            theta = theta + alpha_theta gamma^t delta grad(ln pi(A_t|S_t, theta))
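
    A sketch of the same loop with a learned linear baseline v(s, w) = w . phi(s); the per-action policy features x_t, the state features phi_t, and the toy episode are again hypothetical placeholders.

    import numpy as np

    def grad_log_pi(theta, x, a):
        # grad ln pi(a|s, theta) for a linear soft-max policy (x: one feature row per action).
        h = x @ theta
        p = np.exp(h - h.max()); p = p / p.sum()
        return x[a] - p @ x

    def reinforce_with_baseline(theta, w, episode, alpha_theta=0.01, alpha_w=0.05, gamma=1.0):
        # episode: list of (x_t, phi_t, a_t, r_{t+1}); phi_t is the state feature vector for the baseline.
        T = len(episode)
        rewards = np.array([r for *_, r in episode])
        for t, (x, phi, a, _) in enumerate(episode):
            G = float((gamma ** np.arange(T - t)) @ rewards[t:])
            delta = G - w @ phi                                  # G_t - v(S_t, w)
            w = w + alpha_w * (gamma ** t) * delta * phi         # grad v(S_t, w) = phi for a linear v
            theta = theta + alpha_theta * (gamma ** t) * delta * grad_log_pi(theta, x, a)
        return theta, w

    rng = np.random.default_rng(0)
    theta, w = np.zeros(3), np.zeros(2)
    episode = [(rng.normal(size=(2, 3)), rng.normal(size=2), 0, 1.0),
               (rng.normal(size=(2, 3)), rng.normal(size=2), 1, 0.0)]
    print(reinforce_with_baseline(theta, w, episode))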
    

    Actor-Critic Method

    The actor-critic method replaces the full Monte Carlo return in REINFORCE with the bootstrapped one-step return, so the update uses the TD error \(\delta_t\):

    \[\begin{aligned} \theta_{t+1} &\doteq \theta_t + \alpha\big(G_{t:t+1} - \hat v(S_t,\mathbf w)\big)\,\frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\\ &= \theta_t + \alpha\big(R_{t+1} + \gamma \hat v(S_{t+1},\mathbf w) - \hat v(S_t,\mathbf w)\big)\,\frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\\ &= \theta_t + \alpha\,\delta_t\,\frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \end{aligned} \]

    # One-step Actor-Critic (episodic), for estimating pi_theta = pi*
    Input: a differentiable policy parameterization pi(a|s, theta)
    Input: a differentiable state-value function parameterization v(s, w)
    Parameters: step sizes alpha_theta > 0, alpha_w > 0
    Initialize policy parameter theta and state-value weights w
    Loop forever (for each episode):
        Initialize S (first state of episode)
        I = 1
        Loop while S is not terminal (for each time step):
            A ~ pi(.|S, theta)
            Take action A, observe S', R
            delta = R + gamma v(S', w) - v(S, w)    (v(S', w) = 0 if S' is terminal)
            w = w + alpha_w I delta grad(v(S, w))
            theta = theta + alpha_theta I delta grad(ln pi(A|S, theta))
            I = gamma I
            S = S'
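
    The per-transition update from the box above, sketched with linear function approximation for both the actor and the critic; the feature shapes and names are illustrative assumptions.

    import numpy as np

    def grad_log_pi(theta, x, a):
        # grad ln pi(a|s, theta) for a linear soft-max policy (x: one feature row per action).
        h = x @ theta
        p = np.exp(h - h.max()); p = p / p.sum()
        return x[a] - p @ x

    def actor_critic_step(theta, w, I, x, phi, a, r, phi_next, terminal,
                          alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
        # One transition of one-step actor-critic with a linear critic v(s, w) = w . phi(s).
        v_next = 0.0 if terminal else w @ phi_next
        delta = r + gamma * v_next - w @ phi               # TD error delta_t
        w = w + alpha_w * I * delta * phi                  # critic update (grad v = phi)
        theta = theta + alpha_theta * I * delta * grad_log_pi(theta, x, a)   # actor update
        return theta, w, gamma * I                         # I carries the gamma^t factor

    rng = np.random.default_rng(0)
    theta, w, I = np.zeros(3), np.zeros(2), 1.0
    theta, w, I = actor_critic_step(theta, w, I,
                                    x=rng.normal(size=(2, 3)), phi=rng.normal(size=2),
                                    a=0, r=1.0, phi_next=rng.normal(size=2), terminal=False)
    print(theta, w, I)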
    
    # Actor-Critic with Eligibility Traces (episodic), for estimating pi_theta = pi*
    Input: a differentiable policy parameterization pi(a|s, theta)
    Input: a differentiable state-value function parameterization v(s, w)
    Parameters: trace-decay rates lambda_theta in [0,1], lambda_w in [0,1]; step sizes alpha_theta > 0, alpha_w > 0
    Initialize policy parameter theta and state-value weights w

    Loop forever (for each episode):
        Initialize S (first state of episode)
        z_theta = 0 (d'-component eligibility trace vector)
        z_w = 0 (d-component eligibility trace vector)
        I = 1
        Loop while S is not terminal (for each time step):
            A ~ pi(.|S, theta)
            Take action A, observe S', R
            delta = R + gamma v(S', w) - v(S, w)    (v(S', w) = 0 if S' is terminal)
            z_w = gamma lambda_w z_w + I grad(v(S, w))
            z_theta = gamma lambda_theta z_theta + I grad(ln pi(A|S, theta))
            w = w + alpha_w delta z_w
            theta = theta + alpha_theta delta z_theta
            I = gamma I
            S = S'
    

    Policy Gradient for Continuing Problems

    For continuing problems, the performance measure has to be defined in terms of the average reward per time step:

    \[\begin{aligned} J(\theta) \doteq r(\pi) &\doteq \lim_{h\rightarrow\infty}\frac{1}{h}\sum_{t=1}^h E[R_t \mid A_{0:t-1}\sim\pi]\\ &= \lim_{t\rightarrow\infty} E[R_t \mid A_{0:t-1}\sim\pi]\\ &= \sum_s \mu(s)\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)\,r \end{aligned} \]

    where \(\mu\) is the steady-state distribution under \(\pi\), \(\mu(s) \doteq \lim_{t\rightarrow\infty}\Pr\{S_t = s \mid A_{0:t}\sim\pi\}\), which is assumed to exist and to be independent of \(S_0\).

    This is a special distribution: if actions are selected according to \(\pi\) starting from this distribution, the resulting distribution over states is unchanged:

    \[\sum_s \mu(s)\sum_a \pi(a|s,\theta)\,p(s'|s,a) = \mu(s'), \qquad \forall s' \in S \]
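
    For the same made-up two-state MDP used earlier, the steady-state distribution can be computed as the normalized left eigenvector of the one-step matrix P_pi with eigenvalue 1, and the equation above checked numerically:

    import numpy as np

    p = np.array([[[0.9, 0.1], [0.2, 0.8]],      # hypothetical p(s'|s, a), as before
                  [[0.3, 0.7], [0.6, 0.4]]])
    pi = np.array([[0.5, 0.5], [0.1, 0.9]])      # pi(a|s, theta)
    P_pi = np.einsum('sa,ast->st', pi, p)        # P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)

    # mu is the left eigenvector of P_pi for eigenvalue 1, normalized to a probability vector.
    vals, vecs = np.linalg.eig(P_pi.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    mu = mu / mu.sum()

    print(mu)
    print(mu @ P_pi)                             # equals mu: sum_s mu(s) sum_a pi(a|s) p(s'|s, a) = mu(s')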

    # Actor-Critic with Eligibility Traces (continuing), for estimating pi_theta = pi*
    Input: a differentiable policy parameterization pi(a|s, theta)
    Input: a differentiable state-value function parameterization v(s, w)
    Parameters: trace-decay rates lambda_theta in [0,1], lambda_w in [0,1]; step sizes alpha_theta > 0, alpha_w > 0, alpha_Rbar > 0
    Initialize R_bar (e.g. to 0)
    Initialize policy parameter theta and state-value weights w
    Initialize S
    z_w = 0 (eligibility trace vector)
    z_theta = 0 (eligibility trace vector)

    Loop forever (for each time step):
        Select A from pi(.|S, theta)
        Take action A, observe S', R
        delta = R - R_bar + v(S', w) - v(S, w)
        R_bar = R_bar + alpha_Rbar delta
        z_w = lambda_w z_w + grad(v(S, w))
        z_theta = lambda_theta z_theta + grad(ln pi(A|S, theta))
        w = w + alpha_w delta z_w
        theta = theta + alpha_theta delta z_theta
        S = S'
    

    Policy Parameterization for Continuous Actions

    When the action space is continuous, policy-based methods instead learn the statistics of an action distribution. If each action can be described by a real number, the policy can be defined by the density of a normal distribution:

    \[\pi(a|s,\pmb\theta) \doteq \frac{1}{\sigma(s,\pmb\theta)\sqrt{2\pi}}\exp\bigg(-\frac{\big(a - \mu(s,\pmb\theta)\big)^2}{2\sigma(s,\pmb\theta)^2}\bigg) \]

    where \(\mu(s,\pmb\theta)\) and \(\sigma(s,\pmb\theta)\) are parameterized approximations of the mean and standard deviation (note that \(\mu\) here denotes the mean, not the on-policy distribution used earlier).
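
    A minimal sketch of this Gaussian parameterization with a linear mean and a log-linear standard deviation (so that sigma stays positive); the feature vector and the split of theta into a mean part and a standard-deviation part are illustrative assumptions, not the only choice. The resulting grad ln pi plugs directly into any of the REINFORCE or actor-critic updates above.

    import numpy as np

    def gaussian_policy(theta, phi):
        # mu(s, theta) = theta_mu . phi(s),  sigma(s, theta) = exp(theta_sigma . phi(s)).
        theta_mu, theta_sigma = np.split(theta, 2)
        return theta_mu @ phi, np.exp(theta_sigma @ phi)

    def sample_and_grad_log_pi(theta, phi, rng):
        # Sample a ~ N(mu, sigma^2) and return (a, grad_theta ln pi(a|s, theta)).
        mu, sigma = gaussian_policy(theta, phi)
        a = rng.normal(mu, sigma)
        grad_mu = (a - mu) / sigma**2 * phi                   # d ln pi / d theta_mu
        grad_sigma = ((a - mu)**2 / sigma**2 - 1.0) * phi     # d ln pi / d theta_sigma
        return a, np.concatenate([grad_mu, grad_sigma])

    rng = np.random.default_rng(0)
    phi = np.array([1.0, 0.5, -0.2])           # made-up state features
    theta = np.zeros(6)                        # first 3 components for mu, last 3 for log sigma
    a, g = sample_and_grad_log_pi(theta, phi, rng)
    print(a, g)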

  • Original article: https://www.cnblogs.com/vpegasus/p/pg.html