  • Reinforcement Learning (8): Eligibility Traces

    Eligibility Traces

    Eligibility traces are a basic and important concept in reinforcement learning. Almost every TD algorithm can be combined with eligibility traces to obtain a more general algorithm, and usually a more efficient one.

    Eligibility traces unify TD and Monte Carlo methods. We have already seen n-step methods bridge the two; the main advantage of eligibility traces over n-step methods is computational efficiency: only a single trace vector is needed, rather than storing the last n feature vectors. In addition, learning with eligibility traces proceeds step by step, instead of being delayed until n steps later.

    The eligibility-trace mechanism is implemented with a short-term memory vector, the eligibility trace \(\pmb{z}_t \in \mathbb{R}^d\), which parallels the algorithm's weight vector \(\pmb{w}_t \in \mathbb{R}^d\). The rough idea is this: whenever a component of the weight vector participates in producing a value estimate, the corresponding component of the eligibility trace is bumped up immediately and then decays away gradually.

    Many reinforcement-learning algorithms "look ahead": the value update for the current state or action depends on future states or actions. Such algorithms take the so-called forward view. The forward view is somewhat inconvenient to implement, since the future is not known at the current step. The alternative is the backward view, which uses the eligibility trace to account for the states or actions visited most recently.

    The \(\lambda\)-return

    Recall the n-step return, defined as the sum of the first n rewards plus the estimated value of the state reached after n steps:

    \[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n} + \gamma^n \hat v(S_{t+n}, w_{t+n-1})\]

    A valid update need not depend on a single n-step return; an average of several n-step returns can work just as well, for example:

    \[\tfrac{1}{2}G_{t:t+2} + \tfrac{1}{2}G_{t:t+4}\]

    The \(\lambda\)-return can be understood as a particular way of averaging n-step returns: it averages all of them:

    \[G_t^{\lambda} \doteq (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t:t+n} = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}G_{t:t+n} + \lambda^{T-t-1}G_t\]
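    As a concrete illustration (a minimal sketch not taken from the original post; the function name and argument layout are my own), the following Python computes \(G_0^{\lambda}\) for a finished episode directly from the definition above:

    # Illustrative sketch: off-line lambda-return for t = 0 of a finished episode.
    def lambda_return(rewards, values, gamma, lam):
        # rewards = [R_1, ..., R_T]; values = [v_hat(S_1), ..., v_hat(S_T)],
        # with values[-1] = 0 because S_T is terminal.
        T = len(rewards)
        # n-step returns G_{0:n} for n = 1..T
        n_step = [sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * values[n - 1]
                  for n in range(1, T + 1)]
        # weights (1 - lam) * lam^(n-1) for n < T, remaining weight lam^(T-1) on the full return
        return ((1 - lam) * sum(lam**(n - 1) * n_step[n - 1] for n in range(1, T))
                + lam**(T - 1) * n_step[T - 1])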

    A learning algorithm based on the \(\lambda\)-return is, for example, the off-line \(\lambda\)-return algorithm:

    \[w_{t+1} \doteq w_t + \alpha\,[G_t^{\lambda} - \hat v(S_t, w_t)]\,\nabla \hat v(S_t, w_t), \qquad t = 0, \dots, T-1.\]

    This is exactly a forward-view algorithm.

    TD(\(\lambda\))

    TD(\(\lambda\)) improves on the off-line \(\lambda\)-return algorithm above in three ways: it updates the weights at every step; because of that, its computation is spread evenly over the time steps; and it applies to continuing problems, not only to episodic ones.

    As stated above, the eligibility trace is a vector \(\pmb{z}_t \in \mathbb{R}^d\) with the same dimension as \(\pmb{w}_t\). The weight vector \(\pmb{w}_t\) can be viewed as long-term memory, while the eligibility trace is short-term memory:

    \[z_{-1} \doteq 0, \qquad z_t \doteq \gamma\lambda z_{t-1} + \nabla\hat v(S_t, w_t), \qquad 0 \le t \le T\]

    Here \(\gamma\) is the discount rate and \(\lambda\) is the trace-decay parameter. The eligibility trace records how much each component of the weight vector has contributed to recent value estimates.

    TD error:

    \[\delta_t \doteq R_{t+1} + \gamma\hat v(S_{t+1}, w_t) - \hat v(S_t, w_t)\]

    The weight vector is then updated as:

    \[w_{t+1} \doteq w_t + \alpha\,\delta_t\, z_t\]

    # Semi-gradient TD(lambda) for estimating v_hat ~= v_pi
    Input: the policy pi to be evaluated
    Input: a differentiable function v_hat: S_plus x R^d --> R such that v_hat(terminal, .) = 0
    Algorithm parameters: step size alpha > 0, trace decay rate lambda in [0,1]
    Initialize value-function weights w arbitrarily (e.g. w = 0)
    Loop for each episode:
        Initialize S
        z = 0
        Loop for each step of episode:
            Choose action A based on pi(.|S)
            Take A, observe R, S'
            z = gamma * lambda * z + grad v_hat(S, w)
            delta = R + gamma * v_hat(S', w) - v_hat(S, w)
            w = w + alpha * delta * z
            S = S'
        until S' is terminal
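
    For reference, here is a runnable Python counterpart of the box above. It is a minimal sketch assuming linear function approximation and a hypothetical environment interface (env.reset(), env.step(a) returning (next state, reward, done), a policy(state) function, and a feature map x(state)), none of which come from the original post:

    import numpy as np

    def semi_gradient_td_lambda(env, policy, x, d, alpha=0.01, gamma=0.99, lam=0.9,
                                num_episodes=100):
        # Linear value function: v_hat(s, w) = w @ x(s), so grad v_hat(s, w) = x(s).
        w = np.zeros(d)
        for _ in range(num_episodes):
            s = env.reset()
            z = np.zeros(d)                       # eligibility trace (short-term memory)
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                z = gamma * lam * z + x(s)        # bump the trace, let it decay over time
                v = w @ x(s)
                v_next = 0.0 if done else w @ x(s_next)   # v_hat(terminal, .) = 0
                delta = r + gamma * v_next - v    # TD error
                w = w + alpha * delta * z
                s = s_next
        return w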
    

    TD(\(\lambda\)) is oriented backward in time: from the current state it looks back over the states already visited.

    It is easy to see that when \(\lambda = 0\), TD(\(\lambda\)) reduces to the one-step semi-gradient update, i.e. TD(0); when \(\lambda = 1\), TD(1) behaves like a Monte Carlo method.

    n-step Truncated \(\lambda\)-return Methods

    The off-line \(\lambda\)-return algorithm has a serious limitation: the \(\lambda\)-return is not known until the end of the episode, and in the continuing case it is never available at all. Since in general \(\gamma\lambda < 1\), the influence of states far in the future dies out, which suggests an approximation to the \(\lambda\)-return called the truncated \(\lambda\)-return:

    \[G_{t:h}^{\lambda} \doteq (1-\lambda)\sum_{n=1}^{h-t-1}\lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1}G_{t:h}, \qquad 0 \le t < h \le T\]

    In the same way we obtain truncated TD(\(\lambda\)), or TTD(\(\lambda\)):

    \[w_{t+n} \doteq w_{t+n-1} + \alpha\,[G_{t:t+n}^{\lambda} - \hat v(S_t, w_{t+n-1})]\,\nabla\hat v(S_t, w_{t+n-1})\]

    This algorithm can be rewritten in an efficient form that does not explicitly depend on n (though the underlying theoretical relationship still does):

    \[G_{t:t+k}^{\lambda} = \hat v(S_t, w_{t-1}) + \sum_{i=t}^{t+k-1}(\gamma\lambda)^{i-t}\delta_i'\]

    where \(\delta_t' \doteq R_{t+1} + \gamma\hat v(S_{t+1}, w_t) - \hat v(S_t, w_{t-1})\).
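
    To make the recursion concrete, here is a small sketch that evaluates the right-hand side using a single, fixed set of value estimates (a simplification: the per-step weight indices \(w_{t-1}, w_i\) above are ignored for readability, and the names are illustrative):

    # Illustrative: G_{t:t+k}^lambda via the sum of modified TD errors,
    # with one fixed value estimate per state instead of time-indexed weights.
    def truncated_lambda_return(t, k, rewards, values, gamma, lam):
        # rewards[i] = R_{i+1}, values[i] = v_hat(S_i)
        g = values[t]
        for i in range(t, t + k):
            delta_i = rewards[i] + gamma * values[i + 1] - values[i]
            g += (gamma * lam) ** (i - t) * delta_i
        return g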

    Redoing Updates: The On-line \(\lambda\)-return Algorithm

    Choosing n in TTD(\(\lambda\)) involves a trade-off: a large n brings the algorithm closer to the off-line \(\lambda\)-return algorithm, while a small n lets parameter updates happen without a long wait. At the cost of some extra computation, both benefits can be had.

    The idea is this: at every time step, once new data arrive, all the updates of the current episode are redone from the beginning.

    \[w_{t+1}^h \doteq w_t^h + \alpha\,[G_{t:h}^{\lambda} - \hat v(S_t, w_t^h)]\,\nabla\hat v(S_t, w_t^h), \qquad 0 \le t < h \le T.\]

    With \(w_t \doteq w_t^t\), this is the on-line \(\lambda\)-return algorithm.

    True On-line TD(\(\lambda\))

    With the help of the eligibility trace, the forward view of the on-line \(\lambda\)-return algorithm can be turned into a backward view; the result is true online TD(\(\lambda\)):

    \[w_{t+1} \doteq w_t + \alpha\,\delta_t z_t + \alpha\,(w_t^\top x_t - w_{t-1}^\top x_t)(z_t - x_t)\]

    where \(\hat v(s, w) = w^\top x(s)\), \(x_t \doteq x(S_t)\), \(\delta_t \doteq R_{t+1} + \gamma\hat v(S_{t+1}, w_t) - \hat v(S_t, w_t)\), and \(z_t \doteq \gamma\lambda z_{t-1} + (1 - \alpha\gamma\lambda\, z_{t-1}^\top x_t)\,x_t\).

    # True Online TD(lambda) for estimating w^T x ~= v_pi
    Input: the policy pi to be evaluated
    Input: a feature function x: S_plus --> R^d such that x(terminal) = 0
    Algorithm parameters: step size alpha > 0, trace decay rate lambda in [0,1]
    Initialize value-function weights w in R^d
    
    Loop for each episode:
        Initialize state and obtain initial feature vector x
        z = 0
        V_old = 0
        Loop for each step of episode:
            Choose A from pi(.|S)
            Take action A, observe R, x' (feature vector of the next state)
            V = w^T x
            V' = w^T x'
            delta = R + gamma V' - V
            z = gamma lambda z + (1 - alpha gamma lambda z^T x) x
            w = w + alpha (delta + V - V_old) z - alpha (V - V_old) x
            V_old = V'
            x = x'
        until x' = 0 (signaling arrival at a terminal state)
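
    A runnable Python version of the box above, again a sketch assuming linear features and the same hypothetical env/policy/x interface used earlier:

    import numpy as np

    def true_online_td_lambda(env, policy, x, d, alpha=0.01, gamma=0.99, lam=0.9,
                              num_episodes=100):
        # Linear value function v_hat(s, w) = w @ x(s), with a dutch eligibility trace.
        w = np.zeros(d)
        for _ in range(num_episodes):
            s = env.reset()
            feat = x(s)
            z = np.zeros(d)
            v_old = 0.0
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                feat_next = np.zeros(d) if done else x(s_next)   # x(terminal) = 0
                v = w @ feat
                v_next = w @ feat_next
                delta = r + gamma * v_next - v
                z = gamma * lam * z + (1 - alpha * gamma * lam * (z @ feat)) * feat
                w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * feat
                v_old = v_next
                feat = feat_next
                s = s_next
        return w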
    

    The eligibility trace used here is called a dutch trace. Another kind is the replacing trace:

    \[z_{i,t} \doteq \begin{cases} 1 & \text{if } x_{i,t} = 1,\\ \gamma\lambda z_{i,t-1} & \text{otherwise.}\end{cases}\]
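
    For binary feature vectors, the accumulating, dutch, and replacing traces differ only in how the active components are bumped. A small, illustrative sketch (the function name and signature are my own):

    import numpy as np

    def update_trace(z, x, gamma, lam, alpha, kind="accumulating"):
        # One eligibility-trace update for a binary feature vector x.
        if kind == "accumulating":
            return gamma * lam * z + x
        if kind == "dutch":                     # the trace used by true online TD(lambda)
            return gamma * lam * z + (1 - alpha * gamma * lam * (z @ x)) * x
        if kind == "replacing":                 # reset active components to 1, decay the rest
            return np.where(x == 1, 1.0, gamma * lam * z)
        raise ValueError(kind)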

    Dutch Traces in Monte Carlo Learning

    Eligibility traces can also be applied to Monte Carlo-style algorithms.

    In the linear version of the gradient Monte Carlo algorithm:

    \[w_{t+1} \doteq w_t + \alpha\,[G - w_t^\top x_t]\,x_t, \qquad 0 \le t < T.\]

    Here G is the return obtained at the end of the episode, so it carries no time subscript and no discounting. Because MC methods can only update the parameters once the episode ends, we do not try to change that property; instead we try to spread the computation evenly over the time steps.

    \[\begin{aligned} w_T &= w_{T-1} + \alpha(G - w_{T-1}^\top x_{T-1})x_{T-1}\\ &= w_{T-1} + \alpha x_{T-1}(-x_{T-1}^\top w_{T-1}) + \alpha G x_{T-1}\\ &= (I - \alpha x_{T-1}x_{T-1}^\top)\,w_{T-1} + \alpha G x_{T-1}\\ &= F_{T-1}w_{T-1} + \alpha G x_{T-1} \end{aligned}\]

    where \(F_t \doteq I - \alpha x_t x_t^\top\) is a forgetting (or fading) matrix, and the final weight vector can be obtained by unrolling this recursion.
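
    Unrolling the recursion one more step (a direct consequence of the derivation above, spelled out here for clarity) shows how the computation can be spread over the time steps:

    \[\begin{aligned} w_T &= F_{T-1}\bigl(F_{T-2}w_{T-2} + \alpha G x_{T-2}\bigr) + \alpha G x_{T-1}\\ &= F_{T-1}F_{T-2}\,w_{T-2} + \alpha G\,\bigl(F_{T-1}x_{T-2} + x_{T-1}\bigr), \end{aligned}\]

    so the product of fading matrices and the bracketed sum can both be maintained incrementally during the episode, with the single multiplication by \(G\) deferred to the end.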

    Sarsa(\(\lambda\))

    Sarsa(\(\lambda\)) is a TD method analogous to TD(\(\lambda\)), except that it works on state-action pairs:

    \[w_{t+1} \doteq w_t + \alpha\,\delta_t z_t\]

    where:

    \[\delta_t \doteq R_{t+1} + \gamma\hat q(S_{t+1}, A_{t+1}, w_t) - \hat q(S_t, A_t, w_t)\]

    and:

    \[z_{-1} = 0, \qquad z_t = \gamma\lambda z_{t-1} + \nabla\hat q(S_t, A_t, w_t), \qquad 0 \le t \le T.\]

    # Sarsa(lambda) with binary features and linear function approximation for estimating w^T x ~= q_pi or q*
    Input: a function F(s, a) returning the set of (indices of) active features for s, a
    Input: a policy pi (if estimating q_pi)
    Algorithm parameters: step size alpha > 0, trace decay rate lambda in [0,1]
    Initialize: w = (w_1, ..., w_d)^T in R^d, z = (z_1, ..., z_d)^T in R^d
    
    Loop for each episode:
        Initialize S
        Choose A from pi(.|S) or epsilon-greedy according to q_hat(S, ., w)
        z = 0
        Loop for each step of episode:
            Take action A, observe R, S'
            delta = R
            Loop for i in F(S, A):
                delta = delta - w_i
                z_i = z_i + 1  (accumulating traces)
                or z_i = 1  (replacing traces)
            If S' is terminal then:
                w = w + alpha delta z
                Go to next episode
            Choose A' from pi(.|S') or near-greedily according to q_hat(S', ., w)
            Loop for i in F(S', A'): delta = delta + gamma w_i
            w = w + alpha delta z
            z = gamma lambda z
            S = S'; A = A'
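
    A runnable Python counterpart of the binary-feature pseudocode above (a sketch: the env interface, select_action, and active_features helpers are assumptions standing in for the inputs F(s, a) and pi):

    import numpy as np

    def sarsa_lambda_binary(env, select_action, active_features, d, alpha=0.1,
                            gamma=0.99, lam=0.9, num_episodes=100, replacing=False):
        # Binary-feature Sarsa(lambda): q_hat(s, a, w) = sum of w_i over i in F(s, a).
        # select_action(state, w) -> action (e.g. epsilon-greedy on q_hat);
        # active_features(state, action) -> iterable of active feature indices F(s, a).
        w = np.zeros(d)
        for _ in range(num_episodes):
            s = env.reset()
            a = select_action(s, w)
            z = np.zeros(d)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                delta = r
                for i in active_features(s, a):
                    delta -= w[i]
                    z[i] = 1.0 if replacing else z[i] + 1.0
                if done:
                    w += alpha * delta * z
                    break
                a_next = select_action(s_next, w)
                for i in active_features(s_next, a_next):
                    delta += gamma * w[i]
                w += alpha * delta * z
                z *= gamma * lam
                s, a = s_next, a_next
        return w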
    
    # True Online Sarsa(lambda) for estimating w^T x ~= q_pi or q*
    Input: a feature function x: S_plus x A --> R^d such that x(terminal, .) = 0
    Input: a policy pi (if estimating q_pi)
    Algorithm parameters: step size alpha > 0, trace decay rate lambda in [0,1]
    Initialize: w in R^d
    
    Loop for each episode:
        Initialize S
        Choose A from pi(.|S) or near-greedily from S using w
        x = x(S, A)
        z = 0
        Q_old = 0
        Loop for each step of episode:
            Take action A, observe R, S'
            Choose A' from pi(.|S') or near-greedily from S' using w
            x' = x(S', A')
            Q = w^T x
            Q' = w^T x'
            delta = R + gamma Q' - Q
            z = gamma lambda z + (1 - alpha gamma lambda z^T x) x
            w = w + alpha (delta + Q - Q_old) z - alpha (Q - Q_old) x
            Q_old = Q'
            x = x'
            A = A'
        until S' is terminal
    

    Off-policy Eligibility Traces with Control Variates

    Watkins's Q(\(\lambda\)) to Tree-Backup(\(\lambda\))

    Stable Off-policy Methods with Traces

  • Original post: https://www.cnblogs.com/vpegasus/p/et.html