  • Reinforcement Learning (4): Monte Carlo Methods

    Monte Carlo Methods

    MC methods do not require complete knowledge of the environment; being able to sample from it is enough. MC methods are based on averaging sample returns. They are usually applied to episodic tasks.

    Monte Carlo Prediction

    First-visit MC vs. every-visit MC: the former averages only the returns following the first visit to a state within each episode, while the latter averages the returns following every visit.

    # First-visit MC prediction, for estimating V = v_pi
    
    Input: a policy pi to be evaluated
    Initialize:
        V(s), arbitrarily, for all s in S
        Returns(s) = list(), for all s in S
        
    While True:
        Generate an episode following pi: S0,A0,R1,S1,A1,R2,...,ST-1,AT-1,RT
        G = 0
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
            G = gamma * G + R_{t+1}
            if St not in S0, S1, ..., S_{t-1}:  # first visit of St in this episode
                Returns(St).append(G)
                V(St) = mean(Returns(St))
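
    As a concrete illustration, here is a minimal Python sketch of first-visit MC prediction. It assumes a hypothetical helper sample_episode(policy) that returns one complete episode as a list of (state, action, reward) triples; everything else follows the pseudocode above.

    # Minimal sketch of first-visit MC prediction; sample_episode(policy) is a
    # hypothetical helper returning [(S0, A0, R1), (S1, A1, R2), ...] for one episode.
    from collections import defaultdict

    def first_visit_mc_prediction(policy, sample_episode, num_episodes=10000, gamma=1.0):
        returns = defaultdict(list)   # Returns(s): all first-visit returns observed for s
        V = defaultdict(float)        # V(s): current estimate (arbitrary init: 0)

        for _ in range(num_episodes):
            episode = sample_episode(policy)
            states = [s for s, _, _ in episode]
            G = 0.0
            for t in range(len(episode) - 1, -1, -1):   # t = T-1, ..., 0
                s, _, r = episode[t]                    # r is R_{t+1}
                G = gamma * G + r
                if s not in states[:t]:                 # first visit of s in this episode
                    returns[s].append(G)
                    V[s] = sum(returns[s]) / len(returns[s])
        return V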
    

    Monte Carlo Estimation of Action Values

    Exploring starts: every (s, a) pair has a nonzero probability of being selected as the start of an episode, which guarantees that all state-action pairs are visited.

    # Monte Carlo ES(Exploring Starts), for estimating pi = pi*
    
    # Initialize:
    pi(s) for all s in S
    Q(s,a) for all s in S,a in A(s)
    Returns(s,a) = list() for all s in S, a in A(s)
    While True:
        Choose S0 in S and A0 in A(S0) such that every (s,a) pair has probability > 0  # exploring start
        Generate an episode from S0, A0, following pi: S0,A0,R1,...,ST-1,AT-1,RT
        G = 0
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
            G = gamma * G + R_{t+1}
            if (St, At) not in S0,A0, S1,A1, ..., S_{t-1},A_{t-1}:  # first visit of (St, At)
                Returns(St,At).append(G)
                Q(St,At) = mean(Returns(St,At))
                pi(St) = argmax_a Q(St,a)
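
    A minimal Python sketch of Monte Carlo ES under the same conventions; all_states, actions(s) and sample_episode_from(s0, a0, policy) are hypothetical helpers, the last one generating an episode whose first state-action pair is forced to (s0, a0) and which then follows policy.

    # Minimal sketch of Monte Carlo ES; all helpers named here are hypothetical.
    import random
    from collections import defaultdict

    def mc_es(all_states, actions, sample_episode_from, num_episodes=10000, gamma=1.0):
        Q = defaultdict(float)
        returns = defaultdict(list)
        pi = {s: actions(s)[0] for s in all_states}    # arbitrary deterministic policy

        for _ in range(num_episodes):
            # Exploring start: every (s, a) pair has nonzero probability of starting an episode
            s0 = random.choice(all_states)
            a0 = random.choice(actions(s0))
            episode = sample_episode_from(s0, a0, pi)  # [(S0, A0, R1), ...]
            pairs = [(s, a) for s, a, _ in episode]
            G = 0.0
            for t in range(len(episode) - 1, -1, -1):  # t = T-1, ..., 0
                s, a, r = episode[t]
                G = gamma * G + r
                if (s, a) not in pairs[:t]:            # first visit of (St, At)
                    returns[(s, a)].append(G)
                    Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                    pi[s] = max(actions(s), key=lambda b: Q[(s, b)])  # greedy improvement
        return pi, Q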
    

    Monte Carlo Control

    \[ \pi_0 \xrightarrow{\quad E \quad} q_{\pi_0} \xrightarrow{\quad I \quad} \pi_1 \xrightarrow{\quad E \quad} \cdots \xrightarrow{\quad I \quad} \pi_* \xrightarrow{\quad E \quad} q_* \]

    Monte Carlo Control without Exploring Starts

    The exploring-starts assumption often fails to hold in practice, so it is not very general. If we drop this somewhat unrealistic premise, we can only assume that every action will be selected infinitely often. There are two ways to guarantee this: on-policy methods and off-policy methods. An on-policy method evaluates and improves the same policy that is used to make decisions, whereas an off-policy method evaluates and improves a policy different from the one used to generate the data. On-policy methods are generally simpler and are usually considered first; off-policy methods, because of the presence of a second, different policy, require extra machinery, and compared with on-policy methods they tend to have higher variance and converge more slowly.

    On-policy methods are usually soft, i.e. \(\pi(a|s) > 0, \; \forall s \in S, a \in A(s)\), but they are gradually moved toward a deterministic optimal policy.

    # on-policy first-visit MC control (for epsilon-soft policies), estimating pi = pi*
    
    # Initialize:
    pi = an arbitrary epsilon-soft policy
    Q(s,a) arbitrarily for all s in S, a in A(s)
    Returns(s,a) = list() for all s in S, a in A(s)
    
    while True:
        Generate an episode following pi: S0,A0,R1,...,ST-1,AT-1,RT
        G = 0
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
            G = gamma * G + R_{t+1}
            if (St, At) not in S0,A0, S1,A1, ..., S_{t-1},A_{t-1}:  # first visit of (St, At)
                Returns(St,At).append(G)
                Q(St,At) = mean(Returns(St,At))
                A* = argmax_a Q(St,a)
                for a in A(St):
                    if a == A*:
                        pi(a|St) = 1 - epsilon + epsilon/|A(St)|
                    else:
                        pi(a|St) = epsilon/|A(St)|
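
    A minimal Python sketch of the epsilon-soft on-policy control loop above; actions(s) and sample_episode(policy) are hypothetical helpers, where policy(s) returns a dict of action probabilities that the episode generator samples from.

    # Minimal sketch of on-policy first-visit MC control with an epsilon-soft policy;
    # actions(s) and sample_episode(policy) are hypothetical helpers.
    from collections import defaultdict

    def on_policy_mc_control(actions, sample_episode, num_episodes=10000, gamma=1.0, epsilon=0.1):
        Q = defaultdict(float)
        returns = defaultdict(list)
        probs = {}                                   # pi(a|s), built lazily

        def policy(s):
            if s not in probs:                       # unseen states start uniform (epsilon-soft)
                probs[s] = {a: 1.0 / len(actions(s)) for a in actions(s)}
            return probs[s]

        for _ in range(num_episodes):
            episode = sample_episode(policy)         # [(S0, A0, R1), ...]
            pairs = [(s, a) for s, a, _ in episode]
            G = 0.0
            for t in range(len(episode) - 1, -1, -1):
                s, a, r = episode[t]
                G = gamma * G + r
                if (s, a) not in pairs[:t]:          # first visit of (St, At)
                    returns[(s, a)].append(G)
                    Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                    a_star = max(actions(s), key=lambda b: Q[(s, b)])
                    n = len(actions(s))
                    probs[s] = {b: 1 - epsilon + epsilon / n if b == a_star else epsilon / n
                                for b in actions(s)}
        return policy, Q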
    

    Off-policy Prediction via Importance Sampling

    All learning control methods face the exploration-exploitation dilemma: on the one hand, to learn the value of each action the agent should behave optimally thereafter; on the other hand, to find the optimal actions it has to try all kinds of actions, which conflicts with always choosing the best one. Off-policy methods use two policies at once: one that is being learned as the optimal policy, called the target policy, and another that generates the data (behavior), called the behavior policy.

    Off-policy learning has a prerequisite: every action that can be taken under the target policy must also be possible under the behavior policy, i.e. if \(\pi(a|s) > 0\) then \(b(a|s) > 0\). This is called the assumption of coverage.

    Importance Sampling

    Importance sampling is a general technique for estimating expected values under one distribution given samples drawn from another. Applied to off-policy learning, it weights the returns by the relative probability of the trajectory under the target and behavior policies, known as the importance-sampling ratio (IS ratio).

    Under the target policy, the probability of the trajectory following time t is

    \[ \Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} \\ = \pi(A_t|S_t)\,p(S_{t+1}|S_t,A_t)\,\pi(A_{t+1}|S_{t+1}) \cdots p(S_T|S_{T-1},A_{T-1}) \\ = \prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k) \]

    where \(p\) is the state-transition probability.

    The IS ratio is then:

    \[ \rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\,p(S_{k+1}|S_k,A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)} \]

    Note that the IS ratio ultimately depends only on the two policies and the sampled trajectory, not on the MDP dynamics (the state-transition probabilities cancel out).

    We want to estimate the expected return under the target policy, but the returns we observe are generated by the behavior policy, so the plain expectation gives the wrong value function: \(E[G_t|S_t = s] = v_b(s)\). This is where the IS ratio comes in: \(E[\rho_{t:T-1} G_t|S_t = s] = v_{\pi}(s)\).

    The value function can then be estimated as:

    \[ V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{|J(s)|} \]

    where \(J(s)\) denotes the set of all time steps at which state \(s\) is visited. The estimator above is called ordinary importance sampling; the alternative is weighted importance sampling:

    \[ V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in J(s)} \rho_{t:T(t)-1}} \]
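
    A small self-contained sketch contrasting the two estimators on a hypothetical one-step problem: a single state, a behavior policy b that picks 'left' or 'right' uniformly, and a target policy pi that always picks 'left' (all numbers are made up for illustration).

    # Toy comparison of ordinary vs. weighted importance sampling (hypothetical numbers).
    import random

    random.seed(0)
    pi = {'left': 1.0, 'right': 0.0}          # target policy pi(a|s)
    b  = {'left': 0.5, 'right': 0.5}          # behavior policy b(a|s)
    reward = {'left': 1.0, 'right': 0.0}      # deterministic one-step return G_t

    rhos, returns = [], []
    for _ in range(1000):
        a = random.choice(['left', 'right'])  # action sampled from b
        rhos.append(pi[a] / b[a])             # IS ratio for this one-step episode
        returns.append(reward[a])

    ordinary = sum(r * g for r, g in zip(rhos, returns)) / len(returns)
    weighted = sum(r * g for r, g in zip(rhos, returns)) / sum(rhos)
    print(ordinary, weighted)                 # both are close to v_pi(s) = 1.0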

    Incremental Implementation

    \[ V_{n+1} = \frac{\sum_{k=1}^{n} W_k G_k}{\sum_{k=1}^{n} W_k} = V_n + \frac{W_n}{C_n}\bigl[G_n - V_n\bigr], \quad \text{where } C_{n+1} = C_n + W_{n+1}, \; C_0 = 0, \; n \ge 1 \]
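
    A minimal sketch of this incremental rule, checked against the direct weighted average (the sample numbers are arbitrary):

    def incremental_weighted_average(weighted_samples):
        """weighted_samples: iterable of (G, W) pairs; returns sum(W*G)/sum(W) incrementally."""
        V, C = 0.0, 0.0
        for G, W in weighted_samples:
            if W == 0:
                continue                 # a zero weight leaves the estimate unchanged
            C += W                       # C_n accumulates the weights seen so far
            V += (W / C) * (G - V)       # V_{n+1} = V_n + (W_n / C_n) * (G_n - V_n)
        return V

    samples = [(1.0, 2.0), (0.0, 1.0), (3.0, 0.5)]          # (G_k, W_k) pairs
    direct = sum(W * G for G, W in samples) / sum(W for _, W in samples)
    assert abs(incremental_weighted_average(samples) - direct) < 1e-9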

    # off-policy MC prediction (policy evaluation) for estimating Q = q_pi
    
    # Initialize, for all s in S, a in A(s):
    Q(s,a)
    C(s,a) = 0
    while True:
        b = any policy with coverage of pi
        generate an episode following b: S0,A0,R1,...,ST-1,AT-1,RT
        G = 0
        W = 1
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
            G = gamma * G + R_{t+1}
            C(St,At) = C(St,At) + W
            Q(St,At) = Q(St,At) + W/C(St,At) * [G - Q(St,At)]
            W = W * pi(At|St)/b(At|St)
            if W == 0:
                break
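
    A minimal Python sketch of this off-policy evaluation loop; target_prob(s, a), behavior_prob(s, a) and sample_episode_b() are hypothetical helpers giving pi(a|s), b(a|s) and one episode generated by following b.

    # Minimal sketch of off-policy MC prediction with weighted importance sampling;
    # all helpers named here are hypothetical.
    from collections import defaultdict

    def off_policy_mc_prediction(target_prob, behavior_prob, sample_episode_b,
                                 num_episodes=10000, gamma=1.0):
        Q = defaultdict(float)   # Q(s, a), arbitrary initialisation
        C = defaultdict(float)   # C(s, a): cumulative sum of the weights W

        for _ in range(num_episodes):
            episode = sample_episode_b()             # [(S0, A0, R1), ...] following b
            G, W = 0.0, 1.0
            for t in range(len(episode) - 1, -1, -1):
                s, a, r = episode[t]
                G = gamma * G + r
                C[(s, a)] += W
                Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
                W *= target_prob(s, a) / behavior_prob(s, a)
                if W == 0:                           # the rest of the episode contributes nothing
                    break
        return Q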
    
    # off-policy MC control for estimating pi=pi*
    
    # Initialize, for all s in S, a in A(s):
    Q(s,a)
    C(s,a) = 0
    pi(s) = argmax_a Q(s,a) (with ties broken consistently)
    while True:
        b = any soft policy
        Generate an episode using b: S0,A0,R1,...,ST-1,AT-1,RT
        G = 0
        W = 1
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
            G = gamma * G + R_{t+1}
            C(St,At) = C(St,At) + W
            Q(St,At) = Q(St,At) + W/C(St,At) * [G - Q(St,At)]
            pi(St) = argmax_a Q(St,a) (with ties broken consistently)
            if At != pi(St):
                break
            else:
                W = W * 1/b(At|St)
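
    A minimal Python sketch of this control loop, using an epsilon-soft behavior policy built around the current greedy target policy (one valid choice of a soft b); all_states, actions(s) and sample_episode(behavior) are hypothetical helpers.

    # Minimal sketch of off-policy MC control with weighted importance sampling;
    # all helpers named here are hypothetical.
    from collections import defaultdict

    def off_policy_mc_control(all_states, actions, sample_episode,
                              num_episodes=10000, gamma=1.0, epsilon=0.1):
        Q = defaultdict(float)
        C = defaultdict(float)
        pi = {s: actions(s)[0] for s in all_states}   # greedy target, ties broken consistently

        def behavior(s):                              # epsilon-soft policy around the greedy target
            n = len(actions(s))
            return {a: 1 - epsilon + epsilon / n if a == pi[s] else epsilon / n
                    for a in actions(s)}

        for _ in range(num_episodes):
            episode = sample_episode(behavior)        # [(S0, A0, R1), ...] following b
            G, W = 0.0, 1.0
            for t in range(len(episode) - 1, -1, -1):
                s, a, r = episode[t]
                G = gamma * G + r
                C[(s, a)] += W
                Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
                pi[s] = max(actions(s), key=lambda x: Q[(s, x)])
                if a != pi[s]:                        # pi is deterministic, so rho would be 0
                    break
                W *= 1.0 / behavior(s)[a]             # pi(a|s) = 1 for the greedy action
        return pi, Q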
    
  • Original post: https://www.cnblogs.com/vpegasus/p/mc.html