  • Reinforcement Learning (4): Monte Carlo Methods

    Monte Carlo Methods

    MC methods do not require complete knowledge of the environment; they only need the ability to sample from it. MC methods are based on averaging sample returns, and they are typically applied to episodic tasks.

    Monte Carlo Prediction

    First-visit MC vs. every-visit MC: first-visit MC averages only the returns that follow the first visit to a state within each episode, while every-visit MC averages the returns that follow every visit.

    # First-visit MC prediction, for estimating V ≈ v_pi
    
    Input: a policy pi to be evaluated
    Initialize:
        V(s) arbitrarily, for all s in S
        Returns(s) = list(), for all s in S
    
    while True:  # loop for each episode
        Generate an episode following pi: S0,A0,R1,S1,A1,R2,...,S_{T-1},A_{T-1},R_T
        G = 0
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0, working backwards
            G = gamma * G + R_{t+1}
            if St not in (S0, S1, ..., S_{t-1}):  # first visit to St in this episode
                Returns(St).append(G)
                V(St) = mean(Returns(St))
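
    As a concrete illustration of the pseudocode above, here is a minimal runnable Python sketch, assuming a toy 5-state random-walk environment and an equiprobable policy; the environment, `sample_episode`, and the constants are illustrative assumptions, not part of the original text:

    import random
    from collections import defaultdict
    
    GAMMA = 1.0  # undiscounted, since the task is episodic
    
    def sample_episode():
        """One episode of a toy random walk: non-terminal states 1..5,
        terminal states 0 and 6, start in state 3. The equiprobable policy
        moves left/right; reward +1 only for terminating on the right."""
        s, episode = 3, []
        while True:
            s_next = s + random.choice((-1, 1))
            reward = 1.0 if s_next == 6 else 0.0
            episode.append((s, reward))  # (state, reward received on leaving it)
            if s_next in (0, 6):
                return episode
            s = s_next
    
    def first_visit_mc_prediction(num_episodes=20000):
        returns = defaultdict(list)  # Returns(s)
        V = defaultdict(float)       # V(s)
        for _ in range(num_episodes):
            episode = sample_episode()
            states = [s for s, _ in episode]
            G = 0.0
            for t in range(len(episode) - 1, -1, -1):  # backwards through the episode
                s, r = episode[t]
                G = GAMMA * G + r
                if s not in states[:t]:  # first visit to s in this episode
                    returns[s].append(G)
                    V[s] = sum(returns[s]) / len(returns[s])
        return V
    
    print({s: round(v, 2) for s, v in sorted(first_visit_mc_prediction().items())})
    # true values are s/6, i.e. roughly 0.17, 0.33, 0.5, 0.67, 0.83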
    

    Monte Carlo Estimation of Action Values

    Exploring Starts: every state-action pair (s, a) has a probability greater than zero of being selected as the start of an episode. A small sketch of this condition appears after the pseudocode below.

    # Monte Carlo ES (Exploring Starts), for estimating pi ≈ pi*
    
    # Initialize:
    pi(s) arbitrarily, for all s in S
    Q(s,a) arbitrarily, for all s in S, a in A(s)
    Returns(s,a) = list(), for all s in S, a in A(s)
    
    while True:  # loop for each episode
        Choose S0 in S and A0 in A(S0) such that every pair (s,a) has probability > 0  # exploring starts
        Generate an episode from S0, A0, following pi: S0,A0,R1,...,S_{T-1},A_{T-1},R_T
        G = 0
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
            G = gamma * G + R_{t+1}
            if (St,At) not in (S0,A0), (S1,A1), ..., (S_{t-1},A_{t-1}):  # first visit to (St,At)
                Returns(St,At).append(G)
                Q(St,At) = mean(Returns(St,At))
                pi(St) = argmax_a Q(St,a)
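
    A minimal sketch of what the exploring-starts condition means in code, assuming hypothetical `states` and `actions` containers; everything after this first pair then follows pi:

    import random
    
    def exploring_start(states, actions):
        """Draw the initial (S0, A0) uniformly at random, so every state-action
        pair has probability > 0 of starting an episode (the ES condition)."""
        s0 = random.choice(list(states))
        a0 = random.choice(list(actions[s0]))
        return s0, a0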
    

    Monte Carlo Control

    \[ \pi_0 \xrightarrow{\quad E \quad} q_{\pi_0} \xrightarrow{\quad I \quad} \pi_1 \xrightarrow{\quad E \quad} \cdots \xrightarrow{\quad I \quad} \pi_* \xrightarrow{\quad E \quad} q_* \]

    Monte Carlo Control without Exploring Starts

    The ES assumption is often not satisfied in practice, i.e., it lacks generality. To drop this unrealistic premise, we must instead guarantee that every action is selected infinitely often. There are two ways to ensure this: on-policy and off-policy methods. An on-policy method evaluates and improves the same policy that is used to make decisions, whereas an off-policy method evaluates and improves a policy different from the one used to generate the data. On-policy methods are generally simpler and are usually considered first; off-policy methods, because a second, different policy is involved, require extra machinery, and compared with on-policy methods they tend to have higher variance and converge more slowly.

    On-policy methods are usually soft, i.e., \(\pi(a|s) > 0, \quad \forall s \in S, a \in A(s)\), but they are gradually moved toward a deterministic optimal policy.
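
    For instance, here is a minimal sketch of the \(\epsilon\)-soft probability assignment used by the algorithm below; the dict-based representation of Q values is an assumption for illustration:

    def epsilon_soft_probs(q_values, epsilon=0.1):
        """Epsilon-soft probabilities for one state: the greedy action gets
        1 - epsilon + epsilon/|A(s)|, every other action gets epsilon/|A(s)|."""
        n = len(q_values)
        greedy = max(q_values, key=q_values.get)
        return {a: 1 - epsilon + epsilon / n if a == greedy else epsilon / n
                for a in q_values}
    
    # With two actions and epsilon = 0.1, the greedy action gets probability 0.95.
    print(epsilon_soft_probs({"left": 0.2, "right": 0.7}))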

    # on-policy first-visit MC control (for epsilon-soft policies), estimating pi ≈ pi*
    
    # Initialize:
    pi = an arbitrary epsilon-soft policy
    Q(s,a) arbitrarily, for all s in S, a in A(s)
    Returns(s,a) = list(), for all s in S, a in A(s)
    
    while True:  # loop for each episode
        Generate an episode following pi: S0,A0,R1,...,S_{T-1},A_{T-1},R_T
        G = 0
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
            G = gamma * G + R_{t+1}
            if (St,At) not in (S0,A0), (S1,A1), ..., (S_{t-1},A_{t-1}):  # first visit to (St,At)
                Returns(St,At).append(G)
                Q(St,At) = mean(Returns(St,At))
                A* = argmax_a Q(St,a)  # ties broken arbitrarily
                for a in A(St):
                    if a == A*:
                        pi(a|St) = 1 - epsilon + epsilon/|A(St)|
                    else:
                        pi(a|St) = epsilon/|A(St)|
    

    Off-policy Prediction via Importance Sampling

    All learning-control methods face the exploration-exploitation dilemma: on the one hand, to learn the value of each action, the subsequent behavior should be optimal; on the other hand, to find the optimum one has to try all kinds of actions, which conflicts with always choosing the best one. Off-policy methods use two policies at once: one that is learned and becomes the optimal policy, called the target policy, and another that is used to generate the data (the behavior), called the behavior policy.

    Off-policy methods rest on one assumption: every action that can occur under the target policy must also be able to occur under the behavior policy. That is, we assume: if \(\pi(a|s) > 0\), then \(b(a|s) > 0\). This is known as the assumption of coverage.

    Importance Sampling

    Importance sampling (IS) is a general technique for estimating expected values under one distribution given samples drawn from another. Applied to off-policy learning, IS weights the returns by the relative probability of the trajectory under the target and behavior policies (the IS ratio).

    Under the target policy,

    \[
    \Pr\{A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, A_{t:T-1} \sim \pi\}
    = \pi(A_t|S_t)\, p(S_{t+1}|S_t,A_t)\, \pi(A_{t+1}|S_{t+1}) \cdots p(S_T|S_{T-1},A_{T-1})
    = \prod_{k=t}^{T-1} \pi(A_k|S_k)\, p(S_{k+1}|S_k,A_k)
    \]

    where \(p\) is the state-transition probability.

    The IS ratio is then:

    \[ \rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\, p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\, p(S_{k+1}|S_k,A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)} \]

    As this shows, the IS ratio ultimately depends only on the two policies and the sampled sequence, not on the MDP itself (the state-transition probabilities cancel out).
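
    A short sketch of this cancellation in code: given only the two policies' action probabilities along a sampled trajectory, the ratio can be computed without ever touching \(p\). The function signatures here are assumptions for illustration:

    def importance_sampling_ratio(episode, pi, b):
        """rho_{t:T-1} with t = 0: the product of pi(a|s) / b(a|s) along the
        episode. episode is a list of (state, action) pairs; pi(a, s) and
        b(a, s) return the action probability under the target / behavior policy."""
        rho = 1.0
        for s, a in episode:
            rho *= pi(a, s) / b(a, s)
        return rho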

    What we want to estimate is the expected return under the target policy, but the returns we can obtain are all generated under the behavior policy, so their plain expectation is the wrong quantity: \(\mathbb{E}[G_t \mid S_t = s] = v_b(s)\). This is where the IS ratio comes in: \(\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)\).

    The value function can then be estimated as:

    \[ V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{|J(s)|} \]

    where \(J(s)\) denotes the set of all time steps at which state \(s\) is visited. The expression above is called ordinary importance sampling; the alternative below is weighted importance sampling:

    \[ V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in J(s)} \rho_{t:T(t)-1}} \]
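
    A minimal sketch of the two estimators, assuming `rhos[i]` and `returns[i]` hold \(\rho_{t:T(t)-1}\) and \(G_t\) for the i-th visit to \(s\):

    def ordinary_is(rhos, returns):
        """Ordinary IS: unbiased, but the variance can be very large (even unbounded)."""
        return sum(rho * g for rho, g in zip(rhos, returns)) / len(returns)
    
    def weighted_is(rhos, returns):
        """Weighted IS: biased (the bias vanishes asymptotically), much lower variance."""
        total = sum(rhos)
        return sum(rho * g for rho, g in zip(rhos, returns)) / total if total > 0 else 0.0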

    Incremental Implementation

    \[ V_{n+1} \doteq \frac{\sum_{k=1}^{n} W_k G_k}{\sum_{k=1}^{n} W_k} = V_n + \frac{W_n}{C_n} [G_n - V_n], \quad n \ge 1, \quad \text{where } C_{n+1} = C_n + W_{n+1},\; C_0 = 0 \]
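
    Here is a sketch of this incremental rule as a small helper (a generic weighted running average; the class name is an assumption), which matches the update the off-policy algorithms below apply to each Q(s, a):

    class WeightedAverage:
        """Incremental weighted average: V <- V + (W / C) * (G - V), with C the
        cumulative sum of the weights seen so far (C_0 = 0)."""
        def __init__(self):
            self.value = 0.0  # V_n
            self.c = 0.0      # C_n
    
        def update(self, g, w):
            if w == 0:
                return self.value
            self.c += w
            self.value += (w / self.c) * (g - self.value)
            return self.value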

    # off-policy MC prediction (policy evaluation), for estimating Q ≈ q_pi
    
    # Initialize, for all s in S, a in A(s):
    Q(s,a) arbitrarily
    C(s,a) = 0
    
    while True:  # loop for each episode
        b = any policy with coverage of pi
        Generate an episode following b: S0,A0,R1,...,S_{T-1},A_{T-1},R_T
        G = 0
        W = 1
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
            G = gamma * G + R_{t+1}
            C(St,At) = C(St,At) + W
            Q(St,At) = Q(St,At) + (W / C(St,At)) * [G - Q(St,At)]
            W = W * pi(At|St) / b(At|St)
            if W == 0:
                break
    
    # off-policy MC control, for estimating pi ≈ pi*
    
    # Initialize, for all s in S, a in A(s):
    Q(s,a) arbitrarily
    C(s,a) = 0
    pi(s) = argmax_a Q(s,a)  # with ties broken consistently
    
    while True:  # loop for each episode
        b = any soft policy
        Generate an episode using b: S0,A0,R1,...,S_{T-1},A_{T-1},R_T
        G = 0
        W = 1
        for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
            G = gamma * G + R_{t+1}
            C(St,At) = C(St,At) + W
            Q(St,At) = Q(St,At) + (W / C(St,At)) * [G - Q(St,At)]
            pi(St) = argmax_a Q(St,a)  # with ties broken consistently
            if At != pi(St):
                break  # the rest of the episode could not have been generated by pi
            W = W * 1/b(At|St)
    