Monte Carlo Methods
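Monte Carlo (MC) methods do not require complete knowledge of the environment; they only need to be able to sample experience from it. MC methods are based on averaging sample returns. They are usually applied to episodic tasks.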
Monte Carlo Prediction
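There are two variants: first-visit MC and every-visit MC. First-visit MC averages the returns that follow the first visit to s in each episode, while every-visit MC averages the returns that follow every visit to s.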
# First-visit MC prediction, for estimating V ≈ v_pi
Input: a policy pi to be evaluated
# Initialize:
V(s) arbitrarily, for all s in S
Returns(s) = list(), for all s in S
while True:
    Generate an episode following pi: S0,A0,R1,S1,A1,R2,...,S_{T-1},A_{T-1},R_T
    G = 0
    for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        if St not in S0,S1,...,S_{t-1}:  # first visit to St
            Returns(St).append(G)
            V(St) = mean(Returns(St))
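As a rough illustration of the pseudocode above, the following Python sketch runs first-visit MC prediction on episodes supplied by the caller. The function name `first_visit_mc_prediction`, the episode format (a list of `(state, reward)` pairs), and the toy coin-flip chain in the usage lines are assumptions made for this example, not part of the algorithm statement.

```python
import random
from collections import defaultdict


def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate V ~ v_pi by averaging first-visit returns.

    generate_episode() must return [(S0, R1), (S1, R2), ..., (S_{T-1}, R_T)],
    sampled by following the policy being evaluated.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for s, _ in episode]
        G = 0.0
        # Walk backwards so G accumulates the discounted return from step t.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            if state not in states[:t]:  # first visit to this state
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V


# Hypothetical toy task: A -> B -> terminal; leaving B pays 0 or 1 with equal probability.
rng = random.Random(0)


def coin_chain_episode():
    return [("A", 0.0), ("B", float(rng.random() < 0.5))]


V = first_visit_mc_prediction(coin_chain_episode, num_episodes=20000)
print(V)  # both values should be close to 0.5
```

Keeping a running sum and count per state gives the same mean as storing the full `Returns(s)` list, with constant memory per state.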
Monte Carlo Estimation of Action Values
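Exploring starts (ES): every state-action pair (s, a) has a probability greater than 0 of being selected as the start of an episode.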
# Monte Carlo ES (Exploring Starts), for estimating pi ≈ pi*
# Initialize:
pi(s) in A(s) arbitrarily, for all s in S
Q(s,a) arbitrarily, for all s in S, a in A(s)
Returns(s,a) = list(), for all s in S, a in A(s)
while True:
    Choose S0 in S and A0 in A(S0) randomly, s.t. every pair (s,a) has probability > 0  # ES
    Generate an episode from S0, A0, following pi: S0,A0,R1,...,S_{T-1},A_{T-1},R_T
    G = 0
    for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        if (St,At) not in (S0,A0),(S1,A1),...,(S_{t-1},A_{t-1}):
            Returns(St,At).append(G)
            Q(St,At) = mean(Returns(St,At))
            pi(St) = argmax_a Q(St,a)
Monte Carlo Control
Monte Carlo Control without Exploring Starts
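The exploring-starts assumption often does not hold in practice, so it does not generalize well. To drop this unrealistic assumption, we must instead ensure that every action continues to be selected infinitely often. There are two ways to guarantee this: on-policy and off-policy methods. On-policy methods evaluate and improve the same policy that is used to make decisions, whereas off-policy methods evaluate and improve a policy different from the one used to generate the data. On-policy methods are generally simpler and are usually considered first. Off-policy methods require extra work because of the second policy, and they often have higher variance and converge more slowly than on-policy methods.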
On-policy control methods usually use soft policies, i.e. \(\pi(a|s) > 0,\quad \forall s \in S,\ a \in A(s)\), which are gradually shifted closer and closer to a deterministic optimal policy.
# On-policy first-visit MC control (for epsilon-soft policies), for estimating pi ≈ pi*
# Initialize:
pi = an arbitrary epsilon-soft policy
Q(s,a) arbitrarily, for all s in S, a in A(s)
Returns(s,a) = list(), for all s in S, a in A(s)
while True:
    Generate an episode following pi: S0,A0,R1,...,S_{T-1},A_{T-1},R_T
    G = 0
    for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        if (St,At) not in (S0,A0),(S1,A1),...,(S_{t-1},A_{t-1}):
            Returns(St,At).append(G)
            Q(St,At) = mean(Returns(St,At))
            A* = argmax_a Q(St,a)  # ties broken arbitrarily
            for a in A(St):
                if a == A*:
                    pi(a|St) = 1 - epsilon + epsilon/|A(St)|
                else:
                    pi(a|St) = epsilon/|A(St)|
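The policy-improvement step at the end of the loop above can be sketched in Python as follows. This is a minimal example under stated assumptions: actions are indexed 0..n-1, and the helper names `epsilon_greedy_probs` and `sample_action` are hypothetical, chosen only for this illustration.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(.|s) for one state, given that state's action values."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])  # first maximiser wins ties
    probs = [epsilon / n] * n      # every action keeps probability epsilon/|A(s)|
    probs[best] += 1.0 - epsilon   # the greedy action gets the remaining mass
    return probs


def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = epsilon_greedy_probs([1.0, 3.0, 2.0], epsilon=0.3)
print(probs)  # approximately [0.1, 0.8, 0.1]
```

Because every action keeps at least `epsilon / n` probability, the resulting policy stays soft for any `epsilon > 0`.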
Off-policy Prediction via Importance Sampling
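All learning control methods face the exploration-exploitation dilemma: they seek to learn action values conditional on subsequent optimal behavior, yet they must behave non-optimally in order to try all actions and find the optimal ones. Off-policy methods address this by using two policies at once: one that is learned about and becomes the optimal policy, called the target policy, and one that generates the data (behavior), called the behavior policy.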
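Off-policy methods rest on one assumption: every action that can be taken under the target policy must also be taken, at least occasionally, under the behavior policy. That is, if \(\pi(a|s) > 0\), then \(b(a|s) > 0\). This is called the assumption of coverage.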
Importance sampling
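Importance sampling (IS) is a general technique for estimating expected values under one distribution given samples drawn from another. In off-policy learning, IS is applied by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies; this relative probability is called the importance-sampling ratio (IS ratio).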
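The IS ratio is then:
\[
\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\, p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\, p(S_{k+1}|S_k,A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}
\]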
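As this shows, the IS ratio depends only on the two policies and the sampled sequence, not on the dynamics of the MDP (the state-transition probabilities cancel).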
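Since the transition probabilities cancel, the ratio can be computed from the two policies alone. The sketch below assumes policies are stored as nested dictionaries (`state -> action -> probability`); that representation and the function name `importance_sampling_ratio` are choices made for this example.

```python
def importance_sampling_ratio(trajectory, target, behavior):
    """Product of pi(a|s) / b(a|s) over the given (state, action) pairs."""
    rho = 1.0
    for state, action in trajectory:
        b_prob = behavior[state][action]
        if b_prob == 0.0:
            # Cannot happen for data actually sampled from b; guards bad input.
            raise ValueError("b(a|s) = 0 for a sampled action")
        rho *= target[state].get(action, 0.0) / b_prob
    return rho


target = {"s": {"left": 1.0, "right": 0.0}}    # deterministic target policy
behavior = {"s": {"left": 0.5, "right": 0.5}}  # uniform behavior policy

print(importance_sampling_ratio([("s", "left")] * 3, target, behavior))              # 8.0
print(importance_sampling_ratio([("s", "left"), ("s", "right")], target, behavior))  # 0.0
```

With a deterministic target policy, the ratio is zero as soon as the behavior policy takes an action the target policy would not take.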
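We want to estimate the expected returns under the target policy, but all the returns we have were obtained under the behavior policy, so they have the wrong expectation: \(E[G_t|S_t = s] = v_b(s)\). This is where the IS ratio comes in: \(E[\rho_{t:T-1}G_t|S_t = s] = v_{\pi}(s)\).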
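To estimate \(v_{\pi}(s)\), scale the returns by their ratios and average the results:
\[
V(s) = \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{|J(s)|}
\]
where \(J(s)\) is the set of all time steps at which state s is visited, and \(T(t)\) is the first termination time after t. This estimator is called ordinary IS. The alternative, weighted IS, uses a weighted average:
\[
V(s) = \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in J(s)} \rho_{t:T(t)-1}}
\]
or zero if the denominator is zero.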
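To make the difference between the two estimators concrete, here is a small Python sketch. It assumes the ratio and return for each visit to a state have already been collected as `(rho, G)` pairs; the function names and the sample numbers are made up for illustration.

```python
def ordinary_is(samples):
    """Ordinary IS: simple average of rho * G over all visits."""
    if not samples:
        return 0.0
    return sum(rho * g for rho, g in samples) / len(samples)


def weighted_is(samples):
    """Weighted IS: average of G weighted by rho; zero if all weights are zero."""
    total_weight = sum(rho for rho, _ in samples)
    if total_weight == 0.0:
        return 0.0
    return sum(rho * g for rho, g in samples) / total_weight


samples = [(2.0, 1.0), (0.0, 5.0), (0.5, 4.0)]  # hypothetical (rho, G) pairs
print(ordinary_is(samples))  # (2 + 0 + 2) / 3
print(weighted_is(samples))  # (2 + 0 + 2) / 2.5
```

With a single sample, the weighted estimate equals the observed return regardless of the ratio, while the ordinary estimate is the return multiplied by the ratio, which can be arbitrarily large.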
Incremental Implementation
# Off-policy MC prediction (policy evaluation), for estimating Q ≈ q_pi
# Initialize, for all s in S, a in A(s):
Q(s,a) arbitrarily
C(s,a) = 0
while True:
    b = any policy with coverage of pi
    Generate an episode following b: S0,A0,R1,...,S_{T-1},A_{T-1},R_T
    G = 0
    W = 1
    for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        C(St,At) = C(St,At) + W
        Q(St,At) = Q(St,At) + W/C(St,At) * [G - Q(St,At)]
        W = W * pi(At|St)/b(At|St)
        if W == 0:
            break  # earlier steps would get zero weight; go to the next episode
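The incremental weighted-IS update above can be sketched in Python as follows. The episode format (lists of `(state, action, reward)` triples), the nested-dictionary policies, and the function name `off_policy_mc_prediction` are assumptions for this example; Q is initialised to zero as one arbitrary choice.

```python
from collections import defaultdict


def off_policy_mc_prediction(episodes, target, behavior, gamma=1.0):
    """Incremental weighted-IS estimate of q_pi from episodes generated by b.

    episodes: iterable of [(state, action, reward), ...] lists.
    target / behavior: dict state -> dict action -> probability.
    """
    Q = defaultdict(float)  # arbitrary initial values; zero here
    C = defaultdict(float)  # cumulative sum of weights per (state, action)
    for episode in episodes:
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            W *= target[state][action] / behavior[state][action]
            if W == 0.0:
                break  # all earlier steps would receive zero weight
    return dict(Q)


target = {"s": {0: 1.0, 1: 0.0}}
behavior = {"s": {0: 0.5, 1: 0.5}}
Q = off_policy_mc_prediction([[("s", 0, 1.0), ("s", 0, 2.0)]], target, behavior)
print(Q[("s", 0)])  # weighted average (1*2 + 2*3) / 3
```

Only the cumulative weight `C` is stored per pair, so no list of past returns or ratios needs to be kept.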
# Off-policy MC control, for estimating pi ≈ pi*
# Initialize, for all s in S, a in A(s):
Q(s,a) arbitrarily
C(s,a) = 0
pi(s) = argmax_a Q(s,a)  (with ties broken consistently)
while True:
    b = any soft policy
    Generate an episode using b: S0,A0,R1,...,S_{T-1},A_{T-1},R_T
    G = 0
    W = 1
    for t in range(T-1, -1, -1):  # t = T-1, T-2, ..., 0
        G = gamma * G + R_{t+1}
        C(St,At) = C(St,At) + W
        Q(St,At) = Q(St,At) + W/C(St,At) * [G - Q(St,At)]
        pi(St) = argmax_a Q(St,a)  (with ties broken consistently)
        if At != pi(St):
            break  # exit the inner loop and proceed to the next episode
        W = W * 1/b(At|St)
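A compact Python sketch of the control loop above is given below. It assumes each episode step records the behavior policy's probability for the action taken, as `(state, action, reward, b_prob)`; this format, the function name `off_policy_mc_control`, and the two one-step episodes are hypothetical choices for illustration.

```python
from collections import defaultdict


def off_policy_mc_control(episodes, actions, gamma=1.0):
    """episodes: iterable of [(state, action, reward, b_prob), ...] lists,
    where b_prob = b(action | state) under the soft behavior policy."""
    Q = defaultdict(float)  # arbitrary initial values; zero here
    C = defaultdict(float)  # cumulative sum of weights per (state, action)
    pi = {}                 # greedy target policy, filled in as states are seen
    for episode in episodes:
        G, W = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            # max() returns the first maximiser, so ties are broken consistently.
            pi[state] = max(actions, key=lambda a: Q[(state, a)])
            if action != pi[state]:
                break  # the target policy would not have taken this action
            W *= 1.0 / b_prob  # pi(At|St) = 1 for the greedy action
    return dict(Q), pi


episodes = [
    [("s", 0, 0.0, 0.5)],  # action 0 pays nothing
    [("s", 1, 1.0, 0.5)],  # action 1 pays 1
]
Q, pi = off_policy_mc_control(episodes, actions=[0, 1])
print(pi)  # {'s': 1}
```

Because the inner loop stops at the first non-greedy action, this method learns only from the tails of episodes, which can make learning slow when non-greedy actions are common.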