Temporal-Difference Learning
TD learning occupies a central place in reinforcement learning, combining ideas from both DP and MC. Like MC, TD methods can learn directly from raw experience without a complete model of the environment. Like DP, they do not have to wait for the final outcome before learning: they bootstrap, meaning each estimate is updated partly on the basis of other, previously learned estimates.
The simplest form of TD:
\[
V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]
\]
This update is known as TD(0), or one-step TD.
# Tabular TD(0) for estimating v_pi
Input: the policy pi to be evaluated
Algorithm parameter: step size alpha in (0,1]
Initialize V(s), for all s in S_plus, arbitrarily except that V(terminal) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        A = action given by pi for S
        Take action A, observe R, S'
        V(S) = V(S) + alpha * [R + gamma * V(S') - V(S)]
        S = S'
        if S == terminal:
            break
TD error:
\[
\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
\]
At each time step, the TD error is the error in the value estimate made at that time; because it depends on R_{t+1} and S_{t+1}, it only becomes available at the next time step.
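As a concrete illustration, here is a minimal Python sketch of tabular TD(0) prediction following the pseudocode above. The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and the `policy` callable are assumptions made for this example, not part of the algorithm itself.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) estimate of v_pi for a fixed policy.

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done).
    """
    V = defaultdict(float)  # unvisited/terminal states default to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                        # A given by pi for S
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]                     # TD error
            V[state] += alpha * delta
            state = next_state
    return V
```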
Advantages of TD Prediction Methods
TD methods combine the advantages already noted above: like DP they bootstrap and can update online at every step without waiting for an episode to finish, and like MC they learn directly from experience without a model of the environment.
Sarsa: On-policy TD Control
\[
Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t,A_t)]
\]
Sarsa takes its name from the quintuple (State, Action, Reward, State, Action); the update is defined over the relationship among these five elements. The corresponding TD error can be written as
\[
\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t,A_t)
\]
# Sarsa (on-policy TD control) for estimating Q ≈ q*
Algorithm parameters: step size alpha in (0,1], small epsilon > 0
Initialize Q(s,a), for all s in S_plus, a in A(s), arbitrarily except that Q(terminal,.) = 0
Loop for each episode:
    Initialize S
    Choose A from S using policy derived from Q (e.g., epsilon-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., epsilon-greedy)
        Q(S,A) = Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]
        S = S'; A = A'
        if S == terminal:
            break
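A minimal Python sketch of Sarsa under the same assumed environment interface as before (`env.reset()` / `env.step(a)` returning `(next_state, reward, done)`); the `epsilon_greedy` helper and the flat `actions` list are illustrative choices for the example, not prescribed by the pseudocode.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick an action epsilon-greedily from tabular Q (ties broken at random)."""
    if random.random() < epsilon:
        return random.choice(actions)
    best = max(Q[(state, a)] for a in actions)
    return random.choice([a for a in actions if Q[(state, a)] == best])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy TD control: bootstrap on the action actually chosen in S'."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```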
Q-learning: Off-policy TD Control
\[
Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t,A_t)]
\]
# Q-learning (off-policy TD control) for estimating pi ≈ pi*
Algorithm parameters: step size alpha in (0,1], small epsilon > 0
Initialize Q(s,a), for all s in S_plus, a in A(s), arbitrarily except that Q(terminal,.) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe R, S'
        Q(S,A) = Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
        S = S'
        if S == terminal:
            break
Q-learning directly approximates q*, the optimal action-value function, independently of the behavior policy being followed.
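A sketch of Q-learning in the same style, under the same assumed environment interface and `actions` list. The only change from the Sarsa sketch is the backup target: it bootstraps on max_a Q(S', a) rather than on the action actually taken next, while behavior remains epsilon-greedy.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, bootstrap on max_a Q(S', a)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy: greedy -- bootstrap on the best action in S'.
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```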
Expected Sarsa
\[
\begin{aligned}
Q(S_t,A_t) &\leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma\,\mathbb{E}_\pi\!\left[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}\right] - Q(S_t,A_t)\right] \\
&= Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma\sum_{a}\pi(a \mid S_{t+1})\,Q(S_{t+1},a) - Q(S_t,A_t)\right]
\end{aligned}
\]
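The only piece that differs from Sarsa is the backup target, so the sketch below shows just that: the expected value of Q(S', ·) under an epsilon-greedy target policy. The function name and argument list are illustrative assumptions; it uses the same tabular Q dictionary keyed by (state, action) as in the earlier sketches.

```python
def expected_sarsa_target(Q, reward, next_state, actions, done,
                          gamma=1.0, epsilon=0.1):
    """Backup target R + gamma * sum_a pi(a|S') Q(S', a) for an epsilon-greedy pi."""
    if done:
        return reward
    q_vals = [Q[(next_state, a)] for a in actions]
    best = max(q_vals)
    n_best = sum(1 for q in q_vals if q == best)
    expected_q = 0.0
    for q in q_vals:
        # pi(a|S'): epsilon/|A| for every action, plus (1-epsilon)/n_best for greedy ones.
        prob = epsilon / len(actions) + ((1.0 - epsilon) / n_best if q == best else 0.0)
        expected_q += prob * q
    return reward + gamma * expected_q
```

In a full Expected Sarsa loop, this target would replace `reward + gamma * Q[(next_state, next_action)]` in the Sarsa update above.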
Double Q-learning
\[
Q_1(S_t,A_t) \leftarrow Q_1(S_t,A_t) + \alpha [R_{t+1} + \gamma Q_2(S_{t+1}, \arg\max_a Q_1(S_{t+1},a)) - Q_1(S_t,A_t)]
\]
# Double Q-learning, for estimating Q1 ≈ Q2 ≈ q*
Algorithm parameters: step size alpha in (0,1], small epsilon > 0
Initialize Q1(s,a) and Q2(s,a), for all s in S_plus, a in A(s), such that Q1(terminal,.) = Q2(terminal,.) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using the policy that is epsilon-greedy in Q1 + Q2
        Take action A, observe R, S'
        With probability 0.5:
            Q1(S,A) = Q1(S,A) + alpha * (R + gamma * Q2(S', argmax_a Q1(S',a)) - Q1(S,A))
        else:
            Q2(S,A) = Q2(S,A) + alpha * (R + gamma * Q1(S', argmax_a Q2(S',a)) - Q2(S,A))
        S = S'
        if S == terminal:
            break
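Putting the pseudocode into the same Python sketch form; as before, the environment interface and the flat `actions` list are assumptions made for the example, not part of the algorithm.

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Double Q-learning: select the action with one table, evaluate it with the other."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # With probability 0.5 update Q1 (argmax from Q1, value from Q2), else swap roles.
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            best_a = max(actions, key=lambda a: A[(next_state, a)])
            target = reward + (0.0 if done else gamma * B[(next_state, best_a)])
            A[(state, action)] += alpha * (target - A[(state, action)])
            state = next_state
    return Q1, Q2
```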