Policy-gradient algorithms such as TRPO and PPO are popular on-policy methods. They provide unbiased (or nearly unbiased) gradient estimates, but at the cost of high variance. Off-policy methods such as Q-learning and off-policy actor-critic algorithms (e.g., DDPG) can instead work with off-policy samples, i.e., samples generated by other learning processes, which greatly improves sample efficiency; however, they are not guaranteed to converge when nonlinear function approximation is used.
"介于回合更新与单步更新之间的算法"
GAE can be viewed either as using a value-function critic to reduce the variance of the policy-gradient (Monte Carlo) estimate, or as using on-policy Monte Carlo returns to correct the bias introduced by the critic. For $0 < \lambda < 1$, the generalized advantage estimator trades off bias against variance, controlled by the parameter $\lambda$: $\lambda = 0$ gives low variance but high bias, while $\lambda = 1$ gives high variance but low bias.
1. TD($\lambda$)
Monte Carlo methods must run a complete episode and use all of the actually observed rewards to update the estimate. Temporal Difference (TD) methods estimate the value function using only the reward sampled at the current step. A compromise between the two is to estimate with the rewards of the next n steps.
The n-step return is defined as
$$R_t^{(n)} \doteq r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{n-1} r_{t+n}+\gamma^{n} \hat{v}\left(S_{t+n}, \mathrm{w}_{t+n-1}\right), \quad 0 \leq t \leq T-n$$
Weighting these returns gives the $\lambda$-return:
$$R_{t}^{\lambda} \doteq(1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$
The first term (the 1-step return) has weight $1-\lambda$, the second (the 2-step return) has weight $(1-\lambda)\lambda$, and so on, until the last term, whose weight is $\lambda^{T-t-1}$ (where $T$ is the episode length). The weights decay as $n$ grows and sum to 1, because
$$\sum_{n=1}^{\infty} \lambda^{n-1}=\sum_{n=0}^{\infty} \lambda^{n}=\frac{1}{1-\lambda}$$
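To make the formulas above concrete, here is a minimal Python sketch (the function names and the indexing convention are mine, not from the text): `rewards[t]` is the reward received after acting in state $s_t$, and `values` holds $\hat{v}(s_0),\ldots,\hat{v}(s_T)$ with the terminal value set to 0. It computes the n-step returns and combines them into the truncated $\lambda$-return, placing the leftover weight $\lambda^{T-t-1}$ on the full return as described above.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n): n discounted rewards followed by a bootstrapped value estimate."""
    ret = sum(gamma**k * rewards[t + k] for k in range(n))
    return ret + gamma**n * values[t + n]

def lambda_return(rewards, values, t, gamma, lam):
    """Truncated lambda-return: (1 - lam) * sum_n lam^(n-1) * R_t^(n),
    with the remaining weight lam^(T-t-1) on the full return up to T."""
    T = len(rewards)
    ret = sum((1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
              for n in range(1, T - t))
    return ret + lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)

# Example (hypothetical numbers): a 3-step episode with a zero value function.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]   # values[3] = 0: terminal state
print(lambda_return(rewards, values, t=0, gamma=0.99, lam=0.9))
```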
2. $\gamma$-just
Definition 1. The estimator $\hat{A}_{t}$ is $\gamma$-just if
$$\underset{s_{0: \infty}, a_{0: \infty}}{\mathbb{E}}\left[\hat{A}_{t}\left(s_{0: \infty}, a_{0: \infty}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]=\underset{s_{0: \infty}, a_{0: \infty}}{\mathbb{E}}\left[A^{\pi, \gamma}\left(s_{t}, a_{t}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]$$

Suppose $\hat{A}_{t}$ can be written in the form
$$\hat{A}_{t}\left(s_{0: \infty}, a_{0: \infty}\right)=Q_{t}\left(s_{t: \infty}, a_{t: \infty}\right)-b_{t}\left(s_{0: t}, a_{0: t-1}\right)$$
such that for all $\left(s_{t}, a_{t}\right)$,
$$\mathbb{E}_{s_{t+1: \infty}, a_{t+1: \infty} \mid s_{t}, a_{t}}\left[Q_{t}\left(s_{t: \infty}, a_{t: \infty}\right)\right]=Q^{\pi, \gamma}\left(s_{t}, a_{t}\right).$$
Then $\hat{A}_{t}$ is $\gamma$-just.
Proof. Substituting $\hat{A}_{t}=Q_{t}-b_{t}$ and splitting the expectation into two terms:
$$\begin{aligned}
&\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\left(Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)-b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right)\right] \\
&\quad=\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)\right] \\
&\quad\quad-\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right]
\end{aligned}$$
For the first term, the expectation is decomposed over the trajectory prefix and the remainder, and the assumption $\mathbb{E}_{s_{t+1: \infty}, a_{t+1: \infty} \mid s_{t}, a_{t}}\left[Q_{t}\right]=Q^{\pi, \gamma}\left(s_{t}, a_{t}\right)$ is applied:
$$\begin{aligned}
&\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)\right] \\
&\quad=\mathbb{E}_{s_{0: t}, a_{0: t}}\left[\mathbb{E}_{s_{t+1: \infty}, a_{t+1: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)\right]\right] \\
&\quad=\mathbb{E}_{s_{0: t}, a_{0: t}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \mathbb{E}_{s_{t+1: \infty}, a_{t+1: \infty}}\left[Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)\right]\right] \\
&\quad=\mathbb{E}_{s_{0: t}, a_{0: t}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) A^{\pi, \gamma}\left(s_{t}, a_{t}\right)\right]
\end{aligned}$$
(In the last step, replacing $Q^{\pi, \gamma}\left(s_{t}, a_{t}\right)$ by $A^{\pi, \gamma}\left(s_{t}, a_{t}\right)=Q^{\pi, \gamma}\left(s_{t}, a_{t}\right)-V^{\pi, \gamma}\left(s_{t}\right)$ does not change the expectation, since $\mathbb{E}_{a_{t}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]=0$.)
For the second term, $b_{t}$ depends only on $\left(s_{0: t}, a_{0: t-1}\right)$, so it can be pulled out of the inner expectation, and the score function integrates to zero:
$$\begin{aligned}
&\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right] \\
&\quad=\mathbb{E}_{s_{0: t}, a_{0: t-1}}\left[\mathbb{E}_{s_{t+1: \infty}, a_{t: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right]\right] \\
&\quad=\mathbb{E}_{s_{0: t}, a_{0: t-1}}\left[\mathbb{E}_{s_{t+1: \infty}, a_{t: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right] b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right] \\
&\quad=\mathbb{E}_{s_{0: t}, a_{0: t-1}}\left[0 \cdot b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right] \\
&\quad=0
\end{aligned}$$
Combining the two terms gives exactly the condition of Definition 1, so $\hat{A}_{t}$ is $\gamma$-just.
In particular, if $V=V^{\pi, \gamma}$, then the TD residual $\delta_{t}^{V}=r_{t}+\gamma V\left(s_{t+1}\right)-V\left(s_{t}\right)$ is a $\gamma$-just estimator of the advantage $A^{\pi, \gamma}$, and so are the $k$-step sums of TD residuals used below.
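A quick check of this claim using the decomposition above (this verification step is my addition, not part of the original text): write $\delta_{t}^{V}$ in the form $Q_{t}-b_{t}$ with
$$Q_{t}\left(s_{t: \infty}, a_{t: \infty}\right)=r_{t}+\gamma V^{\pi, \gamma}\left(s_{t+1}\right), \qquad b_{t}\left(s_{0: t}, a_{0: t-1}\right)=V^{\pi, \gamma}\left(s_{t}\right).$$
Then
$$\mathbb{E}_{s_{t+1} \mid s_{t}, a_{t}}\left[Q_{t}\right]=\mathbb{E}_{s_{t+1}}\left[r_{t}+\gamma V^{\pi, \gamma}\left(s_{t+1}\right)\right]=Q^{\pi, \gamma}\left(s_{t}, a_{t}\right),$$
so the condition required for $\gamma$-justness holds.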
The $k$-step advantage estimators built from TD residuals, and GAE as their exponentially weighted average, are:

$$\begin{aligned}
\hat{A}_{t}^{(1)} &:=\delta_{t}^{V} &&=-V\left(s_{t}\right)+r_{t}+\gamma V\left(s_{t+1}\right) \\
\hat{A}_{t}^{(2)} &:=\delta_{t}^{V}+\gamma \delta_{t+1}^{V} &&=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\gamma^{2} V\left(s_{t+2}\right) \\
\hat{A}_{t}^{(3)} &:=\delta_{t}^{V}+\gamma \delta_{t+1}^{V}+\gamma^{2} \delta_{t+2}^{V} &&=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\gamma^{3} V\left(s_{t+3}\right) \\
\hat{A}_{t}^{(k)} &:=\sum_{l=0}^{k-1} \gamma^{l} \delta_{t+l}^{V} &&=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\cdots+\gamma^{k-1} r_{t+k-1}+\gamma^{k} V\left(s_{t+k}\right)
\end{aligned} \quad (11\text{-}14)$$

$$\hat{A}_{t}^{(\infty)}=\sum_{l=0}^{\infty} \gamma^{l} \delta_{t+l}^{V}=-V\left(s_{t}\right)+\sum_{l=0}^{\infty} \gamma^{l} r_{t+l} \quad (15)$$

$$\begin{aligned}
\hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)} &:=(1-\lambda)\left(\hat{A}_{t}^{(1)}+\lambda \hat{A}_{t}^{(2)}+\lambda^{2} \hat{A}_{t}^{(3)}+\ldots\right) \\
&=(1-\lambda)\left(\delta_{t}^{V}+\lambda\left(\delta_{t}^{V}+\gamma \delta_{t+1}^{V}\right)+\lambda^{2}\left(\delta_{t}^{V}+\gamma \delta_{t+1}^{V}+\gamma^{2} \delta_{t+2}^{V}\right)+\ldots\right) \\
&=(1-\lambda)\left(\delta_{t}^{V}\left(1+\lambda+\lambda^{2}+\ldots\right)+\gamma \delta_{t+1}^{V}\left(\lambda+\lambda^{2}+\lambda^{3}+\ldots\right)+\gamma^{2} \delta_{t+2}^{V}\left(\lambda^{2}+\lambda^{3}+\lambda^{4}+\ldots\right)+\ldots\right) \\
&=(1-\lambda)\left(\delta_{t}^{V}\left(\frac{1}{1-\lambda}\right)+\gamma \delta_{t+1}^{V}\left(\frac{\lambda}{1-\lambda}\right)+\gamma^{2} \delta_{t+2}^{V}\left(\frac{\lambda^{2}}{1-\lambda}\right)+\ldots\right) \\
&=\sum_{l=0}^{\infty}(\gamma \lambda)^{l} \delta_{t+l}^{V}
\end{aligned} \quad (16)$$
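Equation (16) is usually evaluated with a single backward pass over the trajectory, using the recursion $\hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)}=\delta_{t}^{V}+\gamma \lambda \hat{A}_{t+1}^{\mathrm{GAE}(\gamma, \lambda)}$. Below is a minimal NumPy sketch (function and variable names are my own); `values` is assumed to contain $V\left(s_{0}\right), \ldots, V\left(s_{T}\right)$, with $V\left(s_{T}\right)=0$ for a terminal state.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) advantages for one finite trajectory.

    rewards: shape (T,)   -- r_t received after acting in state s_t
    values:  shape (T+1,) -- V(s_0), ..., V(s_T); set values[T] = 0 if s_T is terminal
    returns: shape (T,)   -- advantages[t] = sum_l (gamma*lam)^l * delta_{t+l}^V
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals delta_t^V
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):       # backward recursion implementing eq. (16)
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example (hypothetical numbers):
print(compute_gae(rewards=[1.0, 0.0, 2.0], values=[0.5, 0.4, 0.3, 0.0]))
```

Setting `lam=0` reduces each advantage to the single TD residual $\delta_{t}^{V}$ (low variance, more bias), while `lam=1` gives the full discounted sum of residuals, i.e., the Monte Carlo estimate in (15) (high variance, low bias), matching the tradeoff described at the beginning of this section.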
In the reward-shaping view, a shaping term $F\left(s, a, s^{\prime}\right)$ is added to the original reward, giving the shaped reward $r^{\prime}\left(s, a, s^{\prime}\right)=r\left(s, a, s^{\prime}\right)+F\left(s, a, s^{\prime}\right)$; the Q-function of the shaped MDP $M^{\prime}$ then satisfies
$$\begin{aligned}
\hat{Q}_{M^{\prime}}(s, a) &=\mathbb{E}_{s^{\prime} \sim P_{s a}}\left[r\left(s, a, s^{\prime}\right)+F\left(s, a, s^{\prime}\right)+\gamma \max _{a^{\prime} \in A} \hat{Q}_{M^{\prime}}\left(s^{\prime}, a^{\prime}\right)\right] \\
&=\mathbb{E}_{s^{\prime} \sim P_{s a}}\left[r^{\prime}\left(s, a, s^{\prime}\right)+\gamma \max _{a^{\prime} \in A} \hat{Q}_{M^{\prime}}\left(s^{\prime}, a^{\prime}\right)\right]
\end{aligned}$$
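As a worked step connecting this to GAE (this derivation is my addition; it assumes the standard potential-based shaping $F\left(s, a, s^{\prime}\right)=\gamma \Phi\left(s^{\prime}\right)-\Phi(s)$ with the value function as the potential, $\Phi=V$): the shaped reward becomes the TD residual, and the discounted sum of shaping terms telescopes,
$$r^{\prime}\left(s_{t}, a_{t}, s_{t+1}\right)=r_{t}+\gamma V\left(s_{t+1}\right)-V\left(s_{t}\right)=\delta_{t}^{V},$$
$$\sum_{l=0}^{\infty} \gamma^{l} F\left(s_{t+l}, a_{t+l}, s_{t+l+1}\right)=\sum_{l=0}^{\infty}\left(\gamma^{l+1} \Phi\left(s_{t+l+1}\right)-\gamma^{l} \Phi\left(s_{t+l}\right)\right)=-\Phi\left(s_{t}\right),$$
so shaping with $\Phi=V$ only shifts every discounted return by the state-dependent term $-V\left(s_{t}\right)$, leaving the advantage, and hence the policy gradient, unchanged.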