
    Reinforcement Learning Reading Notes - 13 - Policy Gradient Methods

    Study notes for:
    Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016

    References

    If you need to get familiar with the mathematical notation used in reinforcement learning, see the notation notes first:

    Policy Gradient Methods

    The value-function-based approach

    \[
    \text{Reinforcement Learning} \doteq \pi_* \\
    \quad \updownarrow \\
    \pi_* \doteq \{ \pi(s) \}, \quad s \in \mathcal{S} \\
    \quad \updownarrow \\
    \begin{cases}
    \pi(s) = \underset{a}{\operatorname{argmax}} \ v_{\pi}(s' | s, a), \ s' \in S(s), \quad \text{or} \\
    \pi(s) = \underset{a}{\operatorname{argmax}} \ q_{\pi}(s, a)
    \end{cases} \\
    \quad \updownarrow \\
    \begin{cases}
    v_*(s), \quad \text{or} \\
    q_*(s, a)
    \end{cases} \\
    \quad \updownarrow \\
    \text{approximation cases:} \\
    \begin{cases}
    \hat{v}(s, \theta) \doteq \theta^T \phi(s), \quad \text{state value function} \\
    \hat{q}(s, a, \theta) \doteq \theta^T \phi(s, a), \quad \text{action value function}
    \end{cases} \\
    \text{where} \\
    \theta \text{ - the value function's weight vector}
    \]
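
    As a quick illustration of the linear approximations above, here is a minimal sketch; the one-hot feature maps `phi` and `phi_sa` are illustrative stand-ins, not anything from the book.

    ```python
    import numpy as np

    def v_hat(s, theta, phi):
        """Linear state-value approximation: v_hat(s, theta) = theta^T phi(s)."""
        return theta @ phi(s)

    def q_hat(s, a, theta, phi_sa):
        """Linear action-value approximation: q_hat(s, a, theta) = theta^T phi(s, a)."""
        return theta @ phi_sa(s, a)

    # Illustrative one-hot features over 4 states and 2 actions.
    n_states, n_actions = 4, 2
    phi = lambda s: np.eye(n_states)[s]
    phi_sa = lambda s, a: np.eye(n_states * n_actions)[s * n_actions + a]

    theta_v = np.zeros(n_states)                # weights of the state value function
    theta_q = np.zeros(n_states * n_actions)    # weights of the action value function
    print(v_hat(2, theta_v, phi), q_hat(2, 1, theta_q, phi_sa))
    ```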

    The new idea of policy gradient methods

    \[
    \text{Reinforcement Learning} \doteq \pi_* \\
    \quad \updownarrow \\
    \pi_* \doteq \{ \pi(s) \}, \quad s \in \mathcal{S} \\
    \quad \updownarrow \\
    \pi(s) = \underset{a}{\operatorname{argmax}} \ \pi(a|s, \theta) \\
    \text{where} \\
    \pi(a|s, \theta) \in [0, 1] \\
    s \in \mathcal{S}, \ a \in \mathcal{A} \\
    \quad \updownarrow \\
    \pi(a|s, \theta) \doteq \frac{\exp(h(s, a, \theta))}{\sum_b \exp(h(s, b, \theta))} \\
    \quad \updownarrow \\
    h(s, a, \theta) \doteq \theta^T \phi(s, a) \\
    \text{where} \\
    \theta \text{ - the policy weight vector}
    \]
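
    A minimal sketch of the softmax-in-action-preferences parameterization above, with linear preferences h(s, a, theta) = theta^T phi(s, a); the one-hot feature map is an illustrative stand-in.

    ```python
    import numpy as np

    def policy(s, theta, phi_sa, actions):
        """pi(a|s,theta) = exp(h(s,a,theta)) / sum_b exp(h(s,b,theta)),
        with linear action preferences h(s,a,theta) = theta^T phi(s,a)."""
        h = np.array([theta @ phi_sa(s, b) for b in actions])
        h -= h.max()                      # subtract the max for numerical stability
        e = np.exp(h)
        return e / e.sum()

    # Illustrative one-hot features over 3 states and 2 actions.
    n_states, actions = 3, [0, 1]
    phi_sa = lambda s, a: np.eye(n_states * len(actions))[s * len(actions) + a]
    theta = 0.1 * np.random.default_rng(0).standard_normal(n_states * len(actions))
    print(policy(0, theta, phi_sa, actions))  # a valid probability distribution over the actions
    ```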

    The policy gradient theorem

    Episodic tasks

    How to compute the value of a policy, \(\eta\)

    \[
    \eta(\theta) \doteq v_{\pi_\theta}(s_0) \\
    \text{where} \\
    \eta \text{ - the performance measure} \\
    v_{\pi_\theta} \text{ - the true value function for } \pi_\theta \text{, the policy determined by } \theta \\
    s_0 \text{ - some particular state}
    \]

    • The policy gradient theorem

    \[
    \nabla \eta(\theta) = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s, a) \nabla_\theta \pi(a|s, \theta) \\
    \text{where} \\
    d_{\pi}(s) \text{ - the on-policy distribution, the fraction of time spent in } s \text{ under the target policy } \pi \\
    \sum_s d_{\pi}(s) = 1
    \]

    REINFORCE: Monte Carlo Policy Gradient

    • Policy gradient formula

    \[
    \begin{align}
    \nabla \eta(\theta) & = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s, a) \nabla_\theta \pi(a|s, \theta) \\
    & = \mathbb{E}_\pi \left[ \gamma^t \sum_a q_\pi(S_t, a) \nabla_\theta \pi(a|S_t, \theta) \right] \\
    & = \mathbb{E}_\pi \left[ \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \right]
    \end{align}
    \]

    • Update rule

    \[
    \begin{align}
    \theta_{t+1} & \doteq \theta_t + \alpha \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \\
    & = \theta_t + \alpha \gamma^t G_t \nabla_\theta \log \pi(A_t|S_t, \theta)
    \end{align}
    \]

    • Algorithm (REINFORCE: A Monte Carlo Policy Gradient Method (episodic))
      See the pseudocode in the original book; it is not repeated here. A minimal code sketch follows below.
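
    A minimal sketch of the REINFORCE update above on a toy one-step problem; the two-armed bandit environment and all names here are illustrative, not the book's pseudocode.

    ```python
    import numpy as np

    n_actions, gamma, alpha = 2, 1.0, 0.1
    theta = np.zeros(n_actions)              # one preference per action (softmax policy)
    rng = np.random.default_rng(0)

    def pi(theta):
        e = np.exp(theta - theta.max())
        return e / e.sum()

    def grad_log_pi(a, theta):
        # For a softmax over per-action preferences: grad log pi(a) = onehot(a) - pi
        return np.eye(n_actions)[a] - pi(theta)

    for episode in range(2000):
        # Generate one (single-step) episode following pi.
        a = rng.choice(n_actions, p=pi(theta))
        G = rng.normal(1.0 if a == 1 else 0.0, 1.0)  # return G_t; action 1 is better on average
        t = 0
        # REINFORCE: theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t|S_t, theta)
        theta += alpha * gamma**t * G * grad_log_pi(a, theta)

    print(pi(theta))  # most of the probability mass should end up on action 1
    ```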

    REINFORCE with Baseline

    • Policy gradient formula

    \[
    \begin{align}
    \nabla \eta(\theta) & = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s, a) \nabla_\theta \pi(a|s, \theta) \\
    & = \sum_s d_{\pi}(s) \sum_{a} \left( q_{\pi}(s, a) - b(s) \right) \nabla_\theta \pi(a|s, \theta)
    \end{align} \\
    \because \\
    \sum_{a} b(s) \nabla_\theta \pi(a|s, \theta) \\
    \quad = b(s) \nabla_\theta \sum_{a} \pi(a|s, \theta) \\
    \quad = b(s) \nabla_\theta 1 \\
    \quad = 0 \\
    \text{where} \\
    b(s) \text{ - an arbitrary baseline function, e.g. } b(s) = \hat{v}(s, w)
    \]
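
    The key step is that the baseline term contributes nothing to the gradient, \(\sum_a b(s) \nabla_\theta \pi(a|s, \theta) = 0\). A quick numerical check of this identity for a softmax policy (the numbers are arbitrary):

    ```python
    import numpy as np

    # Softmax policy over 3 actions with per-action preferences theta.
    theta = np.array([0.3, -1.2, 0.5])
    p = np.exp(theta - theta.max())
    p /= p.sum()

    # For this parameterization, grad_theta pi(a) = pi(a) * (onehot(a) - p).
    grads = np.array([p[a] * (np.eye(3)[a] - p) for a in range(3)])

    b = 7.0                          # any baseline value b(s)
    print((b * grads).sum(axis=0))   # ~[0, 0, 0]: the baseline adds zero to the gradient
    ```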

    • Update rule

    \[
    \delta = G_t - \hat{v}(S_t, w) \\
    w_{t+1} = w_{t} + \beta \delta \nabla_w \hat{v}(S_t, w) \\
    \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta)
    \]

    • Algorithm
      See the pseudocode in the original book; it is not repeated here. A minimal code sketch follows below.
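
    A minimal sketch of one REINFORCE-with-baseline sweep over a finished episode, using b(s) = v_hat(s, w) as the baseline; the feature maps and the trajectory format are illustrative, not the book's pseudocode.

    ```python
    import numpy as np

    def reinforce_with_baseline(trajectory, theta, w, phi, phi_sa, actions,
                                alpha=0.01, beta=0.01, gamma=1.0):
        """trajectory: list of (S_t, A_t, R_{t+1}) tuples for one finished episode."""
        rewards = [r for (_, _, r) in trajectory]
        for t, (s, a, _) in enumerate(trajectory):
            # Return from time t: G_t = R_{t+1} + gamma * R_{t+2} + ...
            G = sum(gamma**k * r for k, r in enumerate(rewards[t:]))
            delta = G - w @ phi(s)                    # delta = G_t - v_hat(S_t, w)
            w += beta * delta * phi(s)                # baseline (critic) update
            # grad log pi for a softmax-linear policy: phi(s,a) - sum_b pi(b|s) phi(s,b)
            h = np.array([theta @ phi_sa(s, b) for b in actions])
            p = np.exp(h - h.max())
            p /= p.sum()
            grad_log = phi_sa(s, a) - sum(p[i] * phi_sa(s, b) for i, b in enumerate(actions))
            theta += alpha * gamma**t * delta * grad_log
        return theta, w
    ```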

    Actor-Critic Methods

    This algorithm is essentially:

    1. a TD generalization of REINFORCE with baseline,
    2. plus eligibility traces.

    Note: Monte Carlo methods must wait until the current episode finishes before the true return \(G_t\) can be computed.
    TD methods avoid this requirement (and thus gain efficiency) by using a temporal-difference estimate of the return, \(G_t^{(1)} \approx G_t\), at the cost of some approximation error.
    Eligibility traces accumulate (with decay) the value-function gradients used in the weight updates: \(e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma \lambda e_{t-1}\).

    • Update rule

    \[
    \delta = G_t^{(1)} - \hat{v}(S_t, w) \\
    \quad = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \\
    w_{t+1} = w_{t} + \beta \delta \nabla_w \hat{v}(S_t, w) \\
    \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta)
    \]

    • Update rule with eligibility traces

    \[
    \delta = R + \gamma \hat{v}(s', w) - \hat{v}(s, w) \\
    e^w = \lambda^w e^w + \gamma^t \nabla_w \hat{v}(s, w) \\
    w_{t+1} = w_{t} + \beta \delta e^w \\
    e^{\theta} = \lambda^{\theta} e^{\theta} + \gamma^t \nabla_\theta \log \pi(A_t|S_t, \theta) \\
    \theta_{t+1} = \theta_t + \alpha \delta e^{\theta} \\
    \text{where} \\
    R + \gamma \hat{v}(s', w) = G_t^{(1)} \\
    \delta \text{ - TD error} \\
    e^w \text{ - eligibility trace of the state value function} \\
    e^{\theta} \text{ - eligibility trace of the policy}
    \]

    • Algorithm
      See the pseudocode in the original book; it is not repeated here. A minimal code sketch follows below.
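
    A minimal sketch of one episodic actor-critic step with eligibility traces, following the update rule above; `phi`, `grad_log_pi`, and the `I = gamma^t` bookkeeping are illustrative helpers, not the book's pseudocode.

    ```python
    import numpy as np

    def actor_critic_step(s, a, r, s_next, done, theta, w, e_theta, e_w, I,
                          phi, grad_log_pi, alpha=0.01, beta=0.01, gamma=0.99,
                          lam_theta=0.9, lam_w=0.9):
        """One step of actor-critic with eligibility traces (episodic)."""
        v_next = 0.0 if done else w @ phi(s_next)
        delta = r + gamma * v_next - w @ phi(s)           # TD error
        e_w[:] = lam_w * e_w + I * phi(s)                 # e^w = lambda^w e^w + gamma^t grad v_hat
        w += beta * delta * e_w                           # critic update
        e_theta[:] = lam_theta * e_theta + I * grad_log_pi(s, a, theta)
        theta += alpha * delta * e_theta                  # actor update
        return I * gamma                                  # gamma^t for the next time step
    ```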

    Policy Gradient for Continuing Problems (Average Reward Rate)

    • Policy performance formula
      For a continuing task, the performance of a policy is the average reward per time step.

    \[
    \begin{align}
    \eta(\theta) \doteq r(\theta) & \doteq \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \mathbb{E} [R_t \mid \theta_0 = \theta_1 = \dots = \theta_{t-1} = \theta] \\
    & = \lim_{t \to \infty} \mathbb{E} [R_t \mid \theta_0 = \theta_1 = \dots = \theta_{t-1} = \theta]
    \end{align}
    \]

    • Update rule

    \[
    \delta = G_t^{(1)} - \hat{v}(S_t, w) \\
    \quad = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \\
    w_{t+1} = w_{t} + \beta \delta \nabla_w \hat{v}(S_t, w) \\
    \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta)
    \]

    • Update rule: Actor-Critic with eligibility traces (continuing)

    \[
    \delta = R - \bar{R} + \gamma \hat{v}(s', w) - \hat{v}(s, w) \\
    \bar{R} = \bar{R} + \eta \delta \\
    e^w = \lambda^w e^w + \gamma^t \nabla_w \hat{v}(s, w) \\
    w_{t+1} = w_{t} + \beta \delta e^w \\
    e^{\theta} = \lambda^{\theta} e^{\theta} + \gamma^t \nabla_\theta \log \pi(A_t|S_t, \theta) \\
    \theta_{t+1} = \theta_t + \alpha \delta e^{\theta} \\
    \text{where} \\
    R + \gamma \hat{v}(s', w) = G_t^{(1)} \\
    \delta \text{ - TD error} \\
    \bar{R} \text{ - estimate of the average reward } r(\theta) \\
    e^w \text{ - eligibility trace of the state value function} \\
    e^{\theta} \text{ - eligibility trace of the policy}
    \]

    • Algorithm (Actor-Critic with eligibility traces (continuing))
      See the pseudocode in the original book; it is not repeated here. A minimal code sketch follows below.
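
    A minimal sketch of one continuing-task actor-critic step in the average-reward setting. It follows the update rule above but drops the discounting factors (gamma and gamma^t), which is the standard undiscounted form for continuing tasks; `phi` and `grad_log_pi` are illustrative helpers, not the book's pseudocode.

    ```python
    import numpy as np

    def actor_critic_continuing_step(s, a, r, s_next, theta, w, e_theta, e_w, avg_r,
                                     phi, grad_log_pi, alpha=0.01, beta=0.01,
                                     eta=0.01, lam_theta=0.9, lam_w=0.9):
        """One step of actor-critic with eligibility traces (continuing, average reward)."""
        delta = r - avg_r + w @ phi(s_next) - w @ phi(s)  # TD error relative to the average reward
        avg_r += eta * delta                              # running estimate R_bar of r(theta)
        e_w[:] = lam_w * e_w + phi(s)                     # critic trace
        w += beta * delta * e_w
        e_theta[:] = lam_theta * e_theta + grad_log_pi(s, a, theta)
        theta += alpha * delta * e_theta                  # actor update
        return avg_r                                      # caller keeps the updated R_bar
    ```
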
      The original book (draft edition) was not finished at the time of writing, so these notes for this chapter stop here.
    Original post: https://www.cnblogs.com/steven-yang/p/6624253.html