
    Policy Gradient Algorithms

     2019-10-02 17:37:47

    This blog is from: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html 

     

    Abstract: In this post, we are going to look deep into policy gradient, why it works, and many new policy gradient algorithms proposed in recent years: vanilla policy gradient, actor-critic, off-policy actor-critic, A3C, A2C, DPG, DDPG, D4PG, MADDPG, TRPO, PPO, ACER, ACKTR, SAC, TD3 & SVPG.

     

    What is Policy Gradient

    Policy gradient is an approach to solve reinforcement learning problems. If you haven’t looked into the field of reinforcement learning, please first read the section “A (Long) Peek into Reinforcement Learning » Key Concepts” for the problem definition and key concepts.

    Notations

    Here is a list of notations to help you read through equations in the post easily.

    • $s \in \mathcal{S}$: States.
    • $a \in \mathcal{A}$: Actions.
    • $r \in \mathcal{R}$: Rewards.
    • $S_t, A_t, R_t$: State, action, and reward at time step $t$ of one trajectory. I may occasionally use $s_t, a_t, r_t$ as well.
    • $\gamma$: Discount factor; penalty to uncertainty of future rewards; $0 < \gamma \leq 1$.
    • $G_t$: Return, or discounted future reward; $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
    • $P(s', r \vert s, a)$: Transition probability of getting to the next state $s'$ from the current state $s$ with action $a$ and reward $r$.
    • $\pi(a \vert s)$: Stochastic policy (agent behavior strategy); $\pi_\theta(.)$ is a policy parameterized by $\theta$.
    • $\mu(s)$: Deterministic policy; we can also label this as $\pi(s)$, but using a different letter gives better distinction so that we can easily tell when the policy is stochastic or deterministic without further explanation. Either $\pi$ or $\mu$ is what a reinforcement learning algorithm aims to learn.
    • $V(s)$: State-value function, which measures the expected return of state $s$; $V_w(.)$ is a value function parameterized by $w$.
    • $V^\pi(s)$: The value of state $s$ when we follow a policy $\pi$; $V^\pi(s) = \mathbb{E}_{a \sim \pi}[G_t \vert S_t = s]$.
    • $Q(s, a)$: Action-value function, similar to $V(s)$, but it assesses the expected return of a pair of state and action $(s, a)$; $Q_w(.)$ is an action-value function parameterized by $w$.
    • $Q^\pi(s, a)$: Similar to $V^\pi(.)$, the value of a (state, action) pair when we follow a policy $\pi$; $Q^\pi(s, a) = \mathbb{E}_{a \sim \pi}[G_t \vert S_t = s, A_t = a]$.
    • $A(s, a)$: Advantage function, $A(s, a) = Q(s, a) - V(s)$; it can be considered as another version of the Q-value with lower variance by taking the state-value off as the baseline.

    Policy Gradient

    The goal of reinforcement learning is to find an optimal behavior strategy for the agent to obtain optimal rewards. Policy gradient methods target modeling and optimizing the policy directly. The policy is usually modeled with a parameterized function with respect to $\theta$, $\pi_\theta(a \vert s)$. The value of the reward (objective) function depends on this policy, and then various algorithms can be applied to optimize $\theta$ for the best reward.

    The reward function is defined as:

    $$J(\theta) = \sum_{s \in \mathcal{S}} d^\pi(s) V^\pi(s) = \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a \vert s) Q^\pi(s, a)$$

    where $d^\pi(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$ (the on-policy state distribution under $\pi$). For simplicity, the $\theta$ parameter is omitted for the policy $\pi_\theta$ when the policy is present in the subscript of other functions; for example, $d^\pi$ and $Q^\pi$ should be $d^{\pi_\theta}$ and $Q^{\pi_\theta}$ if written in full. Imagine that you can travel along the Markov chain's states forever, and eventually, as the time progresses, the probability of you ending up with one state becomes unchanged; this is the stationary probability for $\pi_\theta$. $d^\pi(s) = \lim_{t \to \infty} P(s_t = s \vert s_0, \pi_\theta)$ is the probability that $s_t = s$ when starting from $s_0$ and following policy $\pi_\theta$ for $t$ steps. Actually, the existence of the stationary distribution of a Markov chain is one main reason why the PageRank algorithm works. If you want to read more, check this.
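
To make the stationary distribution concrete, here is a tiny numerical sketch (mine, not from the original derivation): for a made-up 2-state Markov chain induced by some fixed policy, repeatedly applying the transition matrix converges to $d^\pi$ regardless of the starting state.

```python
import numpy as np

# Hypothetical 2-state Markov chain induced by some fixed policy pi_theta.
# P[i, j] = probability of moving from state i to state j.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

d = np.array([1.0, 0.0])  # start deterministically in state 0
for _ in range(1000):     # d_{t+1} = d_t P converges to the stationary distribution
    d = d @ P

# At the fixed point, d P = d and d sums to 1.
assert np.allclose(d @ P, d)
print(d)  # close to [5/6, 1/6] for this particular chain
```

In practice we never compute $d^\pi$ explicitly; sampling states by running the policy draws them from this distribution automatically.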

    It is natural to expect policy-based methods to be more useful in the continuous space, because there is an infinite number of actions and/or states to estimate the values for, and hence value-based approaches are way too expensive computationally in the continuous space. For example, in generalized policy iteration, the policy improvement step $\arg\max_{a \in \mathcal{A}} Q^\pi(s, a)$ requires a full scan of the action space, suffering from the curse of dimensionality.

    Using gradient ascent, we can move $\theta$ toward the direction suggested by the gradient $\nabla_\theta J(\theta)$ to find the best $\theta$ for $\pi_\theta$ that produces the highest return.

    Policy Gradient Theorem

    Computing the gradient $\nabla_\theta J(\theta)$ is tricky because it depends on both the action selection (directly determined by $\pi_\theta$) and the stationary distribution of states following the target selection behavior (indirectly determined by $\pi_\theta$). Given that the environment is generally unknown, it is difficult to estimate the effect on the state distribution by a policy update.

    Luckily, the policy gradient theorem comes to save the world! Woohoo! It provides a nice reformulation of the derivative of the objective function to not involve the derivative of the state distribution $d^\pi(.)$ and simplifies the gradient computation $\nabla_\theta J(\theta)$ a lot.

    $$\nabla_\theta J(\theta) = \nabla_\theta \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a) \pi_\theta(a \vert s) \propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a) \nabla_\theta \pi_\theta(a \vert s)$$

    Proof of Policy Gradient Theorem

    This section is pretty dense, as it is the time for us to go through the proof (Sutton & Barto, 2017; Sec. 13.1) and figure out why the policy gradient theorem is correct.

    We first start with the derivative of the state value function:

    $$
    \begin{aligned}
    \nabla_\theta V^\pi(s) &= \nabla_\theta \Big( \sum_{a \in \mathcal{A}} \pi_\theta(a \vert s) Q^\pi(s, a) \Big) & \\
    &= \sum_{a \in \mathcal{A}} \Big( \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) + \pi_\theta(a \vert s) \nabla_\theta Q^\pi(s, a) \Big) & \text{; Derivative product rule.} \\
    &= \sum_{a \in \mathcal{A}} \Big( \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) + \pi_\theta(a \vert s) \nabla_\theta \sum_{s', r} P(s', r \vert s, a)(r + V^\pi(s')) \Big) & \text{; Extend } Q^\pi \text{ with future state value.} \\
    &= \sum_{a \in \mathcal{A}} \Big( \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) + \pi_\theta(a \vert s) \sum_{s', r} P(s', r \vert s, a) \nabla_\theta V^\pi(s') \Big) & \text{; } P(s', r \vert s, a) \text{ or } r \text{ is not a func of } \theta \\
    &= \sum_{a \in \mathcal{A}} \Big( \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) + \pi_\theta(a \vert s) \sum_{s'} P(s' \vert s, a) \nabla_\theta V^\pi(s') \Big) & \text{; Because } P(s' \vert s, a) = \sum_r P(s', r \vert s, a)
    \end{aligned}
    $$

    Now we have:

    $$\nabla_\theta V^\pi(s) = \sum_{a \in \mathcal{A}} \Big( \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) + \pi_\theta(a \vert s) \sum_{s'} P(s' \vert s, a) \nabla_\theta V^\pi(s') \Big)$$

    This equation has a nice recursive form and the future state value function $V^\pi(s')$ can be repeatedly unrolled by following the same equation.

    Let’s consider the following visitation sequence and label the probability of transitioning from state $s$ to state $x$ with policy $\pi_\theta$ after $k$ steps as $\rho^\pi(s \to x, k)$.

    $$s \xrightarrow[]{a \sim \pi_\theta(. \vert s)} s' \xrightarrow[]{a \sim \pi_\theta(. \vert s')} s'' \xrightarrow[]{a \sim \pi_\theta(. \vert s'')} \dots$$
    • When k = 0: $\rho^\pi(s \to s, k=0) = 1$.
    • When k = 1, we scan through all possible actions and sum up the transition probabilities to the target state: $\rho^\pi(s \to s', k=1) = \sum_a \pi_\theta(a \vert s) P(s' \vert s, a)$.
    • Imagine that the goal is to go from state s to x after k+1 steps while following policy $\pi_\theta$. We can first travel from s to a middle point s’ (any state can be a middle point, $s' \in \mathcal{S}$) after k steps and then go to the final state x during the last step. In this way, we are able to update the visitation probability recursively: $\rho^\pi(s \to x, k+1) = \sum_{s'} \rho^\pi(s \to s', k) \rho^\pi(s' \to x, 1)$.

    Then we go back to unroll the recursive representation of $\nabla_\theta V^\pi(s)$! Let $\phi(s) = \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a)$ to simplify the maths. If we keep on extending $\nabla_\theta V^\pi(.)$ infinitely, it is easy to find out that we can transition from the starting state s to any state after any number of steps in this unrolling process, and by summing up all the visitation probabilities, we get $\nabla_\theta V^\pi(s)$!

    $$
    \begin{aligned}
    \nabla_\theta V^\pi(s) &= \phi(s) + \sum_a \pi_\theta(a \vert s) \sum_{s'} P(s' \vert s, a) \nabla_\theta V^\pi(s') \\
    &= \phi(s) + \sum_{s'} \sum_a \pi_\theta(a \vert s) P(s' \vert s, a) \nabla_\theta V^\pi(s') \\
    &= \phi(s) + \sum_{s'} \rho^\pi(s \to s', 1) \nabla_\theta V^\pi(s') \\
    &= \phi(s) + \sum_{s'} \rho^\pi(s \to s', 1) \Big[ \phi(s') + \sum_{s''} \rho^\pi(s' \to s'', 1) \nabla_\theta V^\pi(s'') \Big] \\
    &= \phi(s) + \sum_{s'} \rho^\pi(s \to s', 1) \phi(s') + \sum_{s''} \rho^\pi(s \to s'', 2) \nabla_\theta V^\pi(s'') & \text{; Consider } s' \text{ as the middle point for } s \to s'' \\
    &= \phi(s) + \sum_{s'} \rho^\pi(s \to s', 1) \phi(s') + \sum_{s''} \rho^\pi(s \to s'', 2) \phi(s'') + \sum_{s'''} \rho^\pi(s \to s''', 3) \nabla_\theta V^\pi(s''') \\
    &= \dots & \text{; Repeatedly unrolling the part of } \nabla_\theta V^\pi(.) \\
    &= \sum_{x \in \mathcal{S}} \sum_{k=0}^\infty \rho^\pi(s \to x, k) \phi(x)
    \end{aligned}
    $$

    The nice rewriting above allows us to exclude the derivative of the Q-value function, $\nabla_\theta Q^\pi(s, a)$. By plugging it into the objective function $J(\theta)$, we get the following:

    $$
    \begin{aligned}
    \nabla_\theta J(\theta) &= \nabla_\theta V^\pi(s_0) & \text{; Starting from a random state } s_0 \\
    &= \sum_s \sum_{k=0}^\infty \rho^\pi(s_0 \to s, k) \phi(s) & \text{; Let } \eta(s) = \sum_{k=0}^\infty \rho^\pi(s_0 \to s, k) \\
    &= \sum_s \eta(s) \phi(s) & \\
    &= \Big( \sum_s \eta(s) \Big) \sum_s \frac{\eta(s)}{\sum_s \eta(s)} \phi(s) & \text{; Normalize } \eta(s), s \in \mathcal{S} \text{ to be a probability distribution.} \\
    &\propto \sum_s \frac{\eta(s)}{\sum_s \eta(s)} \phi(s) & \text{; } \sum_s \eta(s) \text{ is a constant} \\
    &= \sum_s d^\pi(s) \sum_a \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) & \text{; } d^\pi(s) = \frac{\eta(s)}{\sum_s \eta(s)} \text{ is the stationary distribution.}
    \end{aligned}
    $$

    In the episodic case, the constant of proportionality ($\sum_s \eta(s)$) is the average length of an episode; in the continuing case, it is 1 (Sutton & Barto, 2017; Sec. 13.2). The gradient can be further written as:

    $$
    \begin{aligned}
    \nabla_\theta J(\theta) &\propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a) \nabla_\theta \pi_\theta(a \vert s) & \\
    &= \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a \vert s) Q^\pi(s, a) \frac{\nabla_\theta \pi_\theta(a \vert s)}{\pi_\theta(a \vert s)} & \\
    &= \mathbb{E}_\pi [Q^\pi(s, a) \nabla_\theta \ln \pi_\theta(a \vert s)] & \text{; Because } (\ln x)' = 1/x
    \end{aligned}
    $$

    where $\mathbb{E}_\pi$ refers to $\mathbb{E}_{s \sim d^\pi, a \sim \pi_\theta}$ when both state and action distributions follow the policy $\pi_\theta$ (on-policy).

    The policy gradient theorem lays the theoretical foundation for various policy gradient algorithms. This vanilla policy gradient update has no bias but high variance. Many following algorithms were proposed to reduce the variance while keeping the bias unchanged.

    $$\nabla_\theta J(\theta) = \mathbb{E}_\pi [Q^\pi(s, a) \nabla_\theta \ln \pi_\theta(a \vert s)]$$
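
Before moving on, it may help to verify the log-derivative trick behind this identity numerically. The sketch below is my own toy example, not from the theorem itself: it checks $\nabla_\mu \mathbb{E}_{x \sim \mathcal{N}(\mu, 1)}[f(x)] = \mathbb{E}[f(x) \nabla_\mu \ln p(x; \mu)]$ for $f(x) = x^2$, whose true gradient is $2\mu$ because $\mathbb{E}[x^2] = \mu^2 + 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Score-function (REINFORCE) identity on a toy problem:
#   d/dmu E_{x ~ N(mu, 1)}[f(x)] = E[f(x) * d/dmu log p(x; mu)]
mu = 1.5
x = rng.normal(mu, 1.0, size=2_000_000)
score = x - mu                    # d/dmu log N(x; mu, 1) = (x - mu) / sigma^2
grad_est = np.mean(x**2 * score)  # Monte-Carlo estimate of the gradient

print(grad_est)  # close to the analytic gradient 2 * mu = 3.0
```

The estimator is unbiased but noisy, which is exactly the high-variance issue the rest of this post keeps attacking.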

    Here is a nice summary of a general form of policy gradient methods borrowed from the GAE (generalized advantage estimation) paper (Schulman et al., 2016), and this post thoroughly discusses several components of GAE; highly recommended.


    Fig. 1. A general form of policy gradient methods. (Image source: Schulman et al., 2016)

    Policy Gradient Algorithms

    Tons of policy gradient algorithms have been proposed during recent years and there is no way for me to exhaust them. I’m introducing some of them that I happened to know and read about.

    REINFORCE

    REINFORCE (Monte-Carlo policy gradient) relies on an estimated return by Monte-Carlo methods using episode samples to update the policy parameter $\theta$. REINFORCE works because the expectation of the sample gradient is equal to the actual gradient:

    $$\nabla_\theta J(\theta) = \mathbb{E}_\pi [Q^\pi(s, a) \nabla_\theta \ln \pi_\theta(a \vert s)] = \mathbb{E}_\pi [G_t \nabla_\theta \ln \pi_\theta(A_t \vert S_t)] \qquad \text{; Because } Q^\pi(S_t, A_t) = \mathbb{E}_\pi [G_t \vert S_t, A_t]$$

    Therefore we are able to measure GtGt from real sample trajectories and use that to update our policy gradient. It relies on a full trajectory and that’s why it is a Monte-Carlo method.

    The process is pretty straightforward:

    1. Initialize the policy parameter $\theta$ at random.
    2. Generate one trajectory on policy $\pi_\theta$: $S_1, A_1, R_2, S_2, A_2, \dots, S_T$.
    3. For $t = 1, 2, \dots, T$:
      1. Estimate the return $G_t$;
      2. Update policy parameters: $\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \ln \pi_\theta(A_t \vert S_t)$

    A widely used variation of REINFORCE is to subtract a baseline value from the return $G_t$ to reduce the variance of gradient estimation while keeping the bias unchanged (remember we always want to do this when possible). For example, a common baseline is to subtract the state-value from the action-value, and if applied, we would use the advantage $A(s, a) = Q(s, a) - V(s)$ in the gradient ascent update. This post nicely explains why a baseline works for reducing the variance, in addition to a set of fundamentals of policy gradient.
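
The procedure above can be sketched in a few lines of Python. This is a minimal, illustrative REINFORCE on a two-armed bandit with a softmax policy; all numbers are made up, and episodes are one step long, so $G_t$ is just the immediate reward and the $\gamma^t$ factor drops out.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Minimal REINFORCE on a 2-armed bandit: arm 1 pays reward 1, arm 0 pays 0.
theta = np.zeros(2)  # logits of the softmax policy
alpha = 0.1

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    G = float(a == 1)                 # return of this one-step episode
    grad_log_pi = -probs              # grad of log softmax: one_hot(a) - probs
    grad_log_pi[a] += 1.0
    theta += alpha * G * grad_log_pi  # theta <- theta + alpha * G_t * grad log pi

print(softmax(theta))  # probability of arm 1 ends up close to 1
```

With a baseline, we would subtract, e.g., a running average of past returns from `G` before the update; the expected gradient stays the same while its variance drops.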

    Actor-Critic

    Two main components in policy gradient are the policy model and the value function. It makes a lot of sense to learn the value function in addition to the policy, since knowing the value function can assist the policy update, such as by reducing gradient variance in vanilla policy gradients, and that is exactly what the Actor-Critic method does.

    Actor-critic methods consist of two models, which may optionally share parameters:

    • Critic updates the value function parameters $w$; depending on the algorithm, it could be the action-value $Q_w(a \vert s)$ or the state-value $V_w(s)$.
    • Actor updates the policy parameters $\theta$ for $\pi_\theta(a \vert s)$, in the direction suggested by the critic.

    Let’s see how it works in a simple action-value actor-critic algorithm.

    1. Initialize $s$, $\theta$, $w$ at random; sample $a \sim \pi_\theta(a \vert s)$.
    2. For $t = 1 \dots T$:
      1. Sample reward $r_t \sim R(s, a)$ and next state $s' \sim P(s' \vert s, a)$;
      2. Then sample the next action $a' \sim \pi_\theta(a' \vert s')$;
      3. Update the policy parameters: $\theta \leftarrow \theta + \alpha_\theta Q_w(s, a) \nabla_\theta \ln \pi_\theta(a \vert s)$;
      4. Compute the correction (TD error) for action-value at time t:
        $\delta_t = r_t + \gamma Q_w(s', a') - Q_w(s, a)$
        and use it to update the parameters of the action-value function:
        $w \leftarrow w + \alpha_w \delta_t \nabla_w Q_w(s, a)$
      5. Update $a \leftarrow a'$ and $s \leftarrow s'$.

    Two learning rates, $\alpha_\theta$ and $\alpha_w$, are predefined for the policy and value function parameter updates respectively.
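
A single iteration of the loop above, written out for a tabular case with a made-up transition and hyperparameters. In the tabular setting, $\nabla_w Q_w(s, a)$ is a one-hot indicator, so the critic update touches only the visited entry.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Tabular action-value actor-critic: Q is a table Q[s, a], the policy is a
# per-state softmax over logits theta[s, a]. One made-up transition:
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
theta = np.zeros((n_states, n_actions))
alpha_theta, alpha_w, gamma = 0.1, 0.5, 0.9

s, a, r, s_next, a_next = 0, 1, 1.0, 1, 0

# Actor: theta <- theta + alpha_theta * Q(s, a) * grad log pi(a|s)
grad_log_pi = -softmax(theta[s])
grad_log_pi[a] += 1.0
theta[s] += alpha_theta * Q[s, a] * grad_log_pi

# Critic: TD error, then w <- w + alpha_w * delta * grad Q (one-hot for a table)
delta = r + gamma * Q[s_next, a_next] - Q[s, a]
Q[s, a] += alpha_w * delta

print(delta, Q[s, a])  # with Q initialized to zero: delta = 1.0, Q[0, 1] = 0.5
```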

    Off-Policy Policy Gradient

    Both REINFORCE and the vanilla version of the actor-critic method are on-policy: training samples are collected according to the target policy, the very same policy that we try to optimize for. Off-policy methods, however, bring several additional advantages:

    1. The off-policy approach does not require full trajectories and can reuse any past episodes (“experience replay”) for much better sample efficiency.
    2. The sample collection follows a behavior policy different from the target policy, bringing better exploration.

    Now let’s see how the off-policy policy gradient is computed. The behavior policy for collecting samples is a known policy (predefined just like a hyperparameter), labelled as $\beta(a \vert s)$. The objective function sums up the reward over the state distribution defined by this behavior policy:

    $$J(\theta) = \sum_{s \in \mathcal{S}} d^\beta(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a) \pi_\theta(a \vert s) = \mathbb{E}_{s \sim d^\beta} \Big[ \sum_{a \in \mathcal{A}} Q^\pi(s, a) \pi_\theta(a \vert s) \Big]$$

    where $d^\beta(s)$ is the stationary distribution of the behavior policy $\beta$; recall that $d^\beta(s) = \lim_{t \to \infty} P(S_t = s \vert S_0, \beta)$; and $Q^\pi$ is the action-value function estimated with regard to the target policy $\pi$ (not the behavior policy!).

    Given that the training observations are sampled by $a \sim \beta(a \vert s)$, we can rewrite the gradient as:

    $$
    \begin{aligned}
    \nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{s \sim d^\beta} \Big[ \sum_{a \in \mathcal{A}} Q^\pi(s, a) \pi_\theta(a \vert s) \Big] & \\
    &= \mathbb{E}_{s \sim d^\beta} \Big[ \sum_{a \in \mathcal{A}} \big( Q^\pi(s, a) \nabla_\theta \pi_\theta(a \vert s) + \pi_\theta(a \vert s) \nabla_\theta Q^\pi(s, a) \big) \Big] & \text{; Derivative product rule.} \\
    &\stackrel{(i)}{\approx} \mathbb{E}_{s \sim d^\beta} \Big[ \sum_{a \in \mathcal{A}} Q^\pi(s, a) \nabla_\theta \pi_\theta(a \vert s) \Big] & \text{; Ignore the term } \pi_\theta(a \vert s) \nabla_\theta Q^\pi(s, a). \\
    &= \mathbb{E}_{s \sim d^\beta} \Big[ \sum_{a \in \mathcal{A}} \beta(a \vert s) \frac{\pi_\theta(a \vert s)}{\beta(a \vert s)} Q^\pi(s, a) \frac{\nabla_\theta \pi_\theta(a \vert s)}{\pi_\theta(a \vert s)} \Big] & \\
    &= \mathbb{E}_\beta \Big[ \frac{\pi_\theta(a \vert s)}{\beta(a \vert s)} Q^\pi(s, a) \nabla_\theta \ln \pi_\theta(a \vert s) \Big] & \text{; } \tfrac{\pi_\theta(a \vert s)}{\beta(a \vert s)} \text{ is the importance weight.}
    \end{aligned}
    $$

    where $\frac{\pi_\theta(a \vert s)}{\beta(a \vert s)}$ is the importance weight. Because $Q^\pi$ is a function of the target policy and thus a function of the policy parameter $\theta$, we should take the derivative $\nabla_\theta Q^\pi(s, a)$ as well according to the product rule. However, it is super hard to compute $\nabla_\theta Q^\pi(s, a)$ in reality. Fortunately, if we use an approximated gradient with the gradient of Q ignored, we still guarantee the policy improvement and eventually achieve the true local optimum. This is justified in the proof here (Degris, White & Sutton, 2012).

    In summary, when applying policy gradient in the off-policy setting, we can simply adjust it with a weighted sum, where the weight is the ratio of the target policy to the behavior policy, $\frac{\pi_\theta(a \vert s)}{\beta(a \vert s)}$.
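
A quick sanity check of the importance weight with my own toy numbers (a single state, three actions): sampling actions from $\beta$ and re-weighting by $\pi_\theta / \beta$ recovers an expectation under $\pi_\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up target policy, behavior policy, and action values for one state.
pi = np.array([0.7, 0.2, 0.1])      # target policy pi_theta(a|s)
beta = np.array([1/3, 1/3, 1/3])    # behavior policy beta(a|s)
Q = np.array([1.0, 0.0, -1.0])      # some action values

a = rng.choice(3, p=beta, size=500_000)  # actions sampled from beta, not pi
weights = pi[a] / beta[a]                # importance weight pi(a|s) / beta(a|s)
est = np.mean(weights * Q[a])            # weighted estimate of E_{a~pi}[Q]

true_value = np.sum(pi * Q)              # 0.7*1 + 0.2*0 + 0.1*(-1) = 0.6
print(est, true_value)
```

The estimator is unbiased, but its variance grows when $\pi_\theta$ and $\beta$ disagree strongly; this is why later algorithms such as ACER truncate the weights.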

    A3C

    [paper|code]

    Asynchronous Advantage Actor-Critic (Mnih et al., 2016), short for A3C, is a classic policy gradient method with a special focus on parallel training.

    In A3C, the critics learn the value function while multiple actors are trained in parallel and get synced with global parameters from time to time. Hence, A3C is designed to work well for parallel training.

    Let’s use the state-value function as an example. The loss function for the state value is to minimize the mean squared error, $J_v(w) = (G_t - V_w(s))^2$, and gradient descent can be applied to find the optimal $w$. This state-value function is used as the baseline in the policy gradient update.

    Here is the algorithm outline:

    1. We have global parameters, $\theta$ and $w$; and similar thread-specific parameters, $\theta'$ and $w'$.
    2. Initialize the time step $t = 1$.
    3. While $T \leq T_{MAX}$:
      1. Reset gradients: $d\theta = 0$ and $dw = 0$.
      2. Synchronize thread-specific parameters with global ones: $\theta' = \theta$ and $w' = w$.
      3. $t_{start} = t$ and sample a starting state $s_t$.
      4. While ($s_t$ != TERMINAL) and $t - t_{start} \leq t_{max}$:
        1. Pick the action $A_t \sim \pi_{\theta'}(A_t \vert S_t)$ and receive a new reward $R_t$ and a new state $s_{t+1}$.
        2. Update $t = t + 1$ and $T = T + 1$.
      5. Initialize the variable that holds the return estimation: $R = \begin{cases} 0 & \text{if } s_t \text{ is TERMINAL} \\ V_{w'}(s_t) & \text{otherwise} \end{cases}$
      6. For $i = t - 1, \dots, t_{start}$:
        1. $R \leftarrow \gamma R + R_i$; here R is a MC measure of $G_i$.
        2. Accumulate gradients w.r.t. $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi_{\theta'}(a_i \vert s_i)(R - V_{w'}(s_i))$;
          Accumulate gradients w.r.t. $w'$: $dw \leftarrow dw + 2 (R - V_{w'}(s_i)) \nabla_{w'} (R - V_{w'}(s_i))$.
      7. Update asynchronously $\theta$ using $d\theta$, and $w$ using $dw$.

    A3C enables parallelism across multiple training agents. The gradient accumulation step (6.2) can be considered as a parallelized reformulation of a minibatch-based stochastic gradient update: the values of $w$ or $\theta$ get corrected by a little bit in the direction of each training thread independently.
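
The backward return accumulation in steps 5-6 can be sketched as follows (a hypothetical helper of mine; the rewards and the bootstrap value are made up). If the rollout ended at a terminal state, we would pass `bootstrap_value=0`.

```python
# A3C-style backward accumulation: bootstrap R from V(s_t) unless terminal,
# then repeatedly apply R <- gamma * R + R_i going backwards over the rollout.
def n_step_returns(rewards, bootstrap_value, gamma):
    R = bootstrap_value            # 0 if the last state is terminal
    returns = []
    for r in reversed(rewards):    # i = t-1, ..., t_start
        R = gamma * R + r
        returns.append(R)
    return list(reversed(returns))

rs = [1.0, 0.0, 2.0]               # rewards collected by one worker rollout
print(n_step_returns(rs, bootstrap_value=10.0, gamma=0.9))
```

Each worker would pair these returns with its stored $V_{w'}(s_i)$ values to form the advantages used in the gradient accumulation.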

    A2C

    [paper|code]

    A2C is a synchronous, deterministic version of A3C; that’s why it is named as “A2C” with the first “A” (“asynchronous”) removed. In A3C each agent talks to the global parameters independently, so it is possible sometimes the thread-specific agents would be playing with policies of different versions and therefore the aggregated update would not be optimal. To resolve the inconsistency, a coordinator in A2C waits for all the parallel actors to finish their work before updating the global parameters; then in the next iteration the parallel actors start from the same policy. The synchronized gradient update keeps the training more cohesive and potentially makes convergence faster.

    A2C has been shown to be able to utilize GPUs more efficiently and work better with large batch sizes while achieving the same or better performance than A3C.


    Fig. 2. The architecture of A3C versus A2C.

    DPG

    [paper|code]

    In the methods described above, the policy function $\pi(. \vert s)$ is always modeled as a probability distribution over actions $\mathcal{A}$ given the current state and thus it is stochastic. Deterministic policy gradient (DPG) instead models the policy as a deterministic decision: $a = \mu(s)$. It may look bizarre: how can you calculate the gradient of the policy function when it outputs a single action? Let’s look into it step by step.

    Refresh on a few notations to facilitate the discussion:

    • $\rho_0(s)$: The initial distribution over states.
    • $\rho^\mu(s \to s', k)$: Starting from state s, the visitation probability density at state s’ after moving k steps by policy $\mu$.
    • $\rho^\mu(s')$: Discounted state distribution, defined as $\rho^\mu(s') = \int_\mathcal{S} \sum_{k=1}^\infty \gamma^{k-1} \rho_0(s) \rho^\mu(s \to s', k) \, ds$.

    The objective function to optimize for is listed as follows:

    $$J(\theta) = \int_\mathcal{S} \rho^\mu(s) Q(s, \mu_\theta(s)) \, ds$$

    Deterministic policy gradient theorem: Now it is the time to compute the gradient! According to the chain rule, we first take the gradient of Q w.r.t. the action a and then take the gradient of the deterministic policy function μ w.r.t. θ:

    $$\nabla_\theta J(\theta) = \int_\mathcal{S} \rho^\mu(s) \nabla_a Q^\mu(s, a) \nabla_\theta \mu_\theta(s) \big\vert_{a = \mu_\theta(s)} \, ds = \mathbb{E}_{s \sim \rho^\mu} \big[ \nabla_a Q^\mu(s, a) \nabla_\theta \mu_\theta(s) \big\vert_{a = \mu_\theta(s)} \big]$$

    We can consider the deterministic policy as a special case of the stochastic one, when the probability distribution contains only one extreme non-zero value over one action. Actually, in the DPG paper, the authors have shown that if the stochastic policy $\pi_{\mu_\theta, \sigma}$ is re-parameterized by a deterministic policy $\mu_\theta$ and a variation variable $\sigma$, the stochastic policy is eventually equivalent to the deterministic case when $\sigma = 0$. Compared to the deterministic policy, we expect the stochastic policy to require more samples as it integrates the data over the whole state and action space.

    The deterministic policy gradient theorem can be plugged into common policy gradient frameworks.

    Let’s consider an example of an on-policy actor-critic algorithm to showcase the procedure. In each iteration of the on-policy actor-critic, two actions are taken deterministically, $a = \mu_\theta(s)$, and the SARSA update on policy parameters relies on the new gradient that we just computed above:

    $$
    \begin{aligned}
    \delta_t &= R_t + \gamma Q_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t) & \text{; TD error in SARSA} \\
    w_{t+1} &= w_t + \alpha_w \delta_t \nabla_w Q_w(s_t, a_t) & \\
    \theta_{t+1} &= \theta_t + \alpha_\theta \nabla_a Q_w(s_t, a_t) \nabla_\theta \mu_\theta(s) \big\vert_{a = \mu_\theta(s)} & \text{; Deterministic policy gradient theorem}
    \end{aligned}
    $$

    However, unless there is sufficient noise in the environment, it is very hard to guarantee enough exploration due to the determinacy of the policy. We can either add noise into the policy (ironically this makes it nondeterministic!) or learn it off-policy by following a different stochastic behavior policy to collect samples.

    Say, in the off-policy approach, the training trajectories are generated by a stochastic policy $\beta(a \vert s)$ and thus the state distribution follows the corresponding discounted state density $\rho^\beta$:

    $$
    \begin{aligned}
    J_\beta(\theta) &= \int_\mathcal{S} \rho^\beta(s) Q^\mu(s, \mu_\theta(s)) \, ds \\
    \nabla_\theta J_\beta(\theta) &= \mathbb{E}_{s \sim \rho^\beta} \big[ \nabla_a Q^\mu(s, a) \nabla_\theta \mu_\theta(s) \big\vert_{a = \mu_\theta(s)} \big]
    \end{aligned}
    $$

    Note that because the policy is deterministic, we only need $Q^\mu(s, \mu_\theta(s))$ rather than $\sum_a \pi(a \vert s) Q^\pi(s, a)$ as the estimated reward of a given state s. In the off-policy approach with a stochastic policy, importance sampling is often used to correct the mismatch between the behavior and target policies, as what we have described above. However, because the deterministic policy gradient removes the integral over actions, we can avoid importance sampling.
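
The chain rule at the heart of the deterministic policy gradient theorem is easy to check numerically on a 1-D toy problem (all functions below are made up for illustration): the analytic product $\nabla_a Q \cdot \nabla_\theta \mu_\theta$ matches a finite-difference estimate of $\nabla_\theta Q(s, \mu_\theta(s))$.

```python
# Toy 1-D setup: mu_theta(s) = theta * s and Q(s, a) = -(a - 2*s)^2,
# so grad_a Q = -2*(a - 2*s) and grad_theta mu = s.
def Q(s, a):
    return -(a - 2.0 * s) ** 2

def dpg_gradient(theta, s):
    a = theta * s
    grad_a_Q = -2.0 * (a - 2.0 * s)   # evaluated at a = mu_theta(s)
    grad_theta_mu = s
    return grad_a_Q * grad_theta_mu   # chain rule of the DPG theorem

theta, s, eps = 0.5, 1.3, 1e-6
numeric = (Q(s, (theta + eps) * s) - Q(s, (theta - eps) * s)) / (2 * eps)
print(dpg_gradient(theta, s), numeric)  # the two estimates should agree
```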

    DDPG

    [paper|code]

    DDPG (Lillicrap, et al., 2015), short for Deep Deterministic Policy Gradient, is a model-free off-policy actor-critic algorithm, combining DPG with DQN. Recall that DQN (Deep Q-Network) stabilizes the learning of Q-function by experience replay and the frozen target network. The original DQN works in discrete space, and DDPG extends it to continuous space with the actor-critic framework while learning a deterministic policy.

    In order to do better exploration, an exploration policy $\mu'$ is constructed by adding noise $\mathcal{N}$:

    $$\mu'(s) = \mu_\theta(s) + \mathcal{N}$$

    In addition, DDPG does soft updates (“conservative policy iteration”) on the parameters of both the actor and the critic, with $\tau \ll 1$: $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$. In this way, the target network values are constrained to change slowly, different from the design in DQN where the target network stays frozen for some period of time.
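
The soft update is one line of code; here is a sketch with made-up parameter vectors standing in for network weights, showing the target slowly tracking the online network.

```python
import numpy as np

# Soft ("Polyak") target update used by DDPG: theta' <- tau*theta + (1-tau)*theta'.
def soft_update(target, online, tau=0.005):
    return tau * online + (1.0 - tau) * target

target_w = np.zeros(3)   # stand-in for frozen target network weights
online_w = np.ones(3)    # stand-in for the currently trained weights
for _ in range(1000):    # the target drifts slowly toward the online network
    target_w = soft_update(target_w, online_w)

print(target_w)  # close to online_w after many updates, never overshooting
```

A small $\tau$ makes the TD targets nearly stationary between updates, which is the stabilizing effect DDPG borrows from DQN's frozen target network.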

    One detail in the paper that is particularly useful in robotics is on how to normalize the different physical units of low dimensional features. For example, a model is designed to learn a policy with the robot’s positions and velocities as input; these physical statistics are different by nature and even statistics of the same type may vary a lot across multiple robots. Batch normalization is applied to fix it by normalizing every dimension across samples in one minibatch.


    Fig 3. DDPG Algorithm. (Image source: Lillicrap, et al., 2015)

    D4PG

    [paper|code (Search “github d4pg” and you will see a few.)]

    Distributed Distributional DDPG (D4PG) applies a set of improvements on DDPG to make it run in a distributed, distributional fashion.

    (1) Distributional Critic: The critic estimates the expected Q value as a random variable with a distribution $Z_w$ parameterized by $w$, and therefore $Q_w(s, a) = \mathbb{E} Z_w(s, a)$. The loss for learning the distribution parameter is to minimize some measure of the distance between two distributions, the distributional TD error: $L(w) = \mathbb{E}[d(\mathcal{T}_{\mu_\theta} Z_{w'}(s, a), Z_w(s, a))]$, where $\mathcal{T}_{\mu_\theta}$ is the Bellman operator.

    The deterministic policy gradient update becomes:

    $$
    \begin{aligned}
    \nabla_\theta J(\theta) &\approx \mathbb{E}_{\rho^\mu} \big[ \nabla_a Q_w(s, a) \nabla_\theta \mu_\theta(s) \big\vert_{a = \mu_\theta(s)} \big] & \text{; gradient update in DPG} \\
    &= \mathbb{E}_{\rho^\mu} \big[ \mathbb{E}[\nabla_a Z_w(s, a)] \nabla_\theta \mu_\theta(s) \big\vert_{a = \mu_\theta(s)} \big] & \text{; expectation of the Q-value distribution.}
    \end{aligned}
    $$

    (2) N-step returns: When calculating the TD error, D4PG computes an N-step TD target rather than a one-step one, to incorporate rewards from more future steps. Thus the new TD target is:

    $$r(s_0, a_0) + \mathbb{E}\Big[ \sum_{n=1}^{N-1} \gamma^n r(s_n, a_n) + \gamma^N Q(s_N, \mu_\theta(s_N)) \,\Big\vert\, s_0, a_0 \Big]$$

    (3) Multiple Distributed Parallel Actors: D4PG utilizes $K$ independent actors, gathering experience in parallel and feeding data into the same replay buffer.

    (4) Prioritized Experience Replay (PER): The last piece of modification is to do sampling from the replay buffer of size $R$ with a non-uniform probability $p_i$. In this way, a sample $i$ has the probability $(R p_i)^{-1}$ to be selected and thus the importance weight is $(R p_i)^{-1}$.


    Fig. 4. D4PG algorithm (Image source: Barth-Maron, et al. 2018). Note that in the original paper, the variable letters are chosen slightly differently from those in this post; i.e. I use $\mu(.)$ for representing a deterministic policy instead of $\pi(.)$.

    MADDPG

    [paper|code]

    Multi-agent DDPG (MADDPG) (Lowe et al., 2017) extends DDPG to an environment where multiple agents are coordinating to complete tasks with only local information. From the viewpoint of one agent, the environment is non-stationary as the policies of other agents are quickly upgraded and remain unknown. MADDPG is an actor-critic model redesigned particularly for handling such a changing environment and the interactions between agents.

    The problem can be formalized in the multi-agent version of MDP, also known as Markov games. Say, there are N agents in total with a set of states $\mathcal{S}$. Each agent owns a set of possible actions, $\mathcal{A}_1, \dots, \mathcal{A}_N$, and a set of observations, $\mathcal{O}_1, \dots, \mathcal{O}_N$. The state transition function involves all states, action and observation spaces: $\mathcal{T}: \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \mapsto \mathcal{S}$. Each agent’s stochastic policy only involves its own state and action: $\pi_{\theta_i}: \mathcal{O}_i \times \mathcal{A}_i \mapsto [0, 1]$, a probability distribution over actions given its own observation, or a deterministic policy: $\mu_{\theta_i}: \mathcal{O}_i \mapsto \mathcal{A}_i$.

    Let $\vec{o} = o_1, \dots, o_N$, $\vec{\mu} = \mu_1, \dots, \mu_N$ and the policies are parameterized by $\vec{\theta} = \theta_1, \dots, \theta_N$.

    The critic in MADDPG learns a centralized action-value function $Q_i^{\vec{\mu}}(\vec{o}, a_1, \dots, a_N)$ for the i-th agent, where $a_1 \in \mathcal{A}_1, \dots, a_N \in \mathcal{A}_N$ are actions of all agents. Each $Q_i^{\vec{\mu}}$ is learned separately for $i = 1, \dots, N$ and therefore multiple agents can have arbitrary reward structures, including conflicting rewards in a competitive setting. Meanwhile, multiple actors, one for each agent, are exploring and upgrading the policy parameters $\theta_i$ on their own.

    Actor update:

    $$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\vec{o}, a \sim D} \big[ \nabla_{a_i} Q_i^{\vec{\mu}}(\vec{o}, a_1, \dots, a_N) \nabla_{\theta_i} \mu_{\theta_i}(o_i) \big\vert_{a_i = \mu_{\theta_i}(o_i)} \big]$$

    where $D$ is the memory buffer for experience replay, containing multiple episode samples $(\vec{o}, a_1, \dots, a_N, r_1, \dots, r_N, \vec{o}')$: given the current observation $\vec{o}$, agents take actions $a_1, \dots, a_N$ and get rewards $r_1, \dots, r_N$, leading to the new observation $\vec{o}'$.

    Critic update:

    $$
    \begin{aligned}
    L(\theta_i) &= \mathbb{E}_{\vec{o}, a_1, \dots, a_N, r_1, \dots, r_N, \vec{o}'} \big[ (Q_i^{\vec{\mu}}(\vec{o}, a_1, \dots, a_N) - y)^2 \big] & \\
    \text{where } y &= r_i + \gamma Q_i^{\vec{\mu}'}(\vec{o}', a_1', \dots, a_N') \big\vert_{a_j' = \mu'_{\theta_j}} & \text{; TD target!}
    \end{aligned}
    $$

    where $\vec{\mu}'$ are the target policies with delayed softly-updated parameters.

    If the policies $\vec{\mu}$ are unknown during the critic update, we can ask each agent to learn and evolve its own approximation of the others’ policies. Using the approximated policies, MADDPG can still learn efficiently although the inferred policies might not be accurate.

    To mitigate the high variance triggered by the interaction between competing or collaborating agents in the environment, MADDPG proposed one more element - policy ensembles:

    1. Train K policies for one agent;
    2. Pick a random policy for episode rollouts;
    3. Take an ensemble of these K policies to do gradient update.

    In summary, MADDPG added three additional ingredients on top of DDPG to make it adapt to the multi-agent environment:

    • Centralized critic + decentralized actors;
    • Actors are able to use estimated policies of other agents for learning;
    • Policy ensembling is good for reducing variance.


    Fig. 5. The architecture design of MADDPG. (Image source: Lowe et al., 2017)

    TRPO

    [paper|code]

    To improve training stability, we should avoid parameter updates that change the policy too much at one step. Trust region policy optimization (TRPO) (Schulman, et al., 2015) carries out this idea by enforcing a KL divergence constraint on the size of policy update at each iteration.

    If off policy, the objective function measures the total advantage over the state visitation distribution and actions, while the rollout is following a different behavior policy $\beta(a \vert s)$:

    $$
    \begin{aligned}
    J(\theta) &= \sum_{s \in \mathcal{S}} \rho^{\pi_{\theta_{old}}} \sum_{a \in \mathcal{A}} \big( \pi_\theta(a \vert s) \hat{A}_{\theta_{old}}(s, a) \big) & \\
    &= \sum_{s \in \mathcal{S}} \rho^{\pi_{\theta_{old}}} \sum_{a \in \mathcal{A}} \Big( \beta(a \vert s) \frac{\pi_\theta(a \vert s)}{\beta(a \vert s)} \hat{A}_{\theta_{old}}(s, a) \Big) & \text{; Importance sampling} \\
    &= \mathbb{E}_{s \sim \rho^{\pi_{\theta_{old}}}, a \sim \beta} \Big[ \frac{\pi_\theta(a \vert s)}{\beta(a \vert s)} \hat{A}_{\theta_{old}}(s, a) \Big]
    \end{aligned}
    $$

    where $\theta_{old}$ is the policy parameters before the update and thus known to us; $\rho^{\pi_{\theta_{old}}}$ is defined in the same way as above; $\beta(a \vert s)$ is the behavior policy for collecting trajectories. Note that we use an estimated advantage $\hat{A}(.)$ rather than the true advantage function $A(.)$ because the true rewards are usually unknown.

    If on policy, the behavior policy is $\pi_{\theta_{old}}(a \vert s)$:

    $$J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}} \Big[ \frac{\pi_\theta(a \vert s)}{\pi_{\theta_{old}}(a \vert s)} \hat{A}_{\theta_{old}}(s, a) \Big]$$

    TRPO aims to maximize the objective function $J(\theta)$ subject to a trust region constraint, which enforces the distance between the old and new policies measured by KL-divergence to be small enough, within a parameter $\delta$:

    $$\mathbb{E}_{s \sim \rho^{\pi_{\theta_{old}}}} \big[ D_{KL}(\pi_{\theta_{old}}(. \vert s) \,\|\, \pi_\theta(. \vert s)) \big] \leq \delta$$

In this way, the old and new policies would not diverge too much when this hard constraint is met. Meanwhile, TRPO can still guarantee a monotonic improvement over policy iteration (neat, right?). Please read the proof in the paper if interested :)
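To make the surrogate objective and the constraint concrete, here is a minimal numpy sketch of how both can be estimated from a batch sampled under $\pi_{\theta_{old}}$; the function name and batch layout are my own for illustration, not from the paper:

```python
import numpy as np

def trpo_surrogate_and_kl(logp_new, logp_old, advantages):
    """Monte-Carlo estimates of the on-policy TRPO surrogate objective and
    the KL constraint, from a batch of (s, a) pairs sampled under pi_theta_old.

    logp_new:   log pi_theta(a|s) for the sampled actions.
    logp_old:   log pi_theta_old(a|s) (fixed; no gradient would flow here).
    advantages: advantage estimates A_hat_theta_old(s, a).
    """
    ratio = np.exp(logp_new - logp_old)       # pi_theta / pi_theta_old
    surrogate = np.mean(ratio * advantages)   # estimate of J(theta)
    # Sample-based estimate of E[KL(pi_old || pi_theta)] at the sampled actions;
    # noisy (it can even be negative on a small batch), but unbiased.
    kl = np.mean(logp_old - logp_new)
    return surrogate, kl
```

A full implementation would then maximize the surrogate subject to `kl <= delta`, e.g. with conjugate gradient plus a backtracking line search as in the paper.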

    PPO

    [paper|code]

    Given that TRPO is relatively complicated and we still want to implement a similar constraint, proximal policy optimization (PPO) simplifies it by using a clipped surrogate objective while retaining similar performance.

    First, let’s denote the probability ratio between old and new policies as:

$$
r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}
$$

    Then, the objective function of TRPO (on policy) becomes:

$$
J^{TRPO}(\theta) = \mathbb{E}\big[ r(\theta) \hat{A}_{\theta_{old}}(s, a) \big]
$$

Without a limitation on the distance between $\theta_{old}$ and $\theta$, maximizing $J^{TRPO}(\theta)$ would lead to instability with extremely large parameter updates and big policy ratios. PPO imposes the constraint by forcing $r(\theta)$ to stay within a small interval around 1, precisely $[1-\epsilon, 1+\epsilon]$, where ε is a hyperparameter.

$$
J^{CLIP}(\theta) = \mathbb{E}\big[ \min\big( r(\theta) \hat{A}_{\theta_{old}}(s, a),\; \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_{\theta_{old}}(s, a) \big) \big]
$$

The function $\text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)$ clips the ratio so that it stays within $[1-\epsilon, 1+\epsilon]$. The objective function of PPO takes the minimum of the original value and the clipped version, so we lose the incentive to push the policy update to extremes for better rewards.
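The clipped objective is easy to compute over a sampled batch. Here is a short numpy sketch (the function name and batch layout are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantages, eps=0.2):
    """Clipped surrogate objective J_CLIP, averaged over a batch.

    ratio:      r(theta) = pi_theta(a|s) / pi_theta_old(a|s), per sample.
    advantages: A_hat_theta_old(s, a), per sample.
    """
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the elementwise minimum removes the incentive to push the
    # ratio far outside [1 - eps, 1 + eps] in either advantage direction.
    return np.mean(np.minimum(unclipped, clipped))
```

With a positive advantage the objective saturates once the ratio exceeds $1+\epsilon$; with a negative advantage it saturates below $1-\epsilon$.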

When applying PPO to a network architecture with shared parameters for both the policy (actor) and value (critic) functions, in addition to the clipped surrogate objective, the objective function is augmented with an error term on the value estimation and an entropy term to encourage sufficient exploration.

$$
J^{CLIP'}(\theta) = \mathbb{E}\big[ J^{CLIP}(\theta) - c_1 (V_\theta(s) - V_{target})^2 + c_2 H(s, \pi_\theta(\cdot)) \big]
$$

where $c_1$ and $c_2$ are two hyperparameter constants.

    PPO has been tested on a set of benchmark tasks and proved to produce awesome results with much greater simplicity.

    ACER

    [paper|code]

    ACER, short for actor-critic with experience replay (Wang, et al., 2017), is an off-policy actor-critic model with experience replay, greatly increasing the sample efficiency and decreasing the data correlation. A3C builds up the foundation for ACER, but it is on policy; ACER is A3C’s off-policy counterpart. The major obstacle to making A3C off policy is how to control the stability of the off-policy estimator. ACER proposes three designs to overcome it:

    • Use Retrace Q-value estimation;
    • Truncate the importance weights with bias correction;
    • Apply efficient TRPO.

    Retrace Q-value Estimation

    Retrace is an off-policy return-based Q-value estimation algorithm with a nice guarantee for convergence for any target and behavior policy pair (π, β), plus good data efficiency.

    Recall how TD learning works for prediction:

1. Compute the TD error: $\delta_t = R_t + \gamma \mathbb{E}_{a \sim \pi} Q(S_{t+1}, a) - Q(S_t, A_t)$; the term $R_t + \gamma \mathbb{E}_{a \sim \pi} Q(S_{t+1}, a)$ is known as the "TD target". The expectation $\mathbb{E}_{a \sim \pi}$ is used because for the future step the best estimate we can make is what the return would be if we followed the current policy π.
2. Update the value by correcting the error to move toward the goal: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \delta_t$. In other words, the incremental update on Q is proportional to the TD error: $\Delta Q(S_t, A_t) = \alpha \delta_t$.
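The two steps above can be sketched in tabular form; this is a minimal illustration with an exact expectation over a discrete action space, not production code:

```python
import numpy as np

def td_update(Q, s, a, r, s_next, pi_next, alpha=0.1, gamma=0.99):
    """One tabular TD update toward the expected TD target
    R_t + gamma * E_{a ~ pi} Q(s', a).

    Q:       (n_states, n_actions) action-value table, updated in place.
    pi_next: pi(.|s') as a probability vector over actions.
    """
    td_target = r + gamma * np.dot(pi_next, Q[s_next])  # E_{a~pi} Q(s', a)
    td_error = td_target - Q[s, a]                      # delta_t
    Q[s, a] += alpha * td_error                         # Q <- Q + alpha * delta_t
    return td_error
```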

    When the rollout is off policy, we need to apply importance sampling on the Q update:

$$
\Delta Q^{imp}(S_t, A_t) = \gamma^t \prod_{1 \le \tau \le t} \frac{\pi(A_\tau|S_\tau)}{\beta(A_\tau|S_\tau)} \delta_t
$$

The product of importance weights looks pretty scary when we start imagining how it can cause super high variance or even explode. The Retrace Q-value estimation method modifies $\Delta Q$ to have the importance weights truncated by no more than a constant c:

$$
\Delta Q^{ret}(S_t, A_t) = \gamma^t \prod_{1 \le \tau \le t} \min\Big(c, \frac{\pi(A_\tau|S_\tau)}{\beta(A_\tau|S_\tau)}\Big) \delta_t
$$

ACER uses $Q^{ret}$ as the target to train the critic by minimizing the L2 error term $(Q^{ret}(s, a) - Q_w(s, a))^2$.
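As a small sketch, the per-step Retrace increments along one off-policy trajectory can be computed with a running product of truncated importance weights. Indexing conventions vary across papers; here the discount starts at $t = 0$ and the product includes the current step:

```python
import numpy as np

def retrace_increments(pi_probs, beta_probs, td_errors, gamma=0.99, c=1.0):
    """Per-step Retrace increments Delta Q_ret(S_t, A_t):
    gamma^t * prod_{tau <= t} min(c, pi/beta) * delta_t.

    pi_probs, beta_probs: probabilities of the taken actions under the
    target and behavior policies; td_errors: delta_t per step.
    """
    truncated = np.minimum(c, pi_probs / beta_probs)  # min(c, importance weight)
    running_prod = np.cumprod(truncated)              # product of truncated weights
    discounts = gamma ** np.arange(len(td_errors))    # gamma^t
    return discounts * running_prod * td_errors
```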

    Importance weights truncation

To reduce the high variance of the policy gradient $\hat{g}$, ACER truncates the importance weights by a constant c, plus a correction term. The label $\hat{g}_t^{acer}$ denotes the ACER policy gradient at time t.

$$
\begin{aligned}
\hat{g}_t^{acer} &= \omega_t \big( Q^{ret}(S_t, A_t) - V_w(S_t) \big) \nabla_\theta \ln \pi_\theta(A_t|S_t) & \text{; let } \omega_t = \frac{\pi(A_t|S_t)}{\beta(A_t|S_t)} \\
&= \min(c, \omega_t) \big( Q^{ret}(S_t, A_t) - V_w(S_t) \big) \nabla_\theta \ln \pi_\theta(A_t|S_t) \\
&\quad + \mathbb{E}_{a \sim \pi} \Big[ \max\Big(0, \frac{\omega_t(a) - c}{\omega_t(a)}\Big) \big( Q_w(S_t, a) - V_w(S_t) \big) \nabla_\theta \ln \pi_\theta(a|S_t) \Big] & \text{; let } \omega_t(a) = \frac{\pi(a|S_t)}{\beta(a|S_t)}
\end{aligned}
$$

where $Q_w(\cdot)$ and $V_w(\cdot)$ are value functions predicted by the critic with parameter w. The first term contains the clipped importance weight; the clipping helps reduce the variance, in addition to subtracting the state value function $V_w(\cdot)$ as a baseline. The second term makes a correction to keep the estimation unbiased.
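The truncation and correction coefficients satisfy a simple identity, $\omega = \min(c, \omega) + \max\big(0, \frac{\omega - c}{\omega}\big)\,\omega$, which is why the two terms together stay unbiased. A tiny sketch to check this split (function name is my own):

```python
import numpy as np

def acer_weight_split(omega, c=10.0):
    """Split an importance weight omega into the truncated part used by the
    main ACER gradient term and the coefficient of the bias-correction term.
    """
    truncated = np.minimum(c, omega)                    # clipped importance weight
    correction = np.maximum(0.0, (omega - c) / omega)   # bias-correction coefficient
    return truncated, correction
```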

    Efficient TRPO

    Furthermore, ACER adopts the idea of TRPO but with a small adjustment to make it more computationally efficient: rather than measuring the KL divergence between policies before and after one update, ACER maintains a running average of past policies and forces the updated policy to not deviate far from this average.

    The ACER paper is pretty dense with many equations. Hopefully, with the prior knowledge on TD learning, Q-learning, importance sampling and TRPO, you will find the paper slightly easier to follow :)

ACKTR

    [paper|code]

ACKTR (actor-critic using Kronecker-factored trust region) (Yuhuai Wu, et al., 2017) proposed to use Kronecker-factored approximation curvature (K-FAC) to do the gradient update for both the critic and actor. K-FAC made an improvement on the computation of the natural gradient, which is quite different from our standard gradient. Here is a nice, intuitive explanation of natural gradient. A one-sentence summary is probably:

    “we first consider all combinations of parameters that result in a new network a constant KL divergence away from the old network. This constant value can be viewed as the step size or learning rate. Out of all these possible combinations, we choose the one that minimizes our loss function.”

I listed ACKTR here mainly for the completeness of this post, but I would not dive into details, as it involves a lot of theoretical knowledge on natural gradient and optimization methods. If interested, check these papers/posts before reading the ACKTR paper:

    Here is a high level summary from the K-FAC paper:

    “This approximation is built in two stages. In the first, the rows and columns of the Fisher are divided into groups, each of which corresponds to all the weights in a given layer, and this gives rise to a block-partitioning of the matrix. These blocks are then approximated as Kronecker products between much smaller matrices, which we show is equivalent to making certain approximating assumptions regarding the statistics of the network’s gradients.

    In the second stage, this matrix is further approximated as having an inverse which is either block-diagonal or block-tridiagonal. We justify this approximation through a careful examination of the relationships between inverse covariances, tree-structured graphical models, and linear regression. Notably, this justification doesn’t apply to the Fisher itself, and our experiments confirm that while the inverse Fisher does indeed possess this structure (approximately), the Fisher itself does not.”

    SAC

    [paper|code]

    Soft Actor-Critic (SAC) (Haarnoja et al. 2018) incorporates the entropy measure of the policy into the reward to encourage exploration: we expect to learn a policy that acts as randomly as possible while it is still able to succeed at the task. It is an off-policy actor-critic model following the maximum entropy reinforcement learning framework. A precedent work is Soft Q-learning.

    Three key components in SAC:

    • An actor-critic architecture with separate policy and value function networks;
    • An off-policy formulation that enables reuse of previously collected data for efficiency;
    • Entropy maximization to enable stability and exploration.

    The policy is trained with the objective to maximize the expected return and the entropy at the same time:

$$
J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi_\theta}} \big[ r(s_t, a_t) + \alpha H(\pi_\theta(\cdot|s_t)) \big]
$$

where $H(\cdot)$ is the entropy measure and $\alpha$ controls how important the entropy term is, known as the temperature parameter. Entropy maximization leads to policies that can (1) explore more and (2) capture multiple modes of near-optimal strategies (i.e., if there exist multiple options that seem to be equally good, the policy should assign each an equal probability of being chosen).

    Precisely, SAC aims to learn three functions:

• The policy $\pi_\theta$ with parameter $\theta$.
• The soft Q-value function $Q_w$, parameterized by $w$.
• The soft state value function $V_\psi$, parameterized by $\psi$; theoretically we can infer $V$ by knowing $Q$ and $\pi$, but in practice, a separate value network helps stabilize the training.

    Soft Q-value and soft state value are defined as:

$$
\begin{aligned}
Q(s_t, a_t) &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim \rho_\pi(s)} [V(s_{t+1})] & \text{; according to the Bellman equation.} \\
\text{where } V(s_t) &= \mathbb{E}_{a_t \sim \pi} [Q(s_t, a_t) - \alpha \log \pi(a_t|s_t)] & \text{; soft state value function.}
\end{aligned}
$$
$$
\text{Thus, } Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{(s_{t+1}, a_{t+1}) \sim \rho_\pi} \big[ Q(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1}|s_{t+1}) \big]
$$

$\rho_\pi(s)$ and $\rho_\pi(s, a)$ denote the state and the state-action marginals of the state distribution induced by the policy $\pi(a|s)$; see the similar definitions in the DPG section.
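SAC is usually applied to continuous actions, where the soft value expectation is approximated by sampling; for a discrete action space the definition can be computed exactly, which makes it easy to sanity-check. A minimal sketch (function name is my own):

```python
import numpy as np

def soft_state_value(q_values, pi_probs, alpha=0.2):
    """Soft state value V(s) = E_{a~pi}[Q(s, a) - alpha * log pi(a|s)],
    computed exactly for a discrete action space.

    q_values: Q(s, .) per action; pi_probs: pi(.|s) per action.
    """
    logp = np.log(pi_probs)
    # The second term, -alpha * E[log pi], is alpha times the policy entropy.
    return np.dot(pi_probs, q_values - alpha * logp)
```

With a uniform policy over two actions and equal Q-values, this returns $Q + \alpha \log 2$, i.e. the value plus the entropy bonus.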

    The soft state value function is trained to minimize the mean squared error:

$$
\begin{aligned}
J_V(\psi) &= \mathbb{E}_{s_t \sim \mathcal{D}} \Big[ \frac{1}{2} \big( V_\psi(s_t) - \mathbb{E}[Q_w(s_t, a_t) - \log \pi_\theta(a_t|s_t)] \big)^2 \Big] \\
\text{with gradient: } \nabla_\psi J_V(\psi) &= \nabla_\psi V_\psi(s_t) \big( V_\psi(s_t) - Q_w(s_t, a_t) + \log \pi_\theta(a_t|s_t) \big)
\end{aligned}
$$

where $\mathcal{D}$ is the replay buffer.

    The soft Q function is trained to minimize the soft Bellman residual:

$$
\begin{aligned}
J_Q(w) &= \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \Big[ \frac{1}{2} \big( Q_w(s_t, a_t) - (r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim \rho_\pi(s)} [V_{\bar{\psi}}(s_{t+1})]) \big)^2 \Big] \\
\text{with gradient: } \nabla_w J_Q(w) &= \nabla_w Q_w(s_t, a_t) \big( Q_w(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\psi}}(s_{t+1}) \big)
\end{aligned}
$$

where $\bar{\psi}$ is the target value function, updated as an exponential moving average (or only updated periodically in a "hard" way), just like how the parameter of the target Q network is treated in DQN to stabilize the training.
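The exponential-moving-average (Polyak) target update is a one-liner; here is a tiny sketch with parameters represented as plain lists of floats:

```python
def ema_update(target_params, params, tau=0.005):
    """Soft target update: psi_bar <- tau * psi + (1 - tau) * psi_bar.
    A small tau means the target network changes slowly, which stabilizes
    the bootstrapped regression targets.
    """
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]
```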

    SAC updates the policy to minimize the KL-divergence:

$$
\begin{aligned}
\pi_{new} &= \arg\min_{\pi' \in \Pi} D_{KL} \Big( \pi'(\cdot|s_t) \,\Big\|\, \frac{\exp(Q^{\pi_{old}}(s_t, \cdot))}{Z^{\pi_{old}}(s_t)} \Big) \\
&= \arg\min_{\pi' \in \Pi} D_{KL} \big( \pi'(\cdot|s_t) \,\big\|\, \exp(Q^{\pi_{old}}(s_t, \cdot) - \log Z^{\pi_{old}}(s_t)) \big) \\
\text{objective for update: } J_\pi(\theta) &= \nabla_\theta D_{KL} \big( \pi_\theta(\cdot|s_t) \,\big\|\, \exp(Q_w(s_t, \cdot) - \log Z_w(s_t)) \big) \\
&= \mathbb{E}_{a_t \sim \pi} \Big[ -\log \Big( \frac{\exp(Q_w(s_t, a_t) - \log Z_w(s_t))}{\pi_\theta(a_t|s_t)} \Big) \Big] \\
&= \mathbb{E}_{a_t \sim \pi} \big[ \log \pi_\theta(a_t|s_t) - Q_w(s_t, a_t) + \log Z_w(s_t) \big]
\end{aligned}
$$

where $\Pi$ is the set of potential policies that we can model our policy as, in order to keep them tractable; for example, $\Pi$ can be the family of Gaussian mixture distributions, expensive to model but highly expressive and still tractable. $Z^{\pi_{old}}(s_t)$ is the partition function that normalizes the distribution. It is usually intractable but does not contribute to the gradient. How to minimize $J_\pi(\theta)$ depends on our choice of $\Pi$.

This update guarantees that $Q^{\pi_{new}}(s_t, a_t) \ge Q^{\pi_{old}}(s_t, a_t)$; please check the proof of this lemma in Appendix B.2 of the original paper.

    Once we have defined the objective functions and gradients for soft action-state value, soft state value and the policy network, the soft actor-critic algorithm is straightforward:

    SAC

    Fig. 6. The soft actor-critic algorithm. (Image source: original paper)

    SAC with Automatically Adjusted Temperature

    [paper|code]

    SAC is brittle with respect to the temperature parameter. Unfortunately it is difficult to adjust temperature, because the entropy can vary unpredictably both across tasks and during training as the policy becomes better. An improvement on SAC formulates a constrained optimization problem: while maximizing the expected return, the policy should satisfy a minimum entropy constraint:

$$
\max_{\pi_0, \dots, \pi_T} \mathbb{E} \Big[ \sum_{t=0}^{T} r(s_t, a_t) \Big] \quad \text{s.t. } \forall t,\ H(\pi_t) \ge H_0
$$

where $H_0$ is a predefined minimum policy entropy threshold.

The expected return $\mathbb{E}[\sum_{t=0}^{T} r(s_t, a_t)]$ can be decomposed into a sum of rewards at all the time steps. Because the policy $\pi_t$ at time t has no effect on the policy at the earlier time step $\pi_{t-1}$, we can maximize the return at different steps backward in time; this is essentially dynamic programming.

$$
\underbrace{\max_{\pi_0} \Big( \mathbb{E}[r(s_0, a_0)] + \underbrace{\max_{\pi_1} \big( \mathbb{E}[\dots] + \underbrace{\max_{\pi_T} \mathbb{E}[r(s_T, a_T)]}_{\text{1st maximization}} \big)}_{\text{second but last maximization}} \Big)}_{\text{last maximization}}
$$

where we consider $\gamma = 1$.

So we start the optimization from the last timestep $T$:

$$
\text{maximize } \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} [r(s_T, a_T)] \quad \text{s.t. } H(\pi_T) - H_0 \ge 0
$$

    First, let us define the following functions:

$$
\begin{aligned}
h(\pi_T) &= H(\pi_T) - H_0 = \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} [-\log \pi_T(a_T|s_T)] - H_0 \\
f(\pi_T) &= \begin{cases} \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} [r(s_T, a_T)], & \text{if } h(\pi_T) \ge 0 \\ -\infty, & \text{otherwise} \end{cases}
\end{aligned}
$$

    And the optimization becomes:

$$
\text{maximize } f(\pi_T) \quad \text{s.t. } h(\pi_T) \ge 0
$$

    To solve the maximization optimization with inequality constraint, we can construct a Lagrangian expression with a Lagrange multiplier (also known as “dual variable”), αTαT:

$$
L(\pi_T, \alpha_T) = f(\pi_T) + \alpha_T h(\pi_T)
$$

Consider the case when we try to minimize $L(\pi_T, \alpha_T)$ with respect to $\alpha_T$, given a particular value of $\pi_T$:

• If the constraint is satisfied, i.e. $h(\pi_T) \ge 0$, the best we can do is set $\alpha_T = 0$ since we have no control over the value of $f(\pi_T)$. Thus, $L(\pi_T, 0) = f(\pi_T)$.
• If the constraint is violated, i.e. $h(\pi_T) < 0$, we can achieve $L(\pi_T, \alpha_T) \to -\infty$ by taking $\alpha_T \to \infty$. Thus, $L(\pi_T, \infty) = -\infty = f(\pi_T)$.

    In either case, we can recover the following equation,

$$
f(\pi_T) = \min_{\alpha_T \ge 0} L(\pi_T, \alpha_T)
$$

At the same time, we want to maximize $f(\pi_T)$,

$$
\max_{\pi_T} f(\pi_T) = \min_{\alpha_T \ge 0} \max_{\pi_T} L(\pi_T, \alpha_T)
$$

Therefore, to maximize $f(\pi_T)$, the dual problem is listed below. Note that to make sure $\max_{\pi_T} f(\pi_T)$ is properly maximized and does not become $-\infty$, the constraint has to be satisfied.

$$
\begin{aligned}
\max_{\pi_T} \mathbb{E}[r(s_T, a_T)] &= \max_{\pi_T} f(\pi_T) \\
&= \min_{\alpha_T \ge 0} \max_{\pi_T} L(\pi_T, \alpha_T) \\
&= \min_{\alpha_T \ge 0} \max_{\pi_T} \big( f(\pi_T) + \alpha_T h(\pi_T) \big) \\
&= \min_{\alpha_T \ge 0} \max_{\pi_T} \Big( \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} [r(s_T, a_T)] + \alpha_T \big( \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} [-\log \pi_T(a_T|s_T)] - H_0 \big) \Big) \\
&= \min_{\alpha_T \ge 0} \max_{\pi_T} \Big( \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} [r(s_T, a_T) - \alpha_T \log \pi_T(a_T|s_T)] - \alpha_T H_0 \Big) \\
&= \min_{\alpha_T \ge 0} \max_{\pi_T} \Big( \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} [r(s_T, a_T) + \alpha_T H(\pi_T) - \alpha_T H_0] \Big)
\end{aligned}
$$

We can compute the optimal $\pi_T$ and $\alpha_T$ iteratively. First, given the current $\alpha_T$, get the best policy $\pi_T^*$ that maximizes $L(\pi_T, \alpha_T)$. Then plug in $\pi_T^*$ and compute $\alpha_T^*$ that minimizes $L(\pi_T^*, \alpha_T)$. Assuming we have one neural network for the policy and one for the temperature parameter, this iterative update process mirrors how we update network parameters during training.

$$
\begin{aligned}
\pi_T^* &= \arg\max_{\pi_T} \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} \big[ r(s_T, a_T) + \alpha_T H(\pi_T) - \alpha_T H_0 \big] \\
\alpha_T^* &= \arg\min_{\alpha_T \ge 0} \mathbb{E}_{(s_T, a_T) \sim \rho_{\pi^*}} \big[ \alpha_T H(\pi_T^*) - \alpha_T H_0 \big]
\end{aligned}
$$
$$
\text{Thus, } \max_{\pi_T} \mathbb{E}[r(s_T, a_T)] = \mathbb{E}_{(s_T, a_T) \sim \rho_{\pi^*}} \big[ r(s_T, a_T) + \alpha_T^* H(\pi_T^*) - \alpha_T^* H_0 \big]
$$

    Now let’s go back to the soft Q value function:

$$
\begin{aligned}
Q_{T-1}(s_{T-1}, a_{T-1}) &= r(s_{T-1}, a_{T-1}) + \mathbb{E} \big[ Q(s_T, a_T) - \alpha_T \log \pi(a_T|s_T) \big] \\
&= r(s_{T-1}, a_{T-1}) + \mathbb{E}[r(s_T, a_T)] + \alpha_T H(\pi_T) \\
Q_{T-1}^*(s_{T-1}, a_{T-1}) &= r(s_{T-1}, a_{T-1}) + \max_{\pi_T} \mathbb{E}[r(s_T, a_T)] + \alpha_T H(\pi_T^*) & \text{; plug in the optimal } \pi_T^*
\end{aligned}
$$

Therefore the expected return is as follows, when we take one step further back to time step $T-1$:

$$
\begin{aligned}
&\max_{\pi_{T-1}} \Big( \mathbb{E}[r(s_{T-1}, a_{T-1})] + \max_{\pi_T} \mathbb{E}[r(s_T, a_T)] \Big) \\
&= \max_{\pi_{T-1}} \big( Q_{T-1}^*(s_{T-1}, a_{T-1}) - \alpha_T^* H(\pi_T^*) \big) & \text{; should s.t. } H(\pi_{T-1}) - H_0 \ge 0 \\
&= \min_{\alpha_{T-1} \ge 0} \max_{\pi_{T-1}} \big( Q_{T-1}^*(s_{T-1}, a_{T-1}) - \alpha_T^* H(\pi_T^*) + \alpha_{T-1} (H(\pi_{T-1}) - H_0) \big) & \text{; dual problem w/ Lagrangian.} \\
&= \min_{\alpha_{T-1} \ge 0} \max_{\pi_{T-1}} \big( Q_{T-1}^*(s_{T-1}, a_{T-1}) + \alpha_{T-1} H(\pi_{T-1}) - \alpha_{T-1} H_0 \big) - \alpha_T^* H(\pi_T^*)
\end{aligned}
$$

    Similar to the previous step,

$$
\begin{aligned}
\pi_{T-1}^* &= \arg\max_{\pi_{T-1}} \mathbb{E}_{(s_{T-1}, a_{T-1}) \sim \rho_\pi} \big[ Q_{T-1}^*(s_{T-1}, a_{T-1}) + \alpha_{T-1} H(\pi_{T-1}) - \alpha_{T-1} H_0 \big] \\
\alpha_{T-1}^* &= \arg\min_{\alpha_{T-1} \ge 0} \mathbb{E}_{(s_{T-1}, a_{T-1}) \sim \rho_{\pi^*}} \big[ \alpha_{T-1} H(\pi_{T-1}^*) - \alpha_{T-1} H_0 \big]
\end{aligned}
$$

The equation for updating $\alpha_{T-1}$ has the same format as the equation for updating $\alpha_T$ above. By repeating this process, we can learn the optimal temperature parameter at every step by minimizing the same objective function:

$$
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t} \big[ -\alpha \log \pi_t(a_t|s_t) - \alpha H_0 \big]
$$
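Numerically, the gradient of this objective with respect to α is just the gap between the (sampled) policy entropy and $H_0$, which is what makes the temperature self-adjusting. A small sketch (function name is my own):

```python
import numpy as np

def temperature_gradient(log_pi, target_entropy):
    """Sample-based gradient of J(alpha) = E[-alpha * log pi(a_t|s_t) - alpha * H0]
    with respect to alpha; note it does not depend on the current alpha itself.

    log_pi:         log-probabilities of actions sampled from the current policy.
    target_entropy: the minimum entropy threshold H0.
    """
    # dJ/d_alpha = E[-log pi] - H0: positive when the entropy exceeds H0,
    # so gradient descent on J shrinks alpha once the policy is random enough.
    return np.mean(-log_pi) - target_entropy

# One plain SGD step on alpha, projected to stay non-negative:
# alpha = max(0.0, alpha - lr * temperature_gradient(log_pi, H0))
```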

The final algorithm is the same as SAC except for learning $\alpha$ explicitly with respect to the objective $J(\alpha)$ (see Fig. 7):

    SAC2

    Fig. 7. The soft actor-critic algorithm with automatically adjusted temperature. (Image source: original paper)

    TD3

    [paper|code]

    The Q-learning algorithm is commonly known to suffer from the overestimation of the value function. This overestimation can propagate through the training iterations and negatively affect the policy. This property directly motivated Double Q-learning and Double DQN: the action selection and Q-value update are decoupled by using two value networks.

Twin Delayed Deep Deterministic policy gradient (TD3; Fujimoto et al., 2018) applied a couple of tricks on DDPG to prevent the overestimation of the value function:

(1) Clipped Double Q-learning: In Double Q-learning, the action selection and Q-value estimation are made by two networks separately. In the DDPG setting, given two deterministic actors $(\mu_{\theta_1}, \mu_{\theta_2})$ with two corresponding critics $(Q_{w_1}, Q_{w_2})$, the Double Q-learning Bellman targets look like:

$$
\begin{aligned}
y_1 &= r + \gamma Q_{w_2}(s', \mu_{\theta_1}(s')) \\
y_2 &= r + \gamma Q_{w_1}(s', \mu_{\theta_2}(s'))
\end{aligned}
$$

    However, due to the slow changing policy, these two networks could be too similar to make independent decisions. The Clipped Double Q-learning instead uses the minimum estimation among two so as to favor underestimation bias which is hard to propagate through training:

$$
\begin{aligned}
y_1 &= r + \gamma \min_{i=1,2} Q_{w_i}(s', \mu_{\theta_1}(s')) \\
y_2 &= r + \gamma \min_{i=1,2} Q_{w_i}(s', \mu_{\theta_2}(s'))
\end{aligned}
$$
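The target construction itself is tiny; here is a sketch for a single transition (handling of terminal states is my own addition, matching common practice):

```python
def clipped_double_q_target(r, gamma, q1_next, q2_next, done=False):
    """TD3's Clipped Double Q-learning target: both critics regress toward
    a target built from the minimum of the two next-state estimates,
    which favors an underestimation bias over overestimation.
    """
    bootstrap = 0.0 if done else min(q1_next, q2_next)
    return r + gamma * bootstrap
```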

    (2) Delayed update of Target and Policy Networks: In the actor-critic model, policy and value updates are deeply coupled: Value estimates diverge through overestimation when the policy is poor, and the policy will become poor if the value estimate itself is inaccurate.

To reduce the variance, TD3 updates the policy at a lower frequency than the Q-function. The policy network stays the same until the value error is small enough after several updates. The idea is similar to how the periodically-updated target network stays a stable objective in DQN.

(3) Target Policy Smoothing: Given a concern that deterministic policies can overfit to narrow peaks in the value function, TD3 introduced a smoothing regularization strategy on the value function: adding a small amount of clipped random noise to the selected action and averaging over mini-batches.

$$
\begin{aligned}
y &= r + \gamma Q_w(s', \mu_\theta(s') + \epsilon) \\
\epsilon &\sim \text{clip}(\mathcal{N}(0, \sigma), -c, +c) & \text{; clipped random noise.}
\end{aligned}
$$

    This approach mimics the idea of SARSA update and enforces that similar actions should have similar values.
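A sketch of the smoothing step (clipping the final action back into a valid range $[-\text{act\_limit}, \text{act\_limit}]$ is an extra assumption here, matching common TD3 implementations for bounded action spaces):

```python
import numpy as np

def smoothed_target_action(mu_next, sigma=0.2, noise_clip=0.5,
                           act_limit=1.0, rng=None):
    """Target policy smoothing: perturb the target actor's output with
    clipped Gaussian noise before it is fed into the target critic.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(mu_next)),
                  -noise_clip, noise_clip)          # clipped random noise
    return np.clip(mu_next + eps, -act_limit, act_limit)
```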

    Here is the final algorithm:

    TD3

    Fig 8. TD3 Algorithm. (Image source: Fujimoto et al., 2018)

    SVPG

    [paper|code for SVPG]

Stein Variational Policy Gradient (SVPG; Liu et al., 2017) applies the Stein variational gradient descent (SVGD; Liu and Wang, 2016) algorithm to update the policy parameter $\theta$.

In the setup of maximum entropy policy optimization, $\theta$ is considered as a random variable $\theta \sim q(\theta)$ and the model is expected to learn this distribution $q(\theta)$. Assuming we have some prior knowledge of how $q$ might look, $q_0$, we would like to guide the learning process to not push $\theta$ too far away from $q_0$ by optimizing the following objective function:

$$
\hat{J}(\theta) = \mathbb{E}_{\theta \sim q} [J(\theta)] - \alpha D_{KL}(q \| q_0)
$$

where $\mathbb{E}_{\theta \sim q}[J(\theta)]$ is the expected reward when $\theta \sim q(\theta)$ and $D_{KL}$ is the KL divergence.

If we don't have any prior information, we might set $q_0$ as a uniform distribution and set $q_0(\theta)$ to a constant. Then the above objective function becomes SAC, where the entropy term encourages exploration:

$$
\begin{aligned}
\hat{J}(\theta) &= \mathbb{E}_{\theta \sim q} [J(\theta)] - \alpha D_{KL}(q \| q_0) \\
&= \mathbb{E}_{\theta \sim q} [J(\theta)] - \alpha \mathbb{E}_{\theta \sim q} [\log q(\theta) - \log q_0(\theta)] \\
&= \mathbb{E}_{\theta \sim q} [J(\theta)] + \alpha H(q(\theta))
\end{aligned}
$$

Let's take the derivative of $\hat{J}(\theta) = \mathbb{E}_{\theta \sim q}[J(\theta)] - \alpha D_{KL}(q \| q_0)$ w.r.t. $q$:

$$
\begin{aligned}
\nabla_q \hat{J}(\theta) &= \nabla_q \big( \mathbb{E}_{\theta \sim q} [J(\theta)] - \alpha D_{KL}(q \| q_0) \big) \\
&= \nabla_q \int_\theta \big( q(\theta) J(\theta) - \alpha q(\theta) \log q(\theta) + \alpha q(\theta) \log q_0(\theta) \big) \\
&= \int_\theta \big( J(\theta) - \alpha \log q(\theta) - \alpha + \alpha \log q_0(\theta) \big) = 0
\end{aligned}
$$

    The optimal distribution is:

$$
\log q^*(\theta) = \frac{1}{\alpha} J(\theta) + \log q_0(\theta) - 1
\quad \text{thus} \quad
\underbrace{q^*(\theta)}_{\text{"posterior"}} \propto \underbrace{\exp(J(\theta) / \alpha)}_{\text{"likelihood"}} \underbrace{q_0(\theta)}_{\text{prior}}
$$

The temperature $\alpha$ decides a tradeoff between exploitation and exploration. When $\alpha \to 0$, $\theta$ is updated only according to the expected return $J(\theta)$. When $\alpha \to \infty$, $\theta$ always follows the prior belief.

When using the SVGD method to estimate the target posterior distribution $q(\theta)$, it relies on a set of particles $\{\theta_i\}_{i=1}^n$ (independently trained policy agents), each of which is updated as:

$$
\theta_i \leftarrow \theta_i + \epsilon \phi^*(\theta_i)
\quad \text{where } \phi^* = \max_{\phi \in \mathcal{H}} \big\{ -\nabla_\epsilon D_{KL}\big( q'_{[\theta + \epsilon \phi(\theta)]} \,\|\, q \big) \text{ s.t. } \|\phi\|_{\mathcal{H}} \le 1 \big\}
$$

where $\epsilon$ is a learning rate and $\phi^*$ is a function in the unit ball of an RKHS (reproducing kernel Hilbert space) $\mathcal{H}$ of $\theta$-shaped value vectors that maximally decreases the KL divergence between the particles and the target distribution. $q'(\cdot)$ is the distribution of $\theta + \epsilon \phi(\theta)$.

    Comparing different gradient-based update methods:

Method: update space
• Plain gradient: $\Delta\theta$ on the parameter space
• Natural gradient: $\Delta\theta$ on the search distribution space
• SVGD: $\Delta\theta$ on the kernel function space

One estimation of $\phi^*$ has the following form, where a positive definite kernel $k(\vartheta, \theta)$, e.g. a Gaussian radial basis function, measures the similarity between particles.

$$
\begin{aligned}
\phi^*(\theta_i) &= \mathbb{E}_{\vartheta \sim q'} \big[ \nabla_\vartheta \log q(\vartheta) k(\vartheta, \theta_i) + \nabla_\vartheta k(\vartheta, \theta_i) \big] \\
&= \frac{1}{n} \sum_{j=1}^{n} \big[ \nabla_{\theta_j} \log q(\theta_j) k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i) \big] & \text{; approximate } q' \text{ with current particle values}
\end{aligned}
$$
• The first term encourages $\theta_i$ to move toward the high-probability regions of $q$ that are shared across similar particles. => to be similar to other particles
• The second term pushes particles away from each other and therefore diversifies the policy. => to be dissimilar to other particles
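The empirical estimate above can be written compactly, assuming an RBF kernel $k(x, y) = \exp(-\|x - y\|^2 / h)$ and that $\nabla \log q$ is available at every particle (function name and array layout are my own):

```python
import numpy as np

def svgd_direction(particles, grad_logq, h=1.0):
    """Empirical SVGD update direction phi*(theta_i) with an RBF kernel:
    an attractive, kernel-weighted log-density gradient term plus a
    repulsive kernel-gradient term.

    particles: (n, d) array of theta_i.
    grad_logq: (n, d) array of grad log q(theta_j) at each particle.
    """
    n = particles.shape[0]
    diff = particles[:, None, :] - particles[None, :, :]  # theta_i - theta_j
    sq_dist = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq_dist / h)                              # k(theta_j, theta_i)
    # grad_{theta_j} k(theta_j, theta_i) = (2/h) * (theta_i - theta_j) * k
    grad_K = (2.0 / h) * diff * K[:, :, None]
    return (K @ grad_logq + grad_K.sum(axis=1)) / n
```

When all particles coincide, the repulsive term vanishes and each particle simply follows the averaged log-density gradient; as particles spread out, the kernel-gradient term keeps them from collapsing onto one mode.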

    SVPG

Usually the temperature $\alpha$ follows an annealing scheme so that the training process does more exploration at the beginning but more exploitation at a later stage.

    Quick Summary

    After reading through all the algorithms above, I list a few building blocks or principles that seem to be common among them:

    • Try to reduce the variance and keep the bias unchanged to stabilize learning.
    • Off-policy gives us better exploration and helps us use data samples more efficiently.
    • Experience replay (training data sampled from a replay memory buffer);
    • Target network that is either frozen periodically or updated slower than the actively learned policy network;
    • Batch normalization;
    • Entropy-regularized reward;
    • The critic and actor can share lower layer parameters of the network and two output heads for policy and value functions.
    • It is possible to learn with deterministic policy rather than stochastic one.
    • Put constraint on the divergence between policy updates.
    • New optimization methods (such as K-FAC).
    • Entropy maximization of the policy helps encourage exploration.
    • Try not to overestimate the value function.

    Cited as:

    @article{weng2018PG,
      title   = "Policy Gradient Algorithms",
      author  = "Weng, Lilian",
      journal = "lilianweng.github.io/lil-log",
      year    = "2018",
      url     = "https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html"
    }
    

    References

    [1] jeremykun.com Markov Chain Monte Carlo Without all the Bullshit

    [2] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction; 2nd Edition. 2017.

    [3] John Schulman, et al. “High-dimensional continuous control using generalized advantage estimation.” ICLR 2016.

    [4] Thomas Degris, Martha White, and Richard S. Sutton. “Off-policy actor-critic.” ICML 2012.

    [5] timvieira.github.io Importance sampling

    [6] Mnih, Volodymyr, et al. “Asynchronous methods for deep reinforcement learning.” ICML. 2016.

    [7] David Silver, et al. “Deterministic policy gradient algorithms.” ICML. 2014.

    [8] Timothy P. Lillicrap, et al. “Continuous control with deep reinforcement learning.” arXiv preprint arXiv:1509.02971 (2015).

    [9] Ryan Lowe, et al. “Multi-agent actor-critic for mixed cooperative-competitive environments.” NIPS. 2017.

    [10] John Schulman, et al. “Trust region policy optimization.” ICML. 2015.

    [11] Ziyu Wang, et al. “Sample efficient actor-critic with experience replay.” ICLR 2017.

    [12] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. “Safe and efficient off-policy reinforcement learning” NIPS. 2016.

    [13] Yuhuai Wu, et al. “Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation.” NIPS. 2017.

[14] kvfrans.com An intuitive explanation of natural gradient descent

    [15] Sham Kakade. “A Natural Policy Gradient.”. NIPS. 2002.

    [16] “Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients.” - Seita’s Place, Mar 2017.

    [17] “Notes on the Generalized Advantage Estimation Paper.” - Seita’s Place, Apr, 2017.

    [18] Gabriel Barth-Maron, et al. “Distributed Distributional Deterministic Policy Gradients.” ICLR 2018 poster.

    [19] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv preprint arXiv:1801.01290 (2018).

    [20] Scott Fujimoto, Herke van Hoof, and Dave Meger. “Addressing Function Approximation Error in Actor-Critic Methods.” arXiv preprint arXiv:1802.09477 (2018).

    [21] Tuomas Haarnoja, et al. “Soft Actor-Critic Algorithms and Applications.” arXiv preprint arXiv:1812.05905 (2018).

    [22] David Knowles. “Lagrangian Duality for Dummies” Nov 13, 2010.

    [23] Yang Liu, et al. “Stein variational policy gradient.” arXiv preprint arXiv:1704.02399 (2017).

    [24] Qiang Liu and Dilin Wang. “Stein variational gradient descent: A general purpose bayesian inference algorithm.” NIPS. 2016.
