4/2 Proximal Policy Optimization (PPO)

Recall that by the Policy Gradient Theorem, we have

\[ \nabla_\theta\mathbf E_\theta G_0 = \mathbf E_\theta Q_\pi(S, A)\nabla_\theta\log\pi(A|S). \]

So, in an optimization algorithm based on this result, we need to provide some approximation to the action-value function $Q_\pi$.

In REINFORCE, we sample a full trajectory $s_0, a_0, r_1, s_1,\dotsc, r_T, s_T$ and then take the discounted returns $\hat g_t=\sum_{k=0}^{T-t-1}\gamma^k r_{t+k+1}$ as estimates of the values $Q_\pi(s_t, a_t)$.
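As a minimal sketch of the return computation above, the following NumPy snippet evaluates $\hat g_t$ for one finished episode. The function name `discounted_returns` and the convention that `rewards[t]` stores $r_{t+1}$ are assumptions made for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return g[t] = sum_k gamma**k * rewards[t + k] for one finished episode.

    Assumed convention: rewards[t] holds r_{t+1}, the reward that follows a_t.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so that each step reuses its successor's return:
    # g_t = r_{t+1} + gamma * g_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example_returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(example_returns)  # values 1.5, 1.0, 2.0
```

The backward recursion needs one pass over the episode, whereas evaluating the sum separately for every $t$ would take quadratic time.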

Today, we'll study Proximal Policy Optimization (PPO) [1], one of the most popular on-policy DRL algorithms. Its policy update rule derives from another theoretical approach, which we now present.

Conservative Policy Iteration

For policies $\pi$ and $\pi'$, we have [2, Appendix A, Lemma 1]

$$ \mathbf E_{\pi'} G_0 = \mathbf E_\pi G_0 + \mathbf E_{\pi'} \sum_{t=0}^\infty \gamma^t A_\pi(S_t, A_t) $$ where $A_\pi(S, A) = Q_\pi(S, A) - V_\pi(S)$ is the advantage function: it signifies the expected advantage of choosing action $A$ in state $S$.

Given a parametrized policy $\pi_{\theta_0}$, we want to choose an updated parameter $\theta$. The expectation on the right-hand side above is taken over trajectories of the new policy $\pi_\theta$, which we cannot sample before $\theta$ has been chosen. However, we can sample from the original policy instead and define

$$ L^{CPI}(\theta) = \mathbf E_{\theta_0}\sum_{t=0}^\infty\gamma^t r_t(\theta) A_{\pi_{\theta_0}}(S_t,A_t), $$ where $r_t(\theta)=\frac{\pi_\theta(A_t|S_t)}{\pi_{\theta_0}(A_t|S_t)}$ is the probability ratio (not to be confused with the reward at timestep $t$). This matches the true objective to first order at $\theta=\theta_0$ [3, Section 4.1]:

$$ \nabla_\theta\mathbf E_\theta G_0\big|_{\theta=\theta_0}=\nabla_\theta L^{CPI}(\theta)\big|_{\theta=\theta_0}. $$ Here, CPI stands for Conservative Policy Iteration.
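The agreement of the gradients at $\theta=\theta_0$ can be checked in two lines; this is a sketch under the usual assumption that differentiation and expectation may be exchanged. The ratio depends on $\theta$ only through its numerator, so

\[ \nabla_\theta r_t(\theta)\big|_{\theta=\theta_0} = \frac{\nabla_\theta\pi_\theta(A_t|S_t)\big|_{\theta=\theta_0}}{\pi_{\theta_0}(A_t|S_t)} = \nabla_\theta\log\pi_\theta(A_t|S_t)\big|_{\theta=\theta_0}, \]

and therefore

\[ \nabla_\theta L^{CPI}(\theta)\big|_{\theta=\theta_0} = \mathbf E_{\theta_0}\sum_{t=0}^\infty\gamma^t A_{\pi_{\theta_0}}(S_t,A_t)\,\nabla_\theta\log\pi_\theta(A_t|S_t)\big|_{\theta=\theta_0}. \]

This has the form of a policy gradient with the advantage in place of $Q_\pi$. Subtracting $V_\pi(S_t)$ does not change the expectation, because $\mathbf E_{A\sim\pi_\theta(\cdot|S)}\nabla_\theta\log\pi_\theta(A|S)=\nabla_\theta\sum_a\pi_\theta(a|S)=0$.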

Clipped Surrogate Objective

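So, a sufficiently small gradient ascent step from $\theta_0$ in the direction $\nabla_\theta L^{CPI}(\theta)\big|_{\theta=\theta_0}$ improves the objective $\mathbf E_\theta G_0$.

PPO offers an easy-to-implement surrogate objective that agrees with $L^{CPI}$ near $\theta_0$ but removes the incentive to move the ratio $r_t(\theta)$ outside the interval $[1-\epsilon, 1+\epsilon]$: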

$$ L^{CLIP}(\theta)=\mathbf E_{\theta_0}\sum_{t=0}^\infty\gamma^t \min( r_t(\theta)A_{\pi_{\theta_0}}(S_t,A_t), \mathop{\mathrm{clip}}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_{\pi_{\theta_0}}(S_t,A_t) ) $$ where:

we have $\mathop{\mathrm{clip}}(x, a, b)=\min(\max(x, a), b)$ and
the clip coefficient $\epsilon$ is a new hyperparameter (not to be confused with the $\epsilon$ in Adam).
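The per-sample term inside $L^{CLIP}$ can be sketched in NumPy as follows. The names `clipped_terms`, `logp_new`, `logp_old` and the default `clip_eps=0.2` are assumptions for illustration. The sketch also averages over samples and drops the $\gamma^t$ weights, which is a common simplification in implementations rather than something stated in the formula above.

```python
import numpy as np

def clipped_terms(logp_new, logp_old, advantages, clip_eps=0.2):
    """Per-sample min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.exp(logp_new - logp_old)  # r_t(theta), computed from log-probabilities
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Sample average of the clipped terms (to be maximized)."""
    return np.mean(clipped_terms(logp_new, logp_old, advantages, clip_eps))

# Three samples: ratio 1.5 with A > 0, ratio 0.5 with A < 0, ratio 1.5 with A < 0.
logp_old = np.zeros(3)
logp_new = np.log(np.array([1.5, 0.5, 1.5]))
advantages = np.array([1.0, -1.0, -1.0])
print(clipped_terms(logp_new, logp_old, advantages))  # approximately 1.2, -0.8, -1.5
```

In the first two samples the ratio has already moved in the direction the advantage favours, so the clipped value is used and the term no longer depends on $\theta$. In the third sample the ratio moved the wrong way, and the unclipped, more pessimistic value is kept.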

Generalized Advantage Estimation [4]

Actor-Critic Algorithms

To implement an algorithm using $L^{CLIP}$, we need to find a way to estimate the advantage $A_\pi(S, A)$. To this end, we shall introduce a value function $V$ that estimates the true value function $V_\pi$. That is, we will be using a so-called actor-critic algorithm:

the actor is the policy and
the critic is the value function.
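To make the two roles concrete, here is a minimal sketch with a linear softmax actor and a linear critic. All names, the dimensions, and the linear parametrization are hypothetical; practical PPO agents typically use neural networks for both.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions = 4, 2  # hypothetical problem sizes

# Hypothetical parameters of the actor and the critic.
actor_weights = rng.normal(scale=0.1, size=(obs_dim, n_actions))
critic_weights = rng.normal(scale=0.1, size=obs_dim)

def policy(obs):
    """Actor: action probabilities pi(. | obs) from a softmax over linear logits."""
    logits = obs @ actor_weights
    logits = logits - logits.max()  # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def value(obs):
    """Critic: scalar estimate V(obs) of the expected return from obs."""
    return float(obs @ critic_weights)

obs = rng.normal(size=obs_dim)
probs = policy(obs)
action = rng.choice(n_actions, p=probs)  # the actor decides what to do
print(probs, action, value(obs))         # the critic judges the state
```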

Monte Carlo Advantage

We can adapt the estimate $Q_\pi(S_t,A_t)\approx\hat g_t$ we used in REINFORCE to a Monte Carlo sampling advantage estimate

\[ A_\pi(S_t,A_t)\approx\hat g_t-V(S_t). \]

TD Residual

On the other end of the spectrum, we have the so-called temporal difference (TD) residual

\[ \delta_t^V = r_{t+1} + \gamma V(S_{t+1}) - V(S_t). \]
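A short NumPy sketch of the TD residual over a trajectory segment. The function name and the array layout are assumptions: `rewards[i]` holds $r_{i+1}$, and `values` holds the critic's estimates for every visited state, including the state after the last step.

```python
import numpy as np

def td_residuals(rewards, values, gamma):
    """delta[i] = rewards[i] + gamma * values[i + 1] - values[i].

    rewards has length n; values has length n + 1.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

# The final value is 0.0 here, as it would be for a terminal state.
deltas = td_residuals([1.0, 0.0, 2.0], [0.5, 1.0, 1.0, 0.0], gamma=0.5)
print(deltas)  # values 1.0, -0.5, 1.0
```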

$k$-Step TD

Monte Carlo estimates have lower bias but higher variance; TD residuals have higher bias but lower variance. To balance between these extremes, we can use $k$-step advantage estimates, which sum $k$ discounted TD residuals:

\[ \hat A^{(k)}_t:=\sum_{l=0}^{k-1}\gamma^l\delta_{t+l}^V =-V(S_t)+r_{t+1}+\gamma r_{t+2}+\dotsb+\gamma^{k-1}r_{t+k} + \gamma^k V(S_{t+k}). \]
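The second equality holds because the value terms telescope. The following sketch checks it numerically, using the indexing of the TD residual above, where the reward following $a_t$ is $r_{t+1}$. Both function names and the sample numbers are made up for illustration.

```python
def k_step_advantage(rewards, values, gamma, t, k):
    """Sum of k discounted TD residuals starting at step t."""
    total = 0.0
    for l in range(k):
        delta = rewards[t + l] + gamma * values[t + l + 1] - values[t + l]
        total += gamma**l * delta
    return total

def k_step_direct(rewards, values, gamma, t, k):
    """Telescoped form: k discounted rewards plus a bootstrap, minus V(S_t)."""
    discounted_rewards = sum(gamma**l * rewards[t + l] for l in range(k))
    return -values[t] + discounted_rewards + gamma**k * values[t + k]

rewards = [1.0, 0.0, 2.0]      # rewards[i] holds r_{i+1}
values = [0.5, 1.0, 1.0, 0.0]  # values[i] holds V(s_i)
for k in (1, 2, 3):
    by_residuals = k_step_advantage(rewards, values, 0.5, t=0, k=k)
    by_formula = k_step_direct(rewards, values, 0.5, t=0, k=k)
    print(k, by_residuals, by_formula)  # the two columns agree for every k
```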

Generalized Advantage Estimator

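A Generalized Advantage Estimator (GAE) is an exponentially weighted average of the $k$-step estimates $\hat A^{(k)}_t$:

\[ \hat A^{\mathrm{GAE}(\gamma,\lambda)}_t=(1-\lambda) (\hat A_t^{(1)} + \lambda\hat A^{(2)}_t + \lambda^2\hat A^{(3)}_t+\dotsb) =\sum_{l=0}^\infty(\gamma\lambda)^l\delta^V_{t+l}. \]

We have a new hyperparameter $\lambda\in[0,1]$: the GAE coefficient. Setting $\lambda=0$ recovers the TD residual $\delta_t^V$, while $\lambda=1$ gives the sum $\sum_{l=0}^\infty\gamma^l\delta^V_{t+l}$, that is, the discounted return minus $V(S_t)$, as in the Monte Carlo advantage.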
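In practice the sum is usually evaluated with the backward recursion $\hat A_t=\delta^V_t+\gamma\lambda\hat A_{t+1}$. The sketch below assumes a finite trajectory segment that does not cross an episode boundary and truncates the infinite sum at the end of the segment; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Truncated GAE over one trajectory segment.

    rewards[i] holds r_{i+1}; values has one more entry than rewards, the last
    one being the bootstrap value of the final state (0.0 if it is terminal).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    # A_t = delta_t + gamma * lam * A_{t+1}, evaluated from the end backwards.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.0, 0.0]  # the final state is terminal
print(gae(rewards, values, gamma=0.5, lam=0.0))  # equals the TD residuals
print(gae(rewards, values, gamma=0.5, lam=1.0))  # equals returns minus values
```

With these numbers the TD residuals are 1.0, -0.5, 1.0 and the discounted returns are 1.5, 1.0, 2.0, so the two printed arrays illustrate the two extreme cases of $\lambda$.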

PPO Training Algorithm

PPO training proceeds in iterations. Each iteration consists of the following steps:

Given our present policy and value function, which we denote by $\pi_{\theta_\mathrm{old}}$ and $V_{\theta_\mathrm{old}}$ (these play the role of $\theta_0$ above), we run a number of environments in parallel to generate a number of partial trajectories

\[ s_t,a_t,r_{t+1},s_{t+1},\dotsc,r_u,s_u. \]
1. To support batching, when an environment terminates, we immediately reset it and output the new starting observation as the next step.
On this data, we calculate the advantage estimates $\hat A^{\mathrm{GAE}(\gamma,\lambda)}_t$.
We train for a given number of epochs on this dataset.
1. The value function is trained with a loss that is clipped in a similar way to the policy objective [5]:
$$ L^V(\theta)=\min( (V_\theta-V_\mathrm{targ})^2, (\mathop{\mathrm{clip}}( V_\theta,V_\mathrm{old}-\epsilon,V_\mathrm{old}+\epsilon ) - V_\mathrm{targ})^2 ) $$ where, for a timestep $k$ with $t\le k< u$, the target is the discounted sum of the rewards that follow $s_k$, bootstrapped with the old value function if the episode has not ended by the end of the partial trajectory:

\[ V_\mathrm{targ} = \begin{cases} \sum_{l=k}^{T-1}\gamma^{l-k} r_{l+1} & \text{if the episode terminates at } T\le u, \\ \sum_{l=k}^{u-1}\gamma^{l-k} r_{l+1} + \gamma^{u-k} V_\mathrm{old}(s_u) & \text{otherwise.} \end{cases} \]
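The targets and the clipped value loss can be sketched as follows. The sketch assumes that targets are discounted relative to the current step, that a `dones` flag marks the step after which an environment was reset, and that the loss is averaged over samples; all function and argument names are hypothetical. The loss uses $\min$ as in the formula above, although some PPO implementations take the maximum of the two terms instead, so it is worth checking the code base at hand.

```python
import numpy as np

def bootstrapped_targets(rewards, dones, last_value, gamma):
    """V_targ for every step of one environment's partial trajectory.

    rewards[i] is the reward after step i, dones[i] is 1.0 if that step ended
    the episode, last_value is V_old of the state after the last step.
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    targets = np.zeros(len(rewards))
    running = last_value
    for i in reversed(range(len(rewards))):
        # After a terminal step, nothing from the next episode leaks backwards.
        running = rewards[i] + gamma * running * (1.0 - dones[i])
        targets[i] = running
    return targets

def clipped_value_loss(v_new, v_old, v_targ, clip_eps):
    """Sample mean of min((V - V_targ)^2, (clip(V) - V_targ)^2)."""
    v_new, v_old, v_targ = (np.asarray(x, dtype=float) for x in (v_new, v_old, v_targ))
    v_clipped = np.clip(v_new, v_old - clip_eps, v_old + clip_eps)
    return np.mean(np.minimum((v_new - v_targ) ** 2, (v_clipped - v_targ) ** 2))

# An episode ends after the second step; the third step starts a new episode.
targets = bootstrapped_targets([1.0, 2.0, 3.0], [0.0, 1.0, 0.0], last_value=4.0, gamma=0.5)
print(targets)  # values 2.0, 2.0, 5.0
print(clipped_value_loss([1.0, 3.0], [1.0, 1.0], [2.0, 2.0], clip_eps=0.5))  # 0.625
```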

References

[1] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov: Proximal Policy Optimization Algorithms, 2017.

[2] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan and Philipp Moritz: Trust Region Policy Optimization, 2015. Proceedings of the 32nd International Conference on Machine Learning, PMLR vol. 37, pp. 1889--1897.

[3] Sham Kakade and John Langford: Approximately Optimal Approximate Reinforcement Learning, 2002. Proceedings of the Nineteenth International Conference on Machine Learning (ICML '02). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 267--274.

[4] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan and Pieter Abbeel: High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2016. International Conference on Learning Representations (ICLR).

[5] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph and Aleksander Madry: Implementation Matters in Deep RL: A Case Study on PPO and TRPO, 2019. International Conference on Learning Representations (ICLR).