4/2 Proximal Policy Optimization (PPO)

Recall that by the Policy Gradient Theorem, we have

\[ \nabla_\theta\mathbf E_\theta G_0 = \mathbf E_\theta Q_\pi(S, A)\nabla_\theta\log\pi(A|S). \]

So, in an optimization algorithm based on this result, we need to provide some approximation to the action-value function \(Q_\pi\).

In REINFORCE, we

  1. sample a full trajectory \(s_0, a_0, r_1, s_1,\dotsc, r_T, s_T\) and then
  2. take the discounted returns \(\hat g_t=\sum_{k=0}^{T-t-1}\gamma^k r_{t+k+1}\) as estimates of the values \(Q_\pi(s_t, a_t)\).
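
As a minimal sketch of step 2, the returns can be computed in a single backward pass using \(\hat g_t = r_{t+1} + \gamma\hat g_{t+1}\). The function name `discounted_returns` and the convention that `rewards[t]` stores \(r_{t+1}\) are assumptions made for this illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [g_0, ..., g_{T-1}], where rewards[t] holds r_{t+1} (assumed layout)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that g_t = r_{t+1} + gamma * g_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, 2.0]  # r_1, r_2, r_3
print(discounted_returns(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
```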

Today, we'll study Proximal Policy Optimization (PPO) [1], one of the most popular on-policy DRL algorithms. Its policy update rule derives from another theoretical approach, which we now present.

Conservative Policy Iteration

For policies \(\pi\) and \(\pi'\), we have [2, Appendix A, Lemma 1]

$$ \mathbf E_{\pi'} G_0 = \mathbf E_\pi G_0 + \mathbf E_{\pi'} \sum_{t=0}^\infty \gamma^t A_\pi(S_t, A_t) $$ where \(A_\pi(S, A) = Q_\pi(S, A) - V_\pi(S)\) is the advantage function: it signifies the expected advantage of choosing action \(A\) in state \(S\).

Given a parametrized policy \(\pi_{\theta_0}\), we want to find an updated parameter \(\theta\) that improves on it. Since \(\theta\) is not known yet, we cannot sample trajectories from \(\pi_\theta\), so the expectation \(\mathbf E_{\pi_\theta}\) above cannot be estimated directly. However, if we sample from the original policy and define

$$ L^{CPI}(\theta) = \mathbf E_{\theta_0}\sum_{t=0}^\infty\gamma^t r_t(\theta) A_{\pi_{\theta_0}}(S_t,A_t), $$ where \(r_t(\theta)=\frac{\pi_\theta(A_t|S_t)}{\pi_{\theta_0}(A_t|S_t)}\) is the probability ratio (not to be confused with the reward at timestep \(t\)), then we get a first-order approximation of the objective at \(\theta=\theta_0\) [3, Section 4.1]:

$$ \nabla_\theta\mathbf E_\theta G_0\big|_{\theta=\theta_0}=\nabla_\theta L^{CPI}(\theta)\big|_{\theta=\theta_0}. $$ Here, CPI stands for Conservative Policy Iteration.
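
A short sketch of why the two gradients agree at \(\theta_0\), written per timestep. The ratio is the only factor of \(L^{CPI}\) that depends on \(\theta\), and

\[ \nabla_\theta r_t(\theta)\big|_{\theta=\theta_0} = \frac{\nabla_\theta\pi_\theta(A_t|S_t)\big|_{\theta=\theta_0}}{\pi_{\theta_0}(A_t|S_t)} = \nabla_\theta\log\pi_\theta(A_t|S_t)\big|_{\theta=\theta_0}. \]

Hence

\[ \nabla_\theta L^{CPI}(\theta)\big|_{\theta=\theta_0} = \mathbf E_{\theta_0}\sum_{t=0}^\infty\gamma^t A_{\pi_{\theta_0}}(S_t,A_t)\,\nabla_\theta\log\pi_\theta(A_t|S_t)\big|_{\theta=\theta_0}, \]

which is the policy gradient expression with \(Q_\pi\) replaced by \(A_\pi = Q_\pi - V_\pi\). Subtracting \(V_\pi\) does not change the expectation, because \(\sum_a\nabla_\theta\pi_\theta(a|s)=\nabla_\theta 1=0\) for every state \(s\).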

Clipped Surrogate Objective

So, a sufficiently small gradient ascent step from \(\theta_0\) in the direction \(\nabla_\theta L^{CPI}(\theta)\big|_{\theta=\theta_0}\) improves the objective \(\mathbf E_\theta G_0\).

PPO offers an easy-to-implement objective whose gradients give such steps, while removing the incentive to move the ratio \(r_t(\theta)\) far from 1:

$$ L^{CLIP}(\theta)=\mathbf E_{\theta_0}\sum_{t=0}^\infty\gamma^t \min\bigl( r_t(\theta)A_{\pi_{\theta_0}}(S_t,A_t), \mathop{\mathrm{clip}}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_{\pi_{\theta_0}}(S_t,A_t) \bigr) $$ where:

  1. we have \(\mathop{\mathrm{clip}}(x, a, b)=\min(\max(x, a), b)\) and
  2. the clip coefficient \(\epsilon\) is a new hyperparameter (not to be confused with the \(\epsilon\) in Adam).
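
The clipped objective can be sketched on a batch of samples as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the inputs are log-probabilities of the taken actions under the new and old policies, the default `eps=0.2` is only an example value, and the estimate averages over samples without the \(\gamma^t\) weights.

```python
import math


def clip(x, lo, hi):
    """clip(x, a, b) = min(max(x, a), b)."""
    return min(max(x, lo), hi)


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Sample mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r = pi_theta(a|s) / pi_theta0(a|s), from log-probs.
        ratio = math.exp(lp_new - lp_old)
        unclipped = ratio * adv
        clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * adv
        total += min(unclipped, clipped)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage is capped at 1 + eps = 1.2.
print(clipped_surrogate([math.log(1.5)], [0.0], [1.0]))
# Ratio 1.5 with a negative advantage is not capped: the min keeps r * A.
print(clipped_surrogate([math.log(1.5)], [0.0], [-1.0]))
```

Taking the minimum makes the objective a pessimistic bound: clipping only removes the gain from moving the ratio further in the direction the advantage favours, never the penalty for moving it the wrong way.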

Generalized Advantage Estimation [4]

Actor-Critic Algorithms

To implement an algorithm using \(L^{CLIP}\), we need to find a way to estimate the advantage \(A_\pi(S, A)\). To this end, we shall introduce a value function \(V\) that estimates the true value function \(V_\pi\). That is, we will be using a so-called actor-critic algorithm:

  1. the actor is the policy and
  2. the critic is the value function.

Monte Carlo Advantage

We can adapt the estimate \(Q_\pi(S_t,A_t)\approx\hat g_t\) we used in REINFORCE to a Monte Carlo sampling advantage estimate

\[ A_\pi(S_t,A_t)\approx\hat g_t-V(S_t). \]

TD Residual

On the other end of the spectrum, we have the so-called temporal difference (TD) residual

\[ \delta_t^V = r_{t+1} + \gamma V(S_{t+1}) - V(S_t). \]
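
A small worked example of the TD residual, assuming (for illustration only) that `rewards[t]` stores \(r_{t+1}\) and `values[t]` stores \(V(S_t)\), with one more value than rewards so that \(V(S_{t+1})\) is always available:

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_{t+1} + gamma * V(S_{t+1}) - V(S_t) for every step t."""
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


rewards = [1.0, 0.0]       # r_1, r_2
values = [0.5, 1.0, 0.0]   # V(S_0), V(S_1), V(S_2)
print(td_residuals(rewards, values, gamma=0.5))  # [1.0, -1.0]
```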

\(k\)-Step TD

Monte Carlo estimates have lower bias but higher variance; TD residuals have lower variance but higher bias, since they rely on the approximate value function \(V\). To balance between these extremes, we can use \(k\)-step advantage estimates, which sum \(k\) discounted TD residuals:

\[ \hat A^{(k)}_t:=\sum_{l=0}^{k-1}\gamma^l\delta_{t+l}^V =-V(S_t)+r_{t+1}+\gamma r_{t+2}+\dotsb+\gamma^{k-1}r_{t+k} + \gamma^k V(S_{t+k}). \]
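
The two forms of the \(k\)-step estimate agree because the intermediate value terms telescope. The sketch below checks this numerically on made-up numbers; the function names and the layout (`rewards[t]` stores \(r_{t+1}\), `values[t]` stores \(V(S_t)\)) are assumptions for illustration.

```python
def k_step_from_residuals(rewards, values, gamma, t, k):
    """Sum of discounted TD residuals: sum over l < k of gamma^l * delta_{t+l}."""
    total = 0.0
    for l in range(k):
        delta = rewards[t + l] + gamma * values[t + l + 1] - values[t + l]
        total += gamma**l * delta
    return total


def k_step_direct(rewards, values, gamma, t, k):
    """Same quantity as discounted rewards plus a bootstrapped value, minus V(S_t)."""
    discounted = sum(gamma**l * rewards[t + l] for l in range(k))
    return -values[t] + discounted + gamma**k * values[t + k]


rewards = [1.0, 2.0, 3.0, 4.0]
values = [0.1, 0.2, 0.3, 0.4, 0.5]
a = k_step_from_residuals(rewards, values, gamma=0.9, t=1, k=3)
b = k_step_direct(rewards, values, gamma=0.9, t=1, k=3)
print(abs(a - b) < 1e-9)  # True: the two forms coincide up to rounding
```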

Generalized Advantage Estimator

A Generalized Advantage Estimator (GAE) is an exponentially weighted average of the \(k\)-step estimates \(\hat A^{(k)}_t\):

\[ \hat A^{\mathrm{GAE}(\gamma,\lambda)}_t=(1-\lambda) (\hat A_t^{(1)} + \lambda\hat A^{(2)}_t + \lambda^2\hat A^{(3)}_t+\dotsb) =\sum_{l=0}^\infty(\gamma\lambda)^l\delta^V_{t+l}. \]

We have a new hyperparameter \(\lambda\in[0,1]\): the GAE coefficient.
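
On a finite trajectory, the sum can be evaluated with the backward recursion \(\hat A_t = \delta^V_t + \gamma\lambda\hat A_{t+1}\). The following is a minimal sketch under the assumptions that residuals beyond the end of the trajectory are treated as zero, that `rewards[t]` stores \(r_{t+1}\), and that `values[t]` stores \(V(S_t)\); the function name `gae` is hypothetical.

```python
def gae(rewards, values, gamma, lam):
    """Truncated GAE: A_t = sum over l of (gamma * lam)^l * delta_{t+l}."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 0.25, 0.0]  # last entry is the value of a terminal state
print(gae(rewards, values, gamma=0.5, lam=0.0))  # [1.0, -0.875, 1.75]
print(gae(rewards, values, gamma=0.5, lam=1.0))  # [1.0, 0.0, 1.75]
```

With \(\lambda=0\) the result is the plain TD residual, and with \(\lambda=1\) on a terminated episode it is the Monte Carlo advantage \(\hat g_t - V(S_t)\), so \(\lambda\) interpolates between the two extremes.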

PPO Training Algorithm

PPO Training proceeds in iterations. Each iteration is the sequence of the following steps:

  1. Given our present policy and value functions, which we denote by \(\pi_{\theta_\mathrm{old}}\) and \(V_{\theta_\mathrm{old}}\), we run a number of environments in parallel to generate a number of partial trajectories

    \[ s_t,a_t,r_{t+1},s_{t+1},\dotsc,r_u,s_u. \]
    1. To support batching, when an environment terminates, we immediately reset it and output the new starting observation as the next step.
  2. On this data, we calculate the advantage estimates \(\hat A^{\mathrm{GAE}(\gamma,\lambda)}_t\).

  3. We train for a given number of epochs on this dataset.

    1. The value function is trained with a loss function that is clipped similarly to the policy objective [5]:

    $$ L^V(\theta)=\min\bigl( (V_\theta-V_\mathrm{targ})^2, (\mathop{\mathrm{clip}}( V_\theta,V_\mathrm{old}-\epsilon,V_\mathrm{old}+\epsilon ) - V_\mathrm{targ})^2 \bigr) $$ where, at timestep \(k\) with \(t\le k< u\), the target is the discounted return from \(k\) onward, bootstrapped from the old value function if the trajectory was cut off before the episode terminated:

    \[ V_\mathrm{targ} = \begin{cases} \sum_{l=k}^{T-1}\gamma^{l-k} r_{l+1} & \text{if the episode terminates at } T\le u \\ \sum_{l=k}^{u-1}\gamma^{l-k} r_{l+1} + \gamma^{u-k} V_\mathrm{old}(s_u) & \text{otherwise.} \end{cases} \]
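
The value targets for one environment's partial trajectory can be sketched as a backward pass that discounts relative to the current timestep and does not bootstrap across an episode boundary. Everything named here is an assumption for illustration: `rewards[i]` is the reward received after step `i` of the segment, `dones[i]` is true when that step ended the episode (so the next stored observation comes from a reset), and `last_value` stands for \(V_\mathrm{old}(s_u)\).

```python
def value_targets(rewards, dones, last_value, gamma):
    """Discounted return-to-go per step, bootstrapped with last_value at the cut-off."""
    targets = [0.0] * len(rewards)
    running = last_value
    for i in reversed(range(len(rewards))):
        if dones[i]:
            # The episode ended here: nothing from after the reset may leak in.
            running = 0.0
        running = rewards[i] + gamma * running
        targets[i] = running
    return targets


rewards = [1.0, 0.0, 2.0, 1.0]
dones = [False, False, True, False]  # the episode ends after the third step
print(value_targets(rewards, dones, last_value=4.0, gamma=0.5))
# [1.5, 1.0, 2.0, 3.0]
```

The first three targets are plain discounted returns of the finished episode, while the last one, belonging to the freshly reset episode, is bootstrapped from `last_value`.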

References

[1] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov: Proximal Policy Optimization Algorithms, 2017.

[2] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan and Philipp Moritz: Trust Region Policy Optimization, 2015. Proceedings of the 32nd International Conference on Machine Learning, PMLR vol. 37, pp. 1889–1897.

[3] Sham Kakade and John Langford: Approximately Optimal Approximate Reinforcement Learning, 2002. Proceedings of the Nineteenth International Conference on Machine Learning (ICML '02). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 267–274.

[4] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan and Pieter Abbeel: High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2016. International Conference on Learning Representations (ICLR).

[5] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph and Aleksander Madry: Implementation Matters in Deep RL: A Case Study on PPO and TRPO, 2019. International Conference on Learning Representations (ICLR).