4/2 Proximal Policy Optimization (PPO)
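Recall that by the Policy Gradient Theorem, we have
$$ \nabla_\theta\mathbf E_{\pi_\theta} G_0 \propto \mathbf E_{\pi_\theta}\bigl[Q_{\pi_\theta}(S, A)\,\nabla_\theta\ln\pi_\theta(A|S)\bigr]. $$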
So, in an optimization algorithm based on this result, we need to provide some approximation to the action-value function \(Q_\pi\).
In REINFORCE, we
- sample a full trajectory \(s_0, a_0, r_1, s_1,\dotsc, r_T, s_T\) and then
- take the discounted returns \(\hat g_t=\sum_{k=0}^{T-t-1}\gamma^k r_{t+k+1}\) as estimates of the values \(Q_\pi(s_t, a_t)\).
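The discounted returns from the second step can be computed in one backward pass over the sampled rewards. The following is a minimal sketch in plain Python; the function name `discounted_returns` and the convention that `rewards[t]` stores \(r_{t+1}\) are our assumptions for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return the list [g_0, ..., g_{T-1}] of discounted returns.

    Assumed convention: rewards[t] stores r_{t+1}, the reward received
    after taking action a_t in state s_t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that g_t = r_{t+1} + gamma * g_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward recursion avoids recomputing the powers of \(\gamma\) for every timestep.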
Today, we'll study Proximal Policy Optimization (PPO) [1], one of the most popular on-policy DRL algorithms. Its policy update rule derives from another theoretical approach, which we now present.
Conservative Policy Iteration
For policies \(\pi\) and \(\pi'\), we have [2, Appendix A, Lemma 1]
$$ \mathbf E_{\pi'} G_0 = \mathbf E_\pi G_0 + \mathbf E_{\pi'} \sum_{t=0}^\infty \gamma^t A_\pi(S_t, A_t) $$ where \(A_\pi(S, A) = Q_\pi(S, A) - V_\pi(S)\) is the advantage function: it measures how much better the expected return is when we choose action \(A\) in state \(S\) and follow \(\pi\) afterwards, compared to following \(\pi\) from \(S\) on.
Given a parametrized policy \(\pi_{\theta_0}\), we want to find an updated parameter \(\theta\) that improves the expected return. Since \(\theta\) is not known yet, we cannot sample trajectories from \(\pi_\theta\), so the expectation \(\mathbf E_{\pi_\theta}\) above cannot be estimated directly. However, if we sample from the original policy:
$$ L^{CPI}(\theta) = \mathbf E_{\theta_0}\sum_{t=0}^\infty\gamma^t r_t(\theta) A_{\pi_{\theta_0}}(S_t,A_t), $$ where \(r_t(\theta)=\frac{\pi_\theta(A_t|S_t)}{\pi_{\theta_0}(A_t|S_t)}\), not to be confused with the reward at timestep \(t\), then we get an approximation to first degree [3, Section 4.1]:
$$ \nabla_\theta\mathbf E_\theta G_0\big|_{\theta=\theta_0}=\nabla_\theta L^{CPI}(\theta)\big|_{\theta=\theta_0}. $$ Here, CPI stands for Conservative Policy Iteration.
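To see where this first-order agreement comes from, we can differentiate the ratio at \(\theta=\theta_0\). The following short derivation is a sketch that uses only the definitions above:

$$ \nabla_\theta r_t(\theta)\big|_{\theta=\theta_0} = \frac{\nabla_\theta\pi_\theta(A_t|S_t)\big|_{\theta=\theta_0}}{\pi_{\theta_0}(A_t|S_t)} = \nabla_\theta\ln\pi_\theta(A_t|S_t)\big|_{\theta=\theta_0}, $$

and therefore

$$ \nabla_\theta L^{CPI}(\theta)\big|_{\theta=\theta_0} = \mathbf E_{\theta_0}\sum_{t=0}^\infty\gamma^t A_{\pi_{\theta_0}}(S_t,A_t)\,\nabla_\theta\ln\pi_\theta(A_t|S_t)\big|_{\theta=\theta_0}. $$

This is the policy gradient at \(\theta_0\) with the advantage in place of the action value. The two agree because subtracting the baseline \(V_{\pi_{\theta_0}}(S_t)\) does not change the expected gradient.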
Clipped Surrogate Objective
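So, a sufficiently small gradient ascent step from \(\theta_0\) in the direction \(\nabla_\theta L^{CPI}(\theta_0)\) improves the objective \(\mathbf E_\theta G_0\).
PPO offers an easy-to-implement objective whose gradients give such steps, while removing the incentive to move the ratio \(r_t(\theta)\) outside the interval \([1-\epsilon, 1+\epsilon]\):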
$$ L^{CLIP}(\theta)=\mathbf E_{\theta_0}\sum_{t=0}^\infty\gamma^t \min\bigl( r_t(\theta)A_{\pi_{\theta_0}}(S_t,A_t), \mathop{\mathrm{clip}}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_{\pi_{\theta_0}}(S_t,A_t) \bigr) $$ where:
- we have \(\mathop{\mathrm{clip}}(x, a, b)=\min(\max(x, a), b)\) and
- the clip coefficient \(\epsilon\) is a new hyperparameter (not to be confused with the \(\epsilon\) in Adam).
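To get a feel for how the clipping acts on a single sample, here is a minimal sketch in plain Python. The names `clipped_term` and `clipped_surrogate` are hypothetical. As an assumption that is not part of the formula above, the batch estimate takes a plain mean over samples and omits the \(\gamma^t\) weights.

```python
import math


def clip(x, low, high):
    """clip(x, a, b) = min(max(x, a), b), as defined in the text."""
    return min(max(x, low), high)


def clipped_term(ratio, advantage, epsilon):
    """One summand of L^CLIP for a given ratio r_t and advantage estimate."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)


def clipped_surrogate(new_logps, old_logps, advantages, epsilon=0.2):
    """Sample mean of the clipped terms (assumption: no gamma^t weights).

    The ratio is computed from log-probabilities as exp(new - old).
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        total += clipped_term(ratio, adv, epsilon)
    return total / len(advantages)


# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
print(clipped_term(1.5, 1.0, epsilon=0.25))   # 1.25
# Negative advantage: a ratio that is too large is still fully penalized.
print(clipped_term(1.5, -1.0, epsilon=0.25))  # -1.5
```

Because of the outer minimum, the objective is a pessimistic bound: clipping only ever removes gains, never penalties.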
Generalized Advantage Estimation [4]
Actor-Critic Algorithms
To implement an algorithm using \(L^{CLIP}\), we need to find a way to estimate the advantage \(A_\pi(S, A)\). To this end, we shall introduce a value function \(V\) that estimates the true value function \(V_\pi\). That is, we will be using a so-called actor-critic algorithm:
- the actor is the policy and
- the critic is the value function.
Monte Carlo Advantage
We can adapt the estimate \(Q_\pi(S_t,A_t)\approx\hat g_t\) we used in REINFORCE to a Monte Carlo advantage estimate
$$ \hat A^{\mathrm{MC}}_t = \hat g_t - V(s_t). $$
TD Residual
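On the other end of the spectrum, we have the so-called temporal difference (TD) residual
$$ \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), $$ which uses only one sampled reward and bootstraps the rest of the return from the value function.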
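In code, the one-step TD residual \(r_{t+1}+\gamma V(s_{t+1})-V(s_t)\) can be computed for a whole trajectory at once. This is a minimal sketch; the function name `td_residuals` and the array layout (`rewards[i]` stores \(r_{i+1}\), `values[i]` stores \(V(s_i)\), with one extra entry for the last state) are assumptions for illustration.

```python
def td_residuals(rewards, values, gamma):
    """Return the list of one-step TD residuals delta_0, ..., delta_{T-1}.

    Assumed layout: rewards[i] = r_{i+1}, values[i] = V(s_i), and
    len(values) == len(rewards) + 1. For a terminal last state, the
    final entry of values should be 0.0.
    """
    return [
        rewards[i] + gamma * values[i + 1] - values[i]
        for i in range(len(rewards))
    ]


rewards = [1.0, 0.0, 2.0]
values = [0.0, 1.0, 1.0, 0.0]  # last state is terminal
print(td_residuals(rewards, values, gamma=0.5))  # [1.5, -0.5, 1.0]
```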
\(k\)-Step TD
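Monte Carlo estimates have lower bias but higher variance; for TD residuals, it is the other way around. To balance between these extremes, we can use \(k\)-step TD residuals
$$ \hat A^{(k)}_t = \sum_{l=0}^{k-1}\gamma^l r_{t+l+1} + \gamma^k V(s_{t+k}) - V(s_t). $$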
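The \(k\)-step estimate sums \(k\) sampled rewards and bootstraps the remainder from the value function, that is, \(\sum_{l=0}^{k-1}\gamma^l r_{t+l+1}+\gamma^k V(s_{t+k})-V(s_t)\). A minimal sketch follows; the name `k_step_advantage`, the array layout, and the truncation of \(k\) at the end of the trajectory are our assumptions.

```python
def k_step_advantage(rewards, values, gamma, t, k):
    """k-step TD residual at timestep t.

    Assumed layout: rewards[i] = r_{i+1}, values[i] = V(s_i), with
    len(values) == len(rewards) + 1 and values[-1] == 0.0 if the last
    state is terminal. Assumption: k is truncated at the trajectory end.
    """
    k = min(k, len(rewards) - t)
    partial_return = sum(gamma ** l * rewards[t + l] for l in range(k))
    return partial_return + gamma ** k * values[t + k] - values[t]


rewards = [1.0, 0.0, 2.0]
values = [0.0, 1.0, 1.0, 0.0]  # last state is terminal
print(k_step_advantage(rewards, values, 0.5, t=0, k=1))  # 1.5, the TD residual
print(k_step_advantage(rewards, values, 0.5, t=0, k=2))  # 1.25
print(k_step_advantage(rewards, values, 0.5, t=0, k=3))  # 1.5, the Monte Carlo estimate
```

With \(k=1\) we recover the TD residual, and with \(k\) reaching the end of a terminated episode we recover the Monte Carlo estimate.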
Generalized Advantage Estimator
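A Generalized Advantage Estimator (GAE) is an exponentially weighted average of the \(k\)-step TD residuals \(\hat A^{(k)}_t\):
$$ \hat A^{\mathrm{GAE}(\gamma,\lambda)}_t = (1-\lambda)\sum_{k=1}^\infty\lambda^{k-1}\hat A^{(k)}_t = \sum_{l=0}^\infty(\gamma\lambda)^l\delta_{t+l}, $$ where \(\delta_t = r_{t+1}+\gamma V(s_{t+1})-V(s_t)\) is the one-step TD residual.
We have a new hyperparameter \(\lambda\in[0,1]\): the GAE coefficient. Setting \(\lambda=0\) recovers the one-step TD residual, and \(\lambda=1\) recovers the Monte Carlo advantage estimate.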
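In practice, the GAE is usually computed with a backward recursion over the one-step TD residuals, using \(\hat A_t = \delta_t + \gamma\lambda\hat A_{t+1}\). The sketch below assumes a single trajectory segment without episode boundaries inside it; the name `gae_advantages` and the array layout are hypothetical.

```python
def gae_advantages(rewards, values, gamma, lam):
    """GAE(gamma, lambda) advantages for one trajectory segment.

    Assumed layout: rewards[i] = r_{i+1}, values[i] = V(s_i), with
    len(values) == len(rewards) + 1. The last entry of values is the
    bootstrap value of the final state (0.0 if that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual delta_t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.0, 1.0, 1.0, 0.0]  # last state is terminal
print(gae_advantages(rewards, values, gamma=0.5, lam=0.0))  # [1.5, -0.5, 1.0]
print(gae_advantages(rewards, values, gamma=0.5, lam=0.5))  # [1.4375, -0.25, 1.0]
print(gae_advantages(rewards, values, gamma=0.5, lam=1.0))  # [1.5, 0.0, 1.0]
```

The first output equals the TD residuals and the last equals the discounted returns minus the values, which matches the two extremes of \(\lambda\).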
PPO Training Algorithm
PPO training proceeds in iterations. Each iteration consists of the following steps:
- Given our present policy and value functions, which we denote by \(\pi_{\theta_\mathrm{old}}\) and \(V_{\theta_\mathrm{old}}\), we run a number of environments in parallel to generate a number of partial trajectories
  \[ s_t,a_t,r_{t+1},s_{t+1},\dotsc,r_u,s_u. \]
  - To support batching, when an environment terminates, we immediately reset it and output the new starting observation as the next step.
- On this data, we calculate the advantage estimates \(\hat A^{\mathrm{GAE}(\gamma,\lambda)}_t\).
- We train for a given number of epochs on this dataset.
  - The value function is trained with a loss that is clipped similarly to the policy objective [5]:
    $$ L^V(\theta)=\min\bigl( (V_\theta-V_\mathrm{targ})^2, (\mathop{\mathrm{clip}}( V_\theta,V_\mathrm{old}-\epsilon,V_\mathrm{old}+\epsilon ) - V_\mathrm{targ})^2 \bigr) $$ where we have, at timestep \(t\le k< u\):
    \[ V_\mathrm{targ} = \begin{cases} \sum_{l=k}^{T-1}\gamma^{l-k} r_{l+1} & T\le u\text{ terminal} \\ \sum_{l=k}^{u-1}\gamma^{l-k} r_{l+1} + \gamma^{u-k} V_\mathrm{old}(s_u) & \text{otherwise.} \end{cases} \]
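The value targets and the clipped value loss can be sketched as follows. We assume that the partial trajectory has already been split at episode boundaries, so that a segment ends either in a terminal state (bootstrap value 0) or at \(s_u\) (bootstrap value \(V_\mathrm{old}(s_u)\)), and that each reward is discounted relative to the timestep of its target. The names `value_targets` and `clipped_value_loss` are hypothetical.

```python
def value_targets(rewards, bootstrap_value, gamma):
    """Discounted-return targets for one segment of a partial trajectory.

    Assumed layout: rewards[i] is the reward following the i-th state of
    the segment. bootstrap_value is V_old(s_u) for a truncated segment
    and 0.0 if the segment ends in a terminal state.
    """
    targets = [0.0] * len(rewards)
    running = bootstrap_value
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        targets[i] = running
    return targets


def clipped_value_loss(value, old_value, target, epsilon):
    """Per-sample L^V, following the formula in the text literally."""
    unclipped = (value - target) ** 2
    clipped_value = min(max(value, old_value - epsilon), old_value + epsilon)
    clipped = (clipped_value - target) ** 2
    return min(unclipped, clipped)


print(value_targets([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5))  # [2.0, 2.0, 4.0]
print(value_targets([1.0, 0.0, 2.0], bootstrap_value=0.0, gamma=0.5))  # [1.5, 1.0, 2.0]
print(clipped_value_loss(3.0, old_value=2.0, target=2.0, epsilon=0.5))  # 0.25
```

The sketch follows the formula as written here, with a minimum over the two squared errors. Some implementations take the maximum of the two terms instead, so it is worth checking which variant a given codebase uses.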
References
[1] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov: Proximal Policy Optimization Algorithms, 2017.
[2] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan and Philipp Moritz: Trust Region Policy Optimization, 2015. Proceedings of the 32nd International Conference on Machine Learning, PMLR vol. 37, pp. 1889–1897.
[3] Sham Kakade and John Langford: Approximately Optimal Approximate Reinforcement Learning, 2002. Proceedings of the Nineteenth International Conference on Machine Learning (ICML '02). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 267–274.
[4] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan and Pieter Abbeel: High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2016. International Conference on Learning Representations (ICLR).
[5] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph and Aleksander Madry: Implementation Matters in Deep RL: A Case Study on PPO and TRPO, 2019. International Conference on Learning Representations (ICLR).