3/28 REINFORCE

Setup: Continuous State Space

Today, we'll cover our first algorithm that deals with MDPs whose state space $\mathscr S$ is continuous. In this case, tabular methods such as $Q$-learning are infeasible because there are infinitely many states.

Example: Cart Pole

The Hello World problem of Deep Reinforcement Learning is Cart Pole [1]:
https://gymnasium.farama.org/environments/classic_control/cart_pole/
Here, you have to balance a pole standing on a cart.

The state space has 4 components:
1. Cart Position,
2. Cart Velocity,
3. Pole Angle and
4. Pole Angular Velocity.
The action space is still discrete:
1. We can either push the cart to the left or
2. We can push it to the right
with a force of fixed magnitude.
We terminate the episode in each of the following cases:
1. The pole angle is outside $[-12^\circ, 12^\circ]$,
2. The cart position is outside $[-2.4, 2.4]$ or
3. 500 steps passed.
At each step, we get a reward of 1. Thus, the maximum undiscounted return of an episode is 500.
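The stopping rules and the reward can be written down in a few lines of plain Python. This is only a sketch of the rules listed above, not the Gymnasium implementation: the names `is_done` and `episode_return` are hypothetical, and the pole angle is taken in degrees here for readability.

```python
# Limits taken from the list of stopping conditions above.
MAX_ANGLE_DEG = 12.0
MAX_POSITION = 2.4
MAX_STEPS = 500


def is_done(position, angle_deg, steps_taken):
    """Return True if any of the three stopping conditions holds."""
    return (
        abs(angle_deg) > MAX_ANGLE_DEG
        or abs(position) > MAX_POSITION
        or steps_taken >= MAX_STEPS
    )


def episode_return(num_steps):
    """Undiscounted return of an episode: a reward of 1 per step taken."""
    return sum(1 for _ in range(num_steps))
```

An episode that survives all 500 steps therefore collects `episode_return(500)`, the largest value possible.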

Policy Learning

Today, we'll use a policy learning method. That is, we will directly optimize a policy $$ \mathscr S\xrightarrow{\pi_\theta}\mathscr P(\mathscr A). $$ Note that as the state space $\mathscr S$ is continuous, we'll use a parametric policy with parameter $\theta$.
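As a minimal sketch of what a parametric policy can look like, the block below maps a 4-component state to a distribution over 2 actions. The linear-softmax form, the function names `softmax` and `policy`, and all numeric values are assumptions made for illustration; in practice $\pi_\theta$ is usually a neural network.

```python
import math


def softmax(logits):
    """Turn unnormalized logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, state):
    """Hypothetical linear-softmax policy.

    theta holds one weight vector per action; the logit of an action is
    the dot product of its weight vector with the state.
    """
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    return softmax(logits)


# Illustrative values only: 2 actions (left, right), 4 state components.
theta = [[0.0, 0.0, -1.0, -0.5],
         [0.0, 0.0, 1.0, 0.5]]
state = [0.0, 0.1, 0.05, -0.2]
probs = policy(theta, state)  # probabilities of pushing left and right
```

Changing `theta` changes the action probabilities for every state at once, which is what makes the policy learnable on a continuous state space.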

Policy Gradient Theorem

Recall that one of the most important performance measures of a policy $\pi_\theta$ is the expected discounted return from the initial state. That is, we want to maximize the following optimization objective: $$ J(\theta)=\mathbf E_\theta G_0, $$ where we let $\mathbf E_\theta=\mathbf E_{\pi_\theta}$, for simplicity.

Previously, we discussed how difficult it is to approximate the expected discounted return $\mathbf E_\theta G_0$ directly. Thus, to optimize this objective, we want to rewrite its gradient in a more tractable form.

Theorem (Policy Gradient Theorem) [2, Theorem 1] There exists a constant $C>0$ such that for all parameter values $\theta$, we have

\[\begin{align*} \nabla_\theta J(\theta) &=C\,\mathbf E_{S\sim d^{\pi_\theta}} \sum_{a\in\mathscr A}q_\pi(S, a)\nabla_\theta\pi_\theta(a|S) \\ &=C\,\mathbf E_{S\sim d^{\pi_\theta},\,A\sim\pi_\theta(S)} q_\pi(S, A)\nabla_\theta\log\pi_\theta(A|S), \end{align*}\]

where $d^{\pi_\theta}$ denotes the distribution of states visited when following $\pi_\theta$, and $q_\pi$ is the action-value function of $\pi=\pi_\theta$.
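
The second equality in the theorem is the log-derivative identity $\nabla_\theta\pi_\theta(a|s)=\pi_\theta(a|s)\nabla_\theta\log\pi_\theta(a|s)$, which holds wherever $\pi_\theta(a|s)>0$. Spelled out for a fixed state $s$, as a short sketch rather than a full proof:

\[\begin{align*} \sum_{a\in\mathscr A}q_\pi(s, a)\nabla_\theta\pi_\theta(a|s) &=\sum_{a\in\mathscr A}\pi_\theta(a|s)\,q_\pi(s, a)\nabla_\theta\log\pi_\theta(a|s) \\ &=\mathbf E_{A\sim\pi_\theta(s)}\,q_\pi(s, A)\nabla_\theta\log\pi_\theta(A|s). \end{align*}\]

Taking the expectation over $S\sim d^{\pi_\theta}$ on both sides gives the second line of the theorem. The point of this form is that it is an expectation over actions drawn from the policy itself, so it can be estimated from sampled trajectories.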

Algorithms that optimize a parametric policy via gradient ascent on $J$ (equivalently, gradient descent on $-J$) are called policy gradient methods.

REINFORCE

Today, we'll cover the simplest policy gradient method, REINFORCE, which stands for REward Increment = Nonnegative Factor $\times$ Offset Reinforcement $\times$ Characteristic Eligibility [3].

An update step in the algorithm is as follows:

1. We generate a full trajectory $$ s_0, a_0, r_1, s_1, \dotsc, r_T, s_T $$ following $\pi_\theta$.
2. We use discounted returns as estimates of the action values: $$ q_\pi(s_t,a_t)\approx\sum_{k=1}^{T-t}\gamma^{k-1} r_{t+k}=:g_t. $$ This is called a Monte Carlo estimate of $q_\pi(s_t,a_t)$.
3. We perform an optimization step with the gradient given by the Policy Gradient Theorem (we take the negative because optimizers minimize a loss): $$ \nabla_\theta = -\frac{1}{T}\sum_{t=0}^{T-1}g_t\nabla_\theta\log\pi_\theta(a_t|s_t) $$
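
The update step above can be sketched in plain Python. The block assumes a linear-softmax policy (one weight vector per action), for which the gradient of the log-probability has a closed form; that policy class, the function names, and the `learning_rate` parameter are illustrative assumptions, not part of the algorithm's definition. `rewards[t]` holds $r_{t+1}$.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma):
    """g_t = sum_{k=1}^{T-t} gamma^(k-1) * r_{t+k}, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def grad_log_policy(theta, state, action):
    """Gradient of log pi_theta(action | state) for a linear-softmax policy.

    For the weight row of action b it equals (1[b == action] - pi(b|s)) * s.
    """
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    probs = softmax(logits)
    return [
        [((1.0 if b == action else 0.0) - probs[b]) * x for x in state]
        for b in range(len(theta))
    ]


def reinforce_loss_gradient(theta, states, actions, rewards, gamma):
    """Loss gradient -(1/T) * sum_t g_t * grad log pi(a_t | s_t)."""
    T = len(rewards)
    returns = discounted_returns(rewards, gamma)
    grad = [[0.0] * len(row) for row in theta]
    for s, a, g in zip(states, actions, returns):
        glp = grad_log_policy(theta, s, a)
        for b in range(len(theta)):
            for j in range(len(theta[b])):
                grad[b][j] -= g * glp[b][j] / T
    return grad


def sgd_step(theta, grad, learning_rate):
    """One plain gradient-descent step on the loss."""
    return [
        [w - learning_rate * dw for w, dw in zip(row, grad_row)]
        for row, grad_row in zip(theta, grad)
    ]
```

Because the loss gradient carries the minus sign, a descent step on it moves $\theta$ in the direction that makes actions followed by high returns more likely.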

Now we can explain the terms in the acronym REINFORCE:

Reward increment refers to the parameter update $\Delta\theta=-\alpha\nabla_\theta$, as $-\nabla_\theta$ is an approximation of the direction of steepest ascent $\nabla_\theta J(\theta)$.
Nonnegative factor is the learning rate $\alpha$.
Offset reinforcement is the factor $g_t-b(s_t)$. Here, $b(s_t)$ is an optional baseline. For now, we set this to 0.
Characteristic eligibility is the score $\nabla_\theta\log\pi_\theta(a_t|s_t)$, that is, the gradient of the log-probability of the chosen action.

Action Sampling with the Gumbel-Max Trick

When the action space is finite, say $|\mathscr A|=c$, we can specify an action distribution by a vector of unnormalized logits:

\[ f_\theta(s)\in\mathbf R^c. \]

To sample an action, the straightforward approach would be to:

1. Convert the unnormalized logits to probabilities using the softmax function.
2. Sample actions from the categorical distribution with the given probabilities.

Fortunately, there is a more direct way to go about this:

Theorem (Gumbel-Max trick) [4, Equation 2]. Let $Z$ be a random variable with the categorical distribution given by unnormalized logits $z_1,\dotsc,z_c$. Let $G_1,\dotsc,G_c$ be independent random variables with the standard Gumbel distribution. Then we have

\[ Z\sim\mathop{\mathrm{argmax}}\{G_i+z_i:i=1,\dotsc,c\}. \]

So, to sample from the action distribution defined by $f_\theta(s)$, all we need to do is:

1. Sample $c$ independent values from the standard Gumbel distribution, one per action.
2. Add the values to $f_\theta(s)$.
3. Take the argmax.
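
The three steps above can be sketched with the Python standard library alone. A standard Gumbel sample is obtained as $-\log(-\log U)$ with $U$ uniform on $(0,1)$. The function names and the example logits are hypothetical; the loop at the end compares empirical frequencies with the softmax probabilities as a sanity check.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel sample as -log(-log(U)), U ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # rng.random() lies in [0, 1); avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, rng):
    """Sample an index from the categorical distribution given by logits."""
    perturbed = [z + sample_gumbel(rng) for z in logits]
    return max(range(len(logits)), key=perturbed.__getitem__)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Sanity check with illustrative logits: frequencies should approach softmax.
rng = random.Random(0)
logits = [1.0, 0.0, -1.0]
num_samples = 20000
counts = [0] * len(logits)
for _ in range(num_samples):
    counts[gumbel_max_sample(logits, rng)] += 1
freqs = [count / num_samples for count in counts]
probs = softmax(logits)
```

Note that the sampler never normalizes the logits: only additions and one argmax are needed, which is why the trick vectorizes well.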

Aside: Gumbel-Top-$k$ Trick

While we're on the subject:

\[ \mathop{\mathrm{arg top}}k\{G_i+z_i:i=1,\dotsc,c\} \]

has the same distribution as sampling $k$ entries from $Z$ in order without replacement [5, 2.4]. So this, too, can be done on a GPU in a parallel fashion!
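
A minimal sketch of the top-$k$ variant, again using only the standard library: perturb every logit once, then keep the indices of the $k$ largest perturbed values. The function names and example logits are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel sample as -log(-log(U)), U ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # rng.random() lies in [0, 1); avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_top_k(logits, k, rng):
    """Return k distinct indices, ordered by decreasing perturbed logit."""
    perturbed = [z + sample_gumbel(rng) for z in logits]
    order = sorted(range(len(logits)), key=perturbed.__getitem__, reverse=True)
    return order[:k]


# Illustrative call: draw 2 of 4 categories without replacement.
sample = gumbel_top_k([2.0, 0.5, 0.0, -0.5], 2, random.Random(0))
```

The first returned index is exactly a Gumbel-Max sample, so the $k=1$ case recovers the trick from the previous section.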

References

[2] Richard S. Sutton, David McAllester, Satinder Singh and Yishay Mansour: Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999. Advances in Neural Information Processing Systems 12 (NIPS 1999).

[3] Ronald J. Williams: Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992. Machine Learning, vol. 8, pp. 229--256. doi: 10.1023/A:1022672621406.

[4] Chris J. Maddison, Daniel Tarlow and Tom Minka: $A^*$ Sampling, 2014. Advances in Neural Information Processing Systems 27 (NIPS 2014).

[5] Wouter Kool, Herke van Hoof and Max Welling: Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement, 2019. The Thirty-Sixth International Conference on Machine Learning (ICML).

Dataset References

[1] Andrew G. Barto, Richard S. Sutton and Charles W. Anderson: Neuronlike adaptive elements that can solve difficult learning control problems, 1983. IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13 (5), pp. 834--846. doi: 10.1109/TSMC.1983.6313077.