
3/28 REINFORCE

Setup: Continuous State Space

Today, we'll cover the first algorithm that tries to deal with MDPs whose state space \(\mathscr S\) is continuous. Note that in this case, tabular methods such as \(Q\)-learning are infeasible, as there are infinitely many states.

Example: Cart Pole

The Hello World problem of Deep Reinforcement Learning is Cart Pole [1]:
https://gymnasium.farama.org/environments/classic_control/cart_pole/
Here, you have to balance a pole standing on a cart.

  1. The state space has 4 components:
    1. Cart Position,
    2. Cart Velocity,
    3. Pole Angle and
    4. Pole Angular Velocity.
  2. The action space is still discrete:

    1. We can either push the cart to the left or
    2. We can push it to the right

    with a fixed amount.

  3. We terminate the episode in each of the following cases:

    1. The pole angle is outside \([-12^\circ, 12^\circ]\),
    2. The cart position is outside \([-2.4, 2.4]\) or
    3. 500 steps passed.
  4. At each step, we get a reward of 1. Thus, the maximum total (undiscounted) reward per episode is 500.
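
The stopping rules and the reward scheme above can be sketched in plain Python. This is only an illustration of the bookkeeping, not the Gymnasium implementation: the names `episode_over` and `total_reward` are made up for this example, and we assume that the step which ends the episode is still rewarded.

```python
import math

ANGLE_LIMIT = math.radians(12)   # pole angle limit, converted to radians
POSITION_LIMIT = 2.4             # cart position limit
MAX_STEPS = 500                  # step limit per episode


def episode_over(cart_position, pole_angle, steps_taken):
    """Return True if any of the three stopping conditions holds."""
    return (
        abs(pole_angle) > ANGLE_LIMIT
        or abs(cart_position) > POSITION_LIMIT
        or steps_taken >= MAX_STEPS
    )


def total_reward(states):
    """Sum the per-step reward of 1 over a sequence of (position, angle) pairs.

    `states` lists the state reached after each step. Counting stops at the
    first state that ends the episode (assumption: that step is still rewarded).
    """
    total = 0
    for step, (position, angle) in enumerate(states, start=1):
        total += 1
        if episode_over(position, angle, step):
            break
    return total
```

A perfectly balanced run, `total_reward([(0.0, 0.0)] * 600)`, is cut off by the step limit and so collects exactly 500.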

Policy Learning

Today, we'll use a policy learning method. That is, we will directly optimize a policy $$ \mathscr S\xrightarrow{\pi_\theta}\mathscr P(\mathscr A). $$ Note that as the state space \(\mathscr S\) is continuous, we'll use a parametric policy with parameter \(\theta\).
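
To make the notion of a parametric policy concrete, here is a minimal sketch of one possible choice: a linear softmax policy for a finite action space. The linear model, the function names and the parameter values are assumptions made for illustration; in practice \(\pi_\theta\) is typically a neural network.

```python
import math


def softmax(logits):
    """Turn unnormalized logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_softmax_policy(theta, state):
    """Return pi_theta(. | state) for a linear model with one weight row per action."""
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    return softmax(logits)


# Hypothetical parameters: 2 actions, 4 state components (as in Cart Pole).
theta = [[0.0, 0.0, 1.0, 0.5],
         [0.0, 0.0, -1.0, -0.5]]
state = [0.0, 0.1, 0.05, -0.2]
probs = linear_softmax_policy(theta, state)  # one probability per action
```

Here `theta` plays the role of \(\theta\), and `probs` is the element of \(\mathscr P(\mathscr A)\) assigned to `state`.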

Policy Gradient Theorem

Recall that one of the most important performance measures of a policy \(\pi_\theta\) is the expected discounted return from the initial state. That is, we want to maximize the following optimization objective: $$ J(\theta)=\mathbf E_\theta G_0, $$ where we let \(\mathbf E_\theta=\mathbf E_{\pi_\theta}\), for simplicity.

Previously, we discussed how it is very difficult to approximate the expected discounted return \(\mathbf E_\theta G_0\). Thus, to optimize this objective, we want to rewrite its gradient in a more tractable form.

Theorem (Policy Gradient Theorem) [2, Theorem 1] There exists a constant \(C>0\) such that for all parameter values \(\theta\), we have

\[\begin{align*} \nabla_\theta J &=C\mathbf E_{S\sim d^{\pi_\theta}} \sum_{a\in\mathscr A}q_\pi(S, a)\nabla_\theta\pi_\theta(a|S) \\ &=C\mathbf E_{S\sim d^{\pi_\theta},\,A\sim\pi_\theta(S)} q_\pi(S, A)\nabla_\theta\log\pi_\theta(A|S), \end{align*}\]

where \(d^{\pi_\theta}\) denotes the distribution of states when following \(\pi_\theta\).
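
The step between the two forms of the gradient is the log-derivative identity \(\nabla_\theta\pi_\theta=\pi_\theta\nabla_\theta\log\pi_\theta\). Spelled out for a fixed state \(s\), and assuming \(\pi_\theta(a|s)>0\) for every action (as is the case for a softmax policy):

\[\begin{align*} \sum_{a\in\mathscr A}q_\pi(s, a)\nabla_\theta\pi_\theta(a|s) &=\sum_{a\in\mathscr A}\pi_\theta(a|s)\,q_\pi(s, a)\frac{\nabla_\theta\pi_\theta(a|s)}{\pi_\theta(a|s)} \\ &=\mathbf E_{A\sim\pi_\theta(s)}\, q_\pi(s, A)\nabla_\theta\log\pi_\theta(A|s). \end{align*}\]

The benefit of the second form is that it is an expectation over actions drawn from \(\pi_\theta\), so it can be estimated from sampled trajectories.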

Algorithms that optimize a parametric policy via gradient descent (or, equivalently, gradient ascent on \(J\)) are called policy gradient methods.

REINFORCE

Today, we'll cover the simplest policy gradient method, REINFORCE, which stands for REward Increment = Nonnegative Factor \(\times\) Offset Reinforcement \(\times\) Characteristic Eligibility [3].

An update step in the algorithm is as follows:

  1. We generate a full trajectory $$ s_0, a_0, r_1, s_1, \dotsc, r_T, s_T $$ following \(\pi_\theta\).
  2. We use discounted returns as estimates of the action values: $$ q_\pi(s_t,a_t)\approx\sum_{k=1}^{T-t}\gamma^{k-1} r_{t+k}=:g_t. $$ This is called a Monte Carlo estimate of \(q_\pi(s_t,a_t)\).
  3. We perform an optimization step with the gradient we get as per the Policy Gradient Theorem (we take the negative as we optimize with loss gradients): $$ \nabla_\theta = -\frac{1}{T}\sum_{t=0}^{T-1}g_t\nabla_\theta\log\pi_\theta(a_t|s_t) $$
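
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the policy is assumed to be a linear softmax policy (so that \(\nabla_\theta\log\pi_\theta\) has a closed form), the trajectory is passed in as lists, and all function names are made up for this example.

```python
import math


def discounted_returns(rewards, gamma):
    """g_t = sum_{k=1}^{T-t} gamma^(k-1) * r_{t+k}, for t = 0, ..., T-1.

    `rewards` holds r_1, ..., r_T, so rewards[t] is r_{t+1}.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_policy(theta, state, action):
    """Gradient of log pi_theta(action | state) for a linear softmax policy.

    With logits z_b = theta[b] . state, the gradient with respect to the row
    theta[b] is (1[b == action] - pi(b | state)) * state.
    """
    probs = softmax([sum(w * x for w, x in zip(row, state)) for row in theta])
    return [[((1.0 if b == action else 0.0) - probs[b]) * x for x in state]
            for b in range(len(theta))]


def reinforce_loss_gradient(theta, states, actions, rewards, gamma):
    """Compute -(1/T) * sum_t g_t * grad log pi_theta(a_t | s_t)."""
    T = len(rewards)
    returns = discounted_returns(rewards, gamma)
    grad = [[0.0] * len(row) for row in theta]
    # states = s_0..s_{T-1}, actions = a_0..a_{T-1}, returns = g_0..g_{T-1}
    for s, a, g in zip(states, actions, returns):
        glp = grad_log_policy(theta, s, a)
        for b, row in enumerate(glp):
            for j, value in enumerate(row):
                grad[b][j] -= g * value / T
    return grad
```

For example, `discounted_returns([1, 1, 1], 0.5)` gives `[1.75, 1.5, 1.0]`. A gradient descent step `theta[b][j] -= lr * grad[b][j]` then raises the probability of actions that were followed by large returns.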

Now we can explain the terms in the acronym REINFORCE:

  1. Reward increment refers to \(-\nabla_\theta\), as it is an approximation of the direction of steepest ascent \(\nabla J(\theta)\).
  2. Nonnegative factor is the learning rate.
  3. Offset reinforcement is the factor \(g_t-b(s_t)\). Here, \(b(s_t)\) is an optional baseline. For now, we set this to 0.
  4. Characteristic eligibility is the gradient of the log-policy \(\nabla_\theta\log\pi_\theta(a_t|s_t)\).

Action Sampling with the Gumbel-Max Trick

If the action space is finite, say \(|\mathscr A|=c\), we can specify an action distribution by a vector of unnormalized logits:

\[ f_\theta(s)\in\mathbf R^c. \]

To sample an action, the straightforward approach would be to:

  1. Convert the unnormalized logits to probabilities using the softmax function.
  2. Sample actions from the categorical distribution with the given probabilities.

Fortunately, there is a more direct way to go about this:

Theorem (Gumbel-Max trick) [4, Equation 2]. Let \(Z\) be a categorical random variable with unnormalized logits \(z_1,\dotsc,z_c\). Let \(G_1,\dotsc,G_c\) be independent random variables with the standard Gumbel distribution. Then we have

\[ Z\sim\mathop{\mathrm{argmax}}\{G_i+z_i:i=1,\dotsc,c\}. \]

So, to sample from the action distribution defined by \(f_\theta(s)\), all we need to do is:

  1. Sample \(c\) independent values from the standard Gumbel distribution.
  2. Add the values to \(f_\theta(s)\).
  3. Take the argmax.
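
The three steps above can be sketched with the Python standard library. Standard Gumbel samples are generated here as \(-\log(-\log U)\) with \(U\) uniform on \((0,1)\); the function names and the example logits are made up for this illustration.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel sample as -log(-log(U)), U ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:          # random() lies in [0, 1); avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, rng):
    """Sample an index from the categorical distribution given by `logits`."""
    perturbed = [z + sample_gumbel(rng) for z in logits]
    return max(range(len(logits)), key=perturbed.__getitem__)


# Empirical check: the sampling frequencies should approach softmax(logits).
rng = random.Random(0)
logits = [1.0, 0.0, -1.0]
counts = [0] * len(logits)
n = 20000
for _ in range(n):
    counts[gumbel_max_sample(logits, rng)] += 1
frequencies = [count / n for count in counts]
```

Note that no normalization is needed: the softmax probabilities never have to be computed to draw a sample.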

Aside: Gumbel-Top-\(k\) Trick

While we're on the subject:

\[ \mathop{\mathrm{arg top}}k\{G_i+z_i:i=1,\dotsc,c\} \]

has the same distribution as sampling \(k\) entries from \(Z\) without replacement [5, 2.4]. So sampling without replacement can also be done on a GPU in a parallelizable manner!
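
As a rough sketch (plain Python rather than GPU code, with a made-up function name and example logits), the top-\(k\) variant only replaces the argmax by a sort of the perturbed logits:

```python
import math
import random


def gumbel_top_k(logits, k, rng):
    """Return k distinct indices, ordered by decreasing perturbed logit."""
    def gumbel():
        u = rng.random()
        while u == 0.0:      # random() lies in [0, 1); avoid log(0)
            u = rng.random()
        return -math.log(-math.log(u))

    perturbed = [z + gumbel() for z in logits]
    order = sorted(range(len(logits)), key=perturbed.__getitem__, reverse=True)
    return order[:k]


rng = random.Random(0)
sample = gumbel_top_k([2.0, 1.0, 0.0, -1.0], k=2, rng=rng)  # two distinct indices
```

Because each index receives exactly one perturbed value, the returned indices are distinct by construction.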

References

[2] Richard S. Sutton, David McAllester, Satinder Singh and Yishay Mansour: Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999. Advances in Neural Information Processing Systems 12 (NIPS 1999).

[3] Ronald J. Williams: Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992. Machine Learning, vol. 8, pp. 229--256, doi: 10.1023/A:1022672621406.

[4] Chris J. Maddison, Daniel Tarlow and Tom Minka: \(A^*\) Sampling, 2014. Advances in Neural Information Processing Systems 27 (NIPS 2014).

[5] Wouter Kool, Herke van Hoof and Max Welling: Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement, 2019. The Thirty-Sixth International Conference on Machine Learning (ICML).

Dataset References

[1] Andrew G. Barto, Richard S. Sutton and Charles W. Anderson: Neuronlike adaptive elements that can solve difficult learning control problems, 1983. IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13 (5), pp. 834--846, doi: 10.1109/TSMC.1983.6313077.