Homework 9

In the PBT paper [1], in reinforcement learning tasks, instead of Welch's \(t\)-test, they use the following exploitation method: they replace the worst 20% of the population with copies of the best 20%, while perturbing the hyperparameters. Let's see for ourselves how the two methods compare when using REINFORCE. Since, as I said in Lecture 3/28, Cart Pole [2] is too easy a task to benchmark most Deep Reinforcement Learning algorithms, we'll use Lunar Lander [3] -- be sure to check the generated videos!

To install the Lunar Lander environment (while keeping the previously installed Cart Pole), use the following two installation steps:
```
pip install swig
pip install gymnasium[box2d,toy-text]
```

To compare the two exploitation methods on Lunar Lander, make the following modifications to Notebook 0328:

Change the following in the configuration dictionary:
1. Change "env_id" to "LunarLander-v3".
2. Create a new key "exploit_method" that lets us switch between the two methods "truncation" and "welch". The initial value is irrelevant as we'll try out both in the assignment.
3. Create a new key "truncate_proportion" that controls what proportion of worst performing population members are replaced by copies of the best performing population members. In the paper, this is set to .2; feel free to check out another value.
4. Optionally, lower "welch_confidence_level". I ran a small number of experiments and I got the best results for 70%-90%.

We start the function pbt_update, as given in Notebook 0328, by getting the pair source_mask, target_indices that determines exploitation.

Refactor getting this pair into a function call to a new function

def exploit_welch(
    config: dict,
    evaluations: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Exploitation in Population-Based Training using Welch's t-test.

    Parameters
    ----------
    config : `dict`
        Configuration dictionary. Required key-value pair:
        `"welch_confidence_level"` : `float`
            Confidence level to use in Welch's t-test.
    evaluations : `torch.Tensor`
        Tensor of evaluations. We assume it has shape
        `(sample_size, population_size)`.

    Algorithm
    ---------
    1. Sample a tensor `target_indices` of shape `(population_size,)`
    of population member indices.
    2. Create a tensor `source_mask` of shape `(population_size,)`
    that says for which population member index `i`
    cannot Welch's t-test with `config['welch_confidence_level']`
    refute the hypothesis that the evaluations
    coming from `target_indices[i]`
    can be expected to be better that the evaluations coming from `i`,
    based on the appropriate subtensors of `evaluations` as samples.


    Returns
    -------
    The pair `source_indices, target_indices`, where
    1. `source_indices` is the tensor of indices
    where `source_mask` is `True`.
    2. `target_indices` is the collection of target indices as defined above
    masked by the source mask.
    """

Please note that most of the function is already written in pbt_update in Notebook 0328. You only need to refactor the code and change the output.

Delete the part of pbt_update where you update the log dictionary items "source_mask" and "target_indices". (In the new setup, you can get source_indices and target_indices of varying size, which thus can't be stacked.)
Adapt the rest of pbt_update to the change from using source_mask, target_indices to source_indices, target_indices, where target_indices now only contains the target indices at source_indices.

Give a switch based on config["exploit_method"] to pbt_update such that if that is not welch but truncation, then the pair source_indices, target_indices should be gotten from the following function:

def exploit_truncation(
    config: dict,
    evaluations: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:    
    """
    Exploitation in Population-Based Training using trunction.

    Parameters
    ----------
    config : `dict`
        Configuration dictionary. Required key-value pair:
        `'truncate_proportion'` : `float`
            Proportion of population members with worst estimated performance
            to be replaced by population members with
            best estimated performance.
    evaluations : `torch.Tensor`
        Tensor of evaluations. We assume it has shape
        `(sample_size, population_size)`.

    Algorithm
    ---------
    1. Rank the population members based on the averages in `evaluations`.
    2. Based on this ranking, select the `config['truncate_proportion'`]`
    of `population_size`, rounded to the nearest integer,
    of population members with the worst performance
    for replacement by the same number of population members
    with the best performance.

    Returns
    -------
    The pair `source_indices, target_indices`, where
    1. `source_indices` is the tensor of population member indices
    to be replaced, and
    2. `target_indices` is the tensor of population member indices
    to serve as replacement.
    """

Please note that in this case, you need to write this function.

Make two training runs, one with exploitation by truncation, another with exploitation by Welch's \(t\)-test. Compare the best performing population members per evaluation:
1. You can use the function reinforce in Notebook 0328 for this, once you updated pbt_update as above.
2. For the two runs, adjust config["exploit_method"] accordingly.
3. The function reinforce outputs a log dictionary. At its "evaluations" key, it has a tensor of shape (evaluation_num, sample_size, population_size) that stores evaluations. You can take its average along the sample dimension, and then the maximum along the population dimension, to get a vector of best average evaluations per evaluation step.
4. Get this vector for both exploitation methods and compare them by making one line plot of each on the same canvas.

Note

You can find more involved versions of the landing problem on the coding puzzle and bot competition website CodinGame: 1 2 3 o

References

[1] Jack Parker-Holder, Vu Nguyen and Stephen J. Roberts: Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits, 2020. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). link

Dataset References

[2] Andrew G. Barto; Richard S. Sutton and Charles W. Anderson: Neuronlike adaptive elements that can solve difficult learning control problems, 1983. IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13 (5), pp. 834--846, doi: 10.1109/TSMC.1983.6313077. link

[3] Oleg Klimov: Lunar Lander. link