Homework 9
In the PBT paper [1], in reinforcement learning tasks, instead of Welch's \(t\)-test, they use the following exploitation method: they replace the worst 20% of the population with copies of the best 20%, while perturbing the hyperparameters. Let's see for ourselves how the two methods compare when using REINFORCE. Since, as I said in Lecture 3/28, Cart Pole [2] is too easy a task to benchmark most Deep Reinforcement Learning algorithms, we'll use Lunar Lander [3] -- be sure to check the generated videos!
-
To install the Lunar Lander environment (while keeping the previously installed Cart Pole), use the following two installation steps:
pip install swig pip install gymnasium[box2d,toy-text] -
To compare the two exploitation methods on Lunar Lander, make the following modifications to Notebook 0328:
-
Change the following in the configuration dictionary:
- Change
"env_id"to"LunarLander-v3". - Create a new key
"exploit_method"that lets us switch between the two methods"truncation"and"welch". The initial value is irrelevant as we'll try out both in the assignment. - Create a new key
"truncate_proportion"that controls what proportion of worst performing population members are replaced by copies of the best performing population members. In the paper, this is set to.2; feel free to check out another value. - Optionally, lower
"welch_confidence_level". I ran a small number of experiments and I got the best results for 70%-90%.
- Change
-
We start the function
pbt_update, as given in Notebook 0328, by getting the pairsource_mask, target_indicesthat determines exploitation.-
Refactor getting this pair into a function call to a new function
def exploit_welch( config: dict, evaluations: torch.Tensor ) -> tuple[torch.Tensor, torch.Tensor]: """ Exploitation in Population-Based Training using Welch's t-test. Parameters ---------- config : `dict` Configuration dictionary. Required key-value pair: `"welch_confidence_level"` : `float` Confidence level to use in Welch's t-test. evaluations : `torch.Tensor` Tensor of evaluations. We assume it has shape `(sample_size, population_size)`. Algorithm --------- 1. Sample a tensor `target_indices` of shape `(population_size,)` of population member indices. 2. Create a tensor `source_mask` of shape `(population_size,)` that says for which population member index `i` cannot Welch's t-test with `config['welch_confidence_level']` refute the hypothesis that the evaluations coming from `target_indices[i]` can be expected to be better that the evaluations coming from `i`, based on the appropriate subtensors of `evaluations` as samples. Returns ------- The pair `source_indices, target_indices`, where 1. `source_indices` is the tensor of indices where `source_mask` is `True`. 2. `target_indices` is the collection of target indices as defined above masked by the source mask. """Please note that most of the function is already written in
pbt_updatein Notebook 0328. You only need to refactor the code and change the output. -
Delete the part of
pbt_updatewhere you update thelogdictionary items"source_mask"and"target_indices". (In the new setup, you can getsource_indicesandtarget_indicesof varying size, which thus can't be stacked.) -
Adapt the rest of
pbt_updateto the change from usingsource_mask, target_indicestosource_indices, target_indices, wheretarget_indicesnow only contains the target indices atsource_indices.
-
-
Give a switch based on
config["exploit_method"]topbt_updatesuch that if that is notwelchbuttruncation, then the pairsource_indices, target_indicesshould be gotten from the following function:def exploit_truncation( config: dict, evaluations: torch.Tensor ) -> tuple[torch.Tensor, torch.Tensor]: """ Exploitation in Population-Based Training using trunction. Parameters ---------- config : `dict` Configuration dictionary. Required key-value pair: `'truncate_proportion'` : `float` Proportion of population members with worst estimated performance to be replaced by population members with best estimated performance. evaluations : `torch.Tensor` Tensor of evaluations. We assume it has shape `(sample_size, population_size)`. Algorithm --------- 1. Rank the population members based on the averages in `evaluations`. 2. Based on this ranking, select the `config['truncate_proportion'`]` of `population_size`, rounded to the nearest integer, of population members with the worst performance for replacement by the same number of population members with the best performance. Returns ------- The pair `source_indices, target_indices`, where 1. `source_indices` is the tensor of population member indices to be replaced, and 2. `target_indices` is the tensor of population member indices to serve as replacement. """Please note that in this case, you need to write this function.
-
-
Make two training runs, one with exploitation by truncation, another with exploitation by Welch's \(t\)-test. Compare the best performing population members per evaluation:
- You can use the function
reinforcein Notebook 0328 for this, once you updatedpbt_updateas above. - For the two runs, adjust
config["exploit_method"]accordingly. - The function
reinforceoutputs a log dictionary. At its"evaluations"key, it has a tensor of shape(evaluation_num, sample_size, population_size)that stores evaluations. You can take its average along the sample dimension, and then the maximum along the population dimension, to get a vector of best average evaluations per evaluation step. - Get this vector for both exploitation methods and compare them by making one line plot of each on the same canvas.
- You can use the function
Note
You can find more involved versions of the landing problem on the coding puzzle and bot competition website CodinGame: 1 2 3 o
References
[1] Jack Parker-Holder, Vu Nguyen and Stephen J. Roberts: Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits, 2020. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). link
Dataset References
[2] Andrew G. Barto; Richard S. Sutton and Charles W. Anderson: Neuronlike adaptive elements that can solve difficult learning control problems, 1983. IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13 (5), pp. 834--846, doi: 10.1109/TSMC.1983.6313077. link
[3] Oleg Klimov: Lunar Lander. link