3/21 Population-Based Training

We have two sorts of hyperparameters:

Those that affect the initial model, such as
1. Architecture: number of layers, layer widths, ...
2. Initialization: distribution type, distribution parameters
Those that affect training, such as
1. Learning rate
2. Batch size
3. Weight decay, to be introduced in Homework 7.
Moreover, these latter hyperparameters can be scheduled¹: instead of being kept constant, they are varied over the course of training, as in Homework 6. The schedule itself then comes with its own hyperparameters.

Hyperparameter Optimization

As the models and the training process get more sophisticated, it becomes infeasible to test all hyperparameter combinations with a grid search. Instead, we can use hyperparameter optimization algorithms, which try to restrict the search to the more promising regions of the hyperparameter space.

Offline Hyperparameter Optimization

One type of hyperparameter optimization is to repeat the following steps:

Sample a hyperparameter combination.
Run training with it.
Use the results to update the sampling distribution.
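
The three steps above can be sketched as a loop. This is only an illustration under stated assumptions, not Bayesian Optimization: the toy loss `theta**2`, the Gaussian sampling distribution over the raw learning rate, and the rule of refitting it to the best trials so far are all hypothetical choices made to keep the example short.

```python
import random
import statistics

def final_loss(raw_lr, steps=20):
    """'Training': gradient descent on the toy loss theta**2 (an assumption)."""
    theta, lr = 1.0, 10.0 ** raw_lr
    for _ in range(steps):
        theta -= lr * 2.0 * theta
    return theta ** 2

def offline_search(trials=30, elite_size=5, seed=0):
    rng = random.Random(seed)
    mean, std = -2.5, 1.5  # initial sampling distribution over the raw learning rate
    history = []           # (loss, raw_lr) pairs of finished trainings
    for _ in range(trials):
        # 1. Sample a hyperparameter (clipped to the range [-5, 0]).
        raw = min(max(rng.gauss(mean, std), -5.0), 0.0)
        # 2. Run a full training with it.
        history.append((final_loss(raw), raw))
        # 3. Update the sampling distribution from the best trials so far.
        if len(history) >= elite_size:
            elite = [r for _, r in sorted(history)[:elite_size]]
            mean = statistics.mean(elite)
            std = max(statistics.stdev(elite), 0.05)  # keep some exploration
    return min(history)

best_loss, best_raw_lr = offline_search()
```

Note that each trial has to finish before the next sample is drawn, which is what makes this kind of method sequential.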

For example, one can use Bayesian Optimization (BO) for this. See [1] for an early example and [2] for a later, more refined one. One can also combine BO with local search, see e.g. [3].

Online Hyperparameter Optimization

An online hyperparameter optimization method adjusts the hyperparameters while training is in progress. Thus, for example, it can find hyperparameter schedules on its own. Today we will discuss one such algorithm in detail: Population-Based Training (PBT) [4]. PBT can also be combined with BO [5].

Pruning

A middle ground between online and offline hyperparameter optimization is to

start experiments with many hyperparameter configuration samples, then
periodically shut down the ones that don't show promise in comparison to the others.

A simple such idea is Hyperband [6]. One can combine this with BO to resample hyperparameters based on our growing knowledge about them, see [7] and [8].
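
The pruning idea can be illustrated with successive halving, the subroutine that Hyperband builds on. The sketch below is not Hyperband itself: the toy loss `theta**2`, the halving rate, and the budget doubling are assumptions chosen for brevity.

```python
import random

def train_for(theta, lr, steps):
    """Continue gradient descent on the toy loss theta**2 for `steps` steps."""
    for _ in range(steps):
        theta -= lr * 2.0 * theta
    return theta

def successive_halving(n_configs=8, budget=2, seed=0):
    rng = random.Random(seed)
    # Start experiments with many sampled learning rates.
    runs = [{"lr": 10.0 ** rng.uniform(-5.0, 0.0), "theta": 1.0}
            for _ in range(n_configs)]
    while len(runs) > 1:
        for run in runs:
            run["theta"] = train_for(run["theta"], run["lr"], budget)
        runs.sort(key=lambda r: r["theta"] ** 2)  # lowest toy loss first
        runs = runs[: len(runs) // 2]             # shut down the worse half
        budget *= 2                               # survivors train longer
    return runs[0]

winner = successive_halving()
```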

Details of PBT

General Idea

The general idea of PBT is to

run a population of parallel training processes and
periodically replace the worst samples with copies of the best ones, while
perturbing the hyperparameters.

The main advantages of this approach are:

It is highly parallelizable. This is in contrast to offline methods, where you have to wait for the previous training to finish before you can make a better-informed next sample.
It finds hyperparameter schedules on its own. This removes the need to tune the hyperparameters of a schedule separately.
By evolving the better-performing models, we also attenuate the dependence on lucky initializations.
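
The general idea can be sketched on a toy problem as follows. This is a minimal sketch, not the method of [4]: the toy loss `theta**2`, the rule of replacing the worst quarter with copies of the best quarter, the Gaussian noise scale, and all names are assumptions made for illustration.

```python
import random

def train_step(theta, lr):
    """One gradient-descent step on the toy loss theta**2."""
    return theta - lr * 2.0 * theta

def score(theta):
    """Higher is better: the negative toy loss."""
    return -theta ** 2

def pbt(pop_size=8, steps=50, ready_every=5, seed=0):
    rng = random.Random(seed)
    # Each member holds model weights (theta) and a raw learning rate.
    population = [{"theta": 1.0, "raw_lr": rng.uniform(-5.0, 0.0)}
                  for _ in range(pop_size)]
    for step in range(1, steps + 1):
        # These updates are independent, so they could run in parallel.
        for m in population:
            m["theta"] = train_step(m["theta"], 10.0 ** m["raw_lr"])
        if step % ready_every == 0:
            ranked = sorted(population, key=lambda m: score(m["theta"]),
                            reverse=True)
            cut = max(1, pop_size // 4)
            for loser, winner in zip(ranked[-cut:], ranked[:cut]):
                # Exploit: copy the weights of a better member.
                loser["theta"] = winner["theta"]
                # Explore: copy its raw hyperparameter and perturb it.
                noisy = winner["raw_lr"] + rng.gauss(0.0, 0.2)
                loser["raw_lr"] = min(max(noisy, -5.0), 0.0)
    return max(population, key=lambda m: score(m["theta"]))

best = pbt()
```

Because the learning rate of a surviving lineage can change every `ready_every` steps, the sequence of values it goes through acts as a schedule found during training.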

Configuration Space, Initial Hyperparameters

First, we need to select a configuration space for the hyperparameters we wish to tune. We will sample the hyperparameters for the initial population from this.

Today, we'll only tune the learning rate. As we don't know if this should be \(10^{-5}\) or \(1\), we will let the initial distribution be \(10^{\mathscr U([-5, 0])}\), that is, we sample uniformly from \([-5, 0]\) and exponentiate with base 10.

We separate the two steps: we sample raw hyperparameters from \(\mathscr U([-5, 0])\), then we apply the hyperparameter transform \(x\mapsto 10^x\).
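
A minimal sketch of these two steps, using only the standard library. The function names and the population size of 8 are assumptions for illustration.

```python
import random

def sample_raw(rng, low=-5.0, high=0.0):
    """Draw a raw hyperparameter uniformly from [low, high]."""
    return rng.uniform(low, high)

def transform(raw):
    """Hyperparameter transform x -> 10**x."""
    return 10.0 ** raw

rng = random.Random(0)  # fixed seed for reproducibility
raw_population = [sample_raw(rng) for _ in range(8)]
learning_rates = [transform(x) for x in raw_population]
```

Sampling in the raw space gives every order of magnitude between \(10^{-5}\) and \(1\) the same probability, which sampling the learning rate uniformly from \([10^{-5}, 1]\) would not.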

Exploitation by Welch's \(t\)-Test

We exploit better partial results by periodically replacing worse-performing models with better ones. How to do this is an important heuristic choice: here too, you have to balance training speed against the risk of getting stuck in local optima.

In the paper, they use the following method in case of supervised learning:

For each population member, we randomly choose another member to compare it to.
We apply Welch's \(t\)-test [9] to the last 10 evaluations of the two members, to see if the latter one can be expected to be better.
If the test indicates that the comparison member is better, we replace the member with a copy of the comparison member.

We give the formula for Welch's \(t\)-test in the lab.
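
For orientation, the standard form of the statistic is \(t = (\bar x_1 - \bar x_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}\), where \(\bar x_i\), \(s_i^2\) and \(n_i\) are the sample mean, sample variance and size of each group. The sketch below computes it together with the Welch-Satterthwaite degrees of freedom. The function name is an assumption, and the code assumes at least two evaluations per member and that not both samples are constant.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Squared standard errors of the two sample means.
    se2_a = statistics.variance(a) / n_a
    se2_b = statistics.variance(b) / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

Turning `t` and `df` into a p-value requires the CDF of Student's \(t\)-distribution, which the standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test.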

Exploration by Perturbation

We explore hyperparameter choices by perturbing hyperparameters: when a worse-performing member is replaced with a copy of a better one, the hyperparameters of the copy are perturbed. In the paper, they suggest multiplying each hyperparameter by a random choice of 0.8 or 1.2. I find this approach too rigid, so we will instead add random noise to the raw hyperparameters.
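
The two perturbation rules can be compared in a short sketch. The Gaussian noise, its scale of 0.2, and the clipping to the configuration space \([-5, 0]\) are assumptions, since the text above does not fix the noise distribution.

```python
import random

def perturb_paper(value, rng):
    """Rule suggested in the paper: multiply by 0.8 or 1.2 at random."""
    return value * rng.choice([0.8, 1.2])

def perturb_raw(raw, rng, scale=0.2, low=-5.0, high=0.0):
    """Add Gaussian noise to a raw hyperparameter (assumed noise model),
    then clip the result back into [low, high]."""
    noisy = raw + rng.gauss(0.0, scale)
    return min(max(noisy, low), high)

rng = random.Random(0)
new_raw = perturb_raw(-3.0, rng)  # perturbed exponent
new_lr = 10.0 ** new_raw          # learning rate actually used in training
```

With the multiplicative rule, a learning rate can only move to one of two values per perturbation, while additive noise in the raw space allows a continuous range of moves.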

References

[1] James Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl: Algorithms for Hyper-Parameter Optimization, 2011. Advances in Neural Information Processing Systems 24 (NIPS 2011). link

[2] Alexander I. Cowen-Rivers, Wenlong Lyu, Rasul Tutunov, Zhi Wang, Antoine Grosnit, Ryan Rhys Griffiths, Alexandre Max Maraval, Hao Jianye, Jun Wang, Jan Peters and Haitham Bou Ammar: HEBO: Pushing The Limits of Sample-Efficient Hyper-parameter Optimisation, 2022. Journal of Artificial Intelligence Research (JAIR), Volume 74, pp. 1269--1349. link

[3] Chi Wang, Qingyun Wu, Silu Huang and Amin Saied: Economic Hyperparameter Optimization With Blended Search Strategy, 2021. International Conference on Learning Representations (ICLR) 2021. link

[4] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando and Koray Kavukcuoglu: Population Based Training of Neural Networks, 2017. link

[5] Jack Parker-Holder, Vu Nguyen and Stephen J. Roberts: Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits, 2020. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). link

[6] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh and Ameet Talwalkar: Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization, 2018. Journal of Machine Learning Research (JMLR), Volume 18 (185), pp. 1--52. link

[7] Stefan Falkner, Aaron Klein and Frank Hutter: BOHB: Robust and Efficient Hyperparameter Optimization at Scale, 2018. Proceedings of the 35th International Conference on Machine Learning, (PMLR) Volume 80, pp. 1437--1446. link

[8] Marius Lindauer, Katharina Eggensperger, Matthias Feurer, André Biedenkapp, Difan Deng, Carolin Benjamins, Tim Ruhkopf, René Sass and Frank Hutter: SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization, 2022. Journal of Machine Learning Research (JMLR), Volume 23 (54), pp. 1--9. link

[9] Bernard Lewis Welch: The generalisation of `Student's' problems when several different population variances are involved, 1947. Biometrika, Volume 34 (1--2), pp. 28--35. link

¹ Batch size is usually not scheduled, but set to a value that makes use of all available GPU memory.