2/26 Early Stopping and Grid Search

Today, we'll learn a new stopping condition and the simplest hyperparameter optimization method.

Selecting the Best Model

During training, performance can deteriorate after reaching its best score. A common reason for this is overfitting, which we'll investigate later. For this reason, it is best to keep not the model we have at the end of training, but the one with the best validation result.

Here's how to keep, for each ensemble member, the model parameters with the best validation score seen so far. Let $E$ denote the set of ensemble indices. That is, if ensemble_shape=(n_1,...,n_l) then $E=[n_1]\times\dotsb\times[n_l]$, where $[n]=\{0,\dotsc,n-1\}$. Moreover, let us assume that a greater metric score means better model performance. Otherwise, please adjust accordingly.

We can initialize the following:
1. A collection of best scores $(s'_e:e\in E)$, each initialized to a value at most the worst possible evaluation score. For example, if the metric is accuracy, any nonpositive number works.
2. A collection of best parameters $(\theta'_e:e\in E)$, initialized arbitrarily.
At each validation:
1. We get a collection of new scores $(s_e:e\in E)$ and current parameters $(\theta_e:e\in E)$.
2. We want to update the best scores and parameters where $s_e$ is better than $s'_e$.
3. To account for floating point errors, we only count $s_e$ as better than $s'_e$ if it exceeds $s'_e$ by more than some improvement threshold $\delta>0$, that is, if $s_e > s'_e + \delta$: $$ s'_e\leftarrow s_e\text{ and }\theta'_e\leftarrow\theta_e \text{ for }e\in E\text{ such that }s_e > s'_e + \delta $$
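
The update rule above can be sketched in torch as follows. This is a minimal sketch under assumptions that the notes do not fix: the function name update_best is hypothetical, parameters are assumed to be stored in a dict of tensors whose leading dimensions are ensemble_shape, and the default value of delta is arbitrary.

```python
import torch


def update_best(best_scores, best_params, scores, params, delta=1e-6):
    """Update best scores and parameters in place where the scores improved.

    best_scores, scores: tensors of shape ensemble_shape.
    best_params, params: dicts mapping (hypothetical) parameter names to
        tensors of shape ensemble_shape + parameter_shape.
    Returns the Boolean mask of improved ensemble members.
    """
    improved = scores > best_scores + delta  # s_e > s'_e + delta, elementwise
    with torch.no_grad():  # plain copies; autograd should not track them
        best_scores[improved] = scores[improved]
        for name, value in params.items():
            best_params[name][improved] = value[improved]
    return improved


ensemble_shape = (2, 3)
# -inf is at most the worst possible score for any metric.
best_scores = torch.full(ensemble_shape, float("-inf"))
best_params = {"weight": torch.zeros(ensemble_shape + (4,))}

# First validation: every ensemble member improves on -inf.
first_params = {"weight": torch.ones(ensemble_shape + (4,))}
first_scores = torch.tensor([[0.5, 0.2, 0.9], [0.1, 0.3, 0.4]])
first_improved = update_best(best_scores, best_params, first_scores, first_params)

# Second validation: only member (0, 1) improves (0.2 -> 0.8).
second_params = {"weight": torch.full(ensemble_shape + (4,), 2.0)}
second_scores = torch.tensor([[0.5, 0.8, 0.7], [0.1, 0.3, 0.4]])
second_improved = update_best(best_scores, best_params, second_scores, second_params)
```

After the second call, only the entries of best_scores and best_params at index (0, 1) have changed; all other ensemble members keep their values from the first validation.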

Early Stopping

Keeping track of when we got an improved model gives rise to a new stopping condition: we can stop training if no ensemble member has improved for a given number $C$ of training steps.

To this end, we can change the best model selector above as follows:

At initialization, also create a counter $c$ for training steps without improvement. Start this at 0.
At validation:
1. If we got improvement for some $e\in E$, then reset $c$ to 0.
2. Otherwise:
  1. Add to $c$ the number of training steps since the last validation.
  2. If $c\ge C$, then stop training.
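
The counter logic above might look like the following in plain Python. This is only a sketch: the class name EarlyStopping, its attribute names, and the validation schedule in the demo (one validation every 100 training steps, $C=300$) are made up for illustration.

```python
class EarlyStopping:
    """Counts training steps without improvement (c) against a limit (C)."""

    def __init__(self, max_steps_without_improvement):
        self.max_steps_without_improvement = max_steps_without_improvement  # C
        self.steps_without_improvement = 0  # c

    def update(self, any_improved, steps_since_last_validation):
        """Call at each validation; returns True if training should stop."""
        if any_improved:
            self.steps_without_improvement = 0
            return False
        self.steps_without_improvement += steps_since_last_validation
        return self.steps_without_improvement >= self.max_steps_without_improvement


# Made-up history: whether some ensemble member improved at each validation.
improvement_history = [True, True, False, False, True, False, False, False, False]
validation_interval = 100

stopper = EarlyStopping(max_steps_without_improvement=300)
stopped_at_step = None
for validation_index, any_improved in enumerate(improvement_history):
    step = (validation_index + 1) * validation_interval
    if stopper.update(any_improved, validation_interval):
        stopped_at_step = step
        break
```

In this run the counter is reset at step 500 and then reaches 300 at step 800, where training stops.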

Masking

For updating best scores and best parameters, we can use masking, wherein we index into a tensor with a Boolean tensor.

Let $[n_1]\times\dotsb\times[n_k]\xrightarrow aK$ be an arbitrary tensor and $[n_1]\times\dotsb\times[n_l]\xrightarrow b\{\mathtt{True}, \mathtt{False}\}$ a Boolean tensor for some $l\le k$. Let $\mathscr I=(I_j:0\le j<r)$ denote the collection of tuples $I_j=(i_1^{(j)},\dotsc,i_l^{(j)})\in[n_1]\times\dotsb\times[n_l]$ such that $b[i_1^{(j)},\dotsc,i_l^{(j)}]=\mathtt{True}$, listed in lexicographic order. Then $a[b]$ is the composite $$ [r]\times[n_{l+1}]\times\dotsb\times[n_k] \xrightarrow{(j,i_{l+1},\dotsc,i_k)\mapsto(i_1^{(j)},\dotsc,i_l^{(j)},i_{l+1},\dotsc,i_k)} [n_1]\times\dotsb\times[n_k]\xrightarrow aK. $$ That is, in its first dimension, $a[b]$ lists the subtensors of $a$ that $b$ selects.

In torch, we can get $r$ as b.sum() and $\mathscr I$ as b.nonzero(), the latter as a tensor of shape $(r, l)$ with one row per tuple $I_j$.
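
To make the definition concrete, here is a small torch example with $k=3$ and $l=2$. The tensors are made up for illustration; the last step rebuilds $a[b]$ by hand from the index tuples to check it against the definition.

```python
import torch

# a has shape (2, 3, 4), so k = 3; b has shape (2, 3), so l = 2.
a = torch.arange(24).reshape(2, 3, 4)
b = torch.tensor([[True, False, True],
                  [False, False, True]])

r = int(b.sum())       # number of True entries in b
indices = b.nonzero()  # shape (r, l); row j is the tuple I_j
selected = a[b]        # shape (r, 4); row j is the subtensor a[I_j]

# Rebuild a[b] by hand: stack a[i_1, i_2] for each tuple (i_1, i_2).
manual = torch.stack([a[i, j] for i, j in indices.tolist()])
```

Here indices lists the tuples (0, 0), (0, 2), (1, 2), and selected has shape (3, 4), matching $[r]\times[n_3]$ in the composite above.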

Info

Follow this link for more information on masking:
https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing

Grid Search

A simple hyperparameter search option that we'll use today is Grid Search. In it:

1. We select a finite number of choices to try out for each of a number of hyperparameters.
2. We evaluate each combination of the choices.

For example, in the lab today, we'll try out:

1. the learning rates $\{10^i:i=-2, -1.5,\dotsc,1\}$ and
2. the minibatch sizes $\{2^i:i=3,4,\dotsc,10\}$.

That is, we'll try out $7\cdot8=56$ hyperparameter configurations.
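
A grid search over these choices could be sketched with itertools.product as below. The function grid_search and the scoring function toy_score are hypothetical: toy_score is a made-up stand-in for training a model with the given configuration and returning its validation score.

```python
import itertools
import math


def grid_search(evaluate, choices):
    """Evaluate every combination in `choices`; return (best_config, best_score)."""
    names = list(choices)
    best_config, best_score = None, -math.inf
    for values in itertools.product(*(choices[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(**config)
        if score > best_score:  # greater score is assumed to be better
            best_config, best_score = config, score
    return best_config, best_score


choices = {
    "learning_rate": [10 ** (-2 + 0.5 * i) for i in range(7)],  # 10^-2, ..., 10^1
    "batch_size": [2 ** i for i in range(3, 11)],               # 2^3, ..., 2^10
}
num_configs = math.prod(len(values) for values in choices.values())


def toy_score(learning_rate, batch_size):
    # Made-up score, not a real training run: peaks at lr = 0.1, batch size = 32.
    return -(math.log10(learning_rate) + 1) ** 2 - (math.log2(batch_size) - 5) ** 2


best_config, best_score = grid_search(toy_score, choices)
```

The number of evaluations is the product of the numbers of choices per hyperparameter, which is why grid search gets expensive quickly as hyperparameters are added.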