Homework 7
A simple regularization method we'll try out now is weight decay: we penalize large parameters. Formally, this means adding a term \(\lambda\|\theta\|^2\) to the loss function, where \(\lambda\) is a hyperparameter, itself usually called the weight decay, and \(\|\theta\|^2\) is the sum of the squared components of the parameters. In the implementation, it is easier to replace the gradient descent step
$$ \theta\leftarrow\theta - \eta\nabla_\theta, $$ where \(\eta\) is the learning rate and \(\nabla_\theta\) is the gradient of the loss with respect to \(\theta\), by
\[
\theta\leftarrow\theta - \eta(\nabla_\theta + \lambda \theta).
\]
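The modified step can be sketched as follows. This is a minimal illustration, not the `pbt` function from Notebook 0321: it uses NumPy, and it assumes that each parameter array carries a leading axis indexing the ensemble members and that the learning rate and weight decay are given as one value per member. The name `decayed_step` and the array layout are hypothetical. Note that differentiating \(\lambda\|\theta\|^2\) gives \(2\lambda\theta\); the factor 2 is simply absorbed into \(\lambda\) in the update above.

```python
import numpy as np


def decayed_step(param, grad, lr, weight_decay):
    """One gradient descent step with weight decay for an ensemble.

    param, grad: arrays of shape (n_models, ...), one slice per member
        (assumed layout, for illustration only).
    lr, weight_decay: arrays of shape (n_models,), one value per member.
    """
    # Reshape (n_models,) to (n_models, 1, ..., 1) so that each member's
    # hyperparameters broadcast over that member's own parameter slice.
    shape = (-1,) + (1,) * (param.ndim - 1)
    lr = lr.reshape(shape)
    weight_decay = weight_decay.reshape(shape)
    # theta <- theta - eta * (grad + lambda * theta)
    return param - lr * (grad + weight_decay * param)


# Two ensemble members, each with a 2x3 weight matrix.
param = np.ones((2, 2, 3))
grad = np.zeros((2, 2, 3))           # zero gradient isolates the decay term
lr = np.array([0.1, 0.5])
weight_decay = np.array([0.0, 0.2])

new_param = decayed_step(param, grad, lr, weight_decay)
```

With a zero gradient, the first member (no weight decay) is left unchanged, while the second member's weights shrink toward zero, which is the effect the penalty is meant to have.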
- In the function `pbt` you wrote in Notebook 0321, update the gradient descent step. For each parameter, reshape the weight decay tensor, as you did for the learning rate tensor.
- Run an experiment to measure the effect of weight decay:
  - Update the configuration dictionary: add to the `hyperparameter_raw_init_distributions`, `hyperparameter_raw_perturb` and `hyperparameter_transforms` dictionaries a new key-value pair each, with key `weight_decay` and values the same as those for the learning rate (we don't know what magnitude to use, so we just let PBT find it out).
  - Load the MNIST dataset and normalize its features, just like in Notebook 0321.
  - Create an ensemble of ReLU MLPs of 3 hidden layers with 128 dimensions each.
  - Run the updated training loop.
- Report experiment results:
  - Print the best accuracy you saw during training.
  - Plot the schedules PBT finds for the learning rate and the weight decay with confidence bands.
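
For the last plot, one possible way to compute the band is sketched below. It assumes that the hyperparameter values were logged as one list per training step, with one entry per ensemble member, and that the band is the mean plus or minus one sample standard deviation across members. Both the logging format and the band definition are assumptions, and the names `schedule_band` and `history` are hypothetical; if Notebook 0321 defines confidence bands differently, follow that definition instead.

```python
from statistics import mean, stdev


def schedule_band(history):
    """Summarize a hyperparameter schedule across ensemble members.

    history[t][m] is the value used by member m at step t (assumed format).
    Returns three lists of length len(history): means, lower and upper
    band edges, where the band is mean +/- one sample standard deviation.
    """
    means = [mean(step) for step in history]
    # stdev needs at least two values; fall back to a zero-width band.
    spreads = [stdev(step) if len(step) > 1 else 0.0 for step in history]
    lowers = [m - s for m, s in zip(means, spreads)]
    uppers = [m + s for m, s in zip(means, spreads)]
    return means, lowers, uppers


# Toy log: two steps, two ensemble members.
history = [[0.1, 0.3], [0.2, 0.2]]
means, lowers, uppers = schedule_band(history)
```

The three lists can then be drawn with Matplotlib, for example `plt.plot(steps, means)` for the schedule and `plt.fill_between(steps, lowers, uppers, alpha=0.3)` for the band, once for the learning rate and once for the weight decay.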