Homework 7

A simple regularization method we'll try out now is weight decay: we penalize large parameters. Formally, this means adding a term $\frac{\lambda}{2}\|\theta\|^2$ to the loss function $L$, where $\lambda$ is a hyperparameter, usually itself called the weight decay, and $\|\theta\|^2$ is the sum of the squares of all parameter components. In implementation, it is easier to leave the loss unchanged and instead replace the gradient descent step

$$ \theta\leftarrow\theta - \eta\nabla_\theta L, $$

where $\eta$ is the learning rate, by

\[ \theta\leftarrow\theta - \eta(\nabla_\theta L + \lambda \theta). \]
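
To see where the extra term comes from, differentiate the penalty componentwise. This short derivation writes the penalty with the conventional factor $\tfrac{1}{2}$ so that the constant 2 cancels; that factor is an assumption of the sketch, and any other constant can be absorbed into $\lambda$, which is tuned anyway. $L$ denotes the unpenalized loss.

$$ \frac{\partial}{\partial\theta_i}\,\frac{\lambda}{2}\sum_j \theta_j^2 = \lambda\theta_i \quad\Longrightarrow\quad \nabla_\theta\Big(L + \frac{\lambda}{2}\|\theta\|^2\Big) = \nabla_\theta L + \lambda\theta. $$

Equivalently, the step can be read as $\theta\leftarrow(1-\eta\lambda)\theta - \eta\nabla_\theta L$: each update first shrinks the parameters by the factor $1-\eta\lambda$, which is where the name weight decay comes from.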

In the function pbt you wrote in Notebook 0321, update the gradient descent step accordingly. For each parameter, reshape the weight decay tensor so that it broadcasts against the parameter, as you did for the learning rate tensor.
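
The reshaping can be sketched as follows. This is a minimal illustration, not the pbt function from Notebook 0321: it uses NumPy in place of the notebook's tensor library, and it assumes that every parameter array has the ensemble member as its leading axis and that the learning rate and weight decay are arrays with one entry per member. The function name `sgd_step_with_weight_decay` is hypothetical.

```python
import numpy as np


def sgd_step_with_weight_decay(params, grads, learning_rate, weight_decay):
    """Return updated parameters after one step of
    theta <- theta - eta * (grad + lambda * theta).

    params, grads: lists of arrays; axis 0 indexes the ensemble member
                   (an assumption of this sketch).
    learning_rate, weight_decay: arrays of shape (n_models,).
    """
    updated = []
    for theta, grad in zip(params, grads):
        # Reshape (n_models,) to (n_models, 1, ..., 1) so that each model's
        # hyperparameter broadcasts over all of that model's entries.
        shape = (-1,) + (1,) * (theta.ndim - 1)
        eta = learning_rate.reshape(shape)
        lam = weight_decay.reshape(shape)
        updated.append(theta - eta * (grad + lam * theta))
    return updated


# Two models, one weight array and one bias array, zero gradients.
params = [np.ones((2, 3, 4)), np.ones((2, 4))]
grads = [np.zeros((2, 3, 4)), np.zeros((2, 4))]
learning_rate = np.array([0.1, 0.5])
weight_decay = np.array([0.0, 1.0])

new_params = sgd_step_with_weight_decay(params, grads, learning_rate, weight_decay)
```

With zero gradients, the first model (weight decay 0) keeps its parameters, while the second is shrunk by the factor $1-\eta\lambda$.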
Run an experiment to measure the effect of weight decay:
1. Update the configuration dictionary: add a new key-value pair to each of the hyperparameter_raw_init_distributions, hyperparameter_raw_perturb and hyperparameter_transforms dictionaries, with key weight_decay and the same value as the corresponding learning rate entry (we don't know what magnitude to use, so we just let PBT find it out).
2. Load the MNIST dataset and normalize its features, just like in Notebook 0321.
3. Create an ensemble of ReLU MLPs with 3 hidden layers of 128 units each.
4. Run the updated training loop.
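
The configuration update in step 1 might look like the sketch below. The three dictionary names and the key weight_decay come from the task; the learning rate key `learning_rate`, the helper name `add_weight_decay` and the placeholder values are assumptions, so substitute whatever your configuration from Notebook 0321 actually contains.

```python
def add_weight_decay(config, source_key="learning_rate", new_key="weight_decay"):
    """Return a copy of config in which each hyperparameter dictionary
    has a new_key entry with the same value as its source_key entry."""
    sections = (
        "hyperparameter_raw_init_distributions",
        "hyperparameter_raw_perturb",
        "hyperparameter_transforms",
    )
    updated = dict(config)
    for section in sections:
        entries = dict(config[section])  # copy, so the input stays untouched
        entries[new_key] = entries[source_key]
        updated[section] = entries
    return updated


# Placeholder values only; the real entries are distributions and functions.
config = {
    "hyperparameter_raw_init_distributions": {"learning_rate": "init-distribution"},
    "hyperparameter_raw_perturb": {"learning_rate": "perturbation"},
    "hyperparameter_transforms": {"learning_rate": "transform"},
}
new_config = add_weight_decay(config)
```
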
Report experiment results:
1. Print the best accuracy you saw during training.
2. Plot the schedules PBT finds for the learning rate and the weight decay with confidence bands.
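
One possible way to produce the report is sketched below. It assumes the training loop logged accuracies and hyperparameter values as arrays of shape (steps, ensemble members), and it takes the confidence band to be the mean plus or minus 1.96 standard errors across members; both are assumptions, so use the logging format and band definition from Notebook 0321 if they differ. All function names are hypothetical.

```python
import numpy as np


def best_accuracy(accuracy_history):
    """Largest accuracy over all steps and ensemble members."""
    return float(np.max(accuracy_history))


def schedule_band(history, z=1.96):
    """Mean and lower/upper band per step for a (steps, members) array."""
    history = np.asarray(history, dtype=float)
    mean = history.mean(axis=1)
    # Standard error of the mean across ensemble members.
    sem = history.std(axis=1, ddof=1) / np.sqrt(history.shape[1])
    return mean, mean - z * sem, mean + z * sem


def plot_schedule(history, label, ax=None):
    """Plot the mean schedule with a shaded band around it."""
    # Imported here so the helpers above work without matplotlib installed.
    import matplotlib.pyplot as plt

    mean, lower, upper = schedule_band(history)
    steps = np.arange(len(mean))
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(steps, mean, label=label)
    ax.fill_between(steps, lower, upper, alpha=0.3)
    ax.set_xlabel("PBT step")
    ax.set_ylabel(label)
    ax.legend()
    return ax


# Toy data: 2 steps, 2 ensemble members.
accuracies = np.array([[0.90, 0.95], [0.97, 0.96]])
toy_history = np.array([[1.0, 3.0], [2.0, 2.0]])
best = best_accuracy(accuracies)
mean, lower, upper = schedule_band(toy_history)
```

Calling `plot_schedule` once with the learning rate history and once with the weight decay history gives the two requested plots.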