Homework 8

In this homework, we'll compare L2 regularization with weight decay when each is used with Adam. We'll also try out our MLP + PBT + Adam approach on the emotions-sadness-joy text classification dataset.

Add optional L2 regularization to the Optimizer base class that you wrote in Notebook 0326. The easiest approach is probably to update the gradient of each parameter in place at the beginning of step or _update_parameter.
Recreate the AdamW optimizer class. You can reuse the code you wrote in Notebook 0326. You need to recreate it because its parent class has changed.
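To see how the two options differ, here is a minimal sketch of a single Adam update on NumPy arrays. It is not the Optimizer or AdamW class from Notebook 0326: the function adam_step and the parameter names l2 and weight_decay are hypothetical, and the sketch assumes the L2 penalty is (l2 / 2) * ||param||^2, so that its gradient is l2 * param.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, weight_decay=0.0):
    """One Adam update (hypothetical helper); returns (param, m, v).

    t is the 1-based step count used for bias correction.
    """
    # L2 regularization: the penalty gradient is added to the loss gradient,
    # so it flows through the moment estimates like any other gradient.
    # Assumes the penalty is (l2 / 2) * ||param||^2.
    grad = grad + l2 * param

    # Moving averages of the gradient and of its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias-corrected moment estimates.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Decoupled weight decay: applied directly to the parameter,
    # bypassing the moment estimates and the adaptive scaling.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```

With a zero loss gradient, the L2 path moves the parameter by roughly lr on the first step whatever the value of l2, because Adam normalizes the gradient. The weight decay path instead shrinks the parameter by lr * weight_decay * param.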
Load the dataset and use features derived from word vectors, just like in Notebook 0305. I recommend saving the feature matrices you get from averaging word vectors so that you don't have to rerun preprocessing on every run.
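One possible way to cache the feature matrices is sketched below. The helper cached_features and its arguments are hypothetical and not part of Notebook 0305; compute_fn stands for whatever function builds the averaged word-vector features there.

```python
import os

import numpy as np


def cached_features(path, compute_fn):
    """Load a feature matrix from path, computing and saving it if missing.

    path should end in ".npy", since np.save appends that extension
    when it is absent.
    """
    if os.path.exists(path):
        return np.load(path)
    features = np.asarray(compute_fn())
    np.save(path, features)
    return features
```

The first call runs the preprocessing and writes the file. Later calls with the same path only read it back.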
Perform two training runs of pbt, with population size 64 and validation interval 1000:
1. One should have L2 regularization turned on and weight decay turned off.
2. The other should have L2 regularization turned off and weight decay turned on.
After each run:
1. Print the best validation binary accuracy you got.
2. Make line plots with confidence bands of:
  1. log10 of 1 minus the first and second moment moving average decay rates (Adam's beta1 and beta2) and
  2. log10 of the learning rate, epsilon and the L2 regularization coefficient or the weight decay rate, depending on the run.
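
A line plot with a confidence band can be sketched as follows. The assignment does not say which kind of band to use, so this sketch assumes a normal-approximation band of the mean plus or minus 1.96 standard errors across the population. The helpers confidence_band and plot_band are hypothetical, and values is assumed to be an array of shape (number of validation checkpoints, population size) holding the already transformed hyperparameter.

```python
import numpy as np


def confidence_band(values, z=1.96):
    """Return (mean, lower, upper) across the population at each checkpoint.

    values has shape (n_checkpoints, population_size). The band is
    mean +/- z standard errors, which is an assumed choice.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=1)
    sem = values.std(axis=1, ddof=1) / np.sqrt(values.shape[1])
    return mean, mean - z * sem, mean + z * sem


def plot_band(ax, steps, values, label):
    """Draw the mean line and a shaded band on a matplotlib Axes."""
    mean, lower, upper = confidence_band(values)
    ax.plot(steps, mean, label=label)
    ax.fill_between(steps, lower, upper, alpha=0.3)
    ax.legend()
```

For example, given an axis from matplotlib's plt.subplots(), calling plot_band(ax, steps, np.log10(lr_history), "log10 learning rate") would draw the learning rate curve, where lr_history is a hypothetical array of recorded learning rates.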