2/14 Multinomial Logistic Regression II

Dataset: MNIST

The MNIST dataset is a collection of handwritten digits.

The features are the image pixels. Each image is a $28\times28$ grayscale image with 8-bit intensity resolution. That is, there are $28\cdot28=784$ features, each of which is an integer between 0 and 255 (inclusive).
The targets are the digits. That is, we have 10 classes and thus $\mathscr Y=\{0,\dotsc,9\}=[10]$.
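To make the feature layout concrete, here is a minimal sketch that flattens one image into a feature vector. The image is a random stand-in rather than a real MNIST digit, and the names `image`, `x` and `x_scaled` are made up for illustration. Scaling to $[0,1]$ is a common preprocessing choice, not something these notes prescribe.

```python
import numpy as np

# Hypothetical stand-in for one MNIST image: 28x28 array of 8-bit intensities.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten the 2D pixel grid into a feature vector of length 28 * 28 = 784.
x = image.reshape(-1)

# Optional (assumption): rescale the integer intensities to floats in [0, 1].
x_scaled = x.astype(np.float32) / 255.0

print(x.shape, x.dtype, x_scaled.dtype)
```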

Loss: Negative Log Likelihood

We discussed previously that to measure the performance of our model, we use accuracy. As our model outputs class probabilities, to get the predicted class, we can take the class index of the highest probability:

\[ \text{predicted class} = \mathrm{argmax}\,p_\theta(\mathbf x)=j \quad\text{such that}\quad p_\theta(\mathbf x)_j=\max_{0\le k\le c-1}p_\theta(\mathbf x)_k. \]

Note that to get the argmax, we need not take softmax:

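$$ \mathrm{argmax}\,p_\theta(\mathbf x)=\mathrm{argmax}\,(\mathbf x^TW + \mathbf b). $$

This holds because softmax preserves the order of its inputs: the exponential is strictly increasing, and the normalization divides every entry by the same positive constant.

If we let negative accuracy be the loss function, then, as we evaluate on a finite dataset, we get a locally constant function of the parameters. We can't use gradient descent with such a loss function, as its gradient is zero wherever it is defined.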
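Both claims can be checked numerically. The sketch below uses random stand-ins for the images, labels and parameters (none of it is real MNIST data), and the helper names `softmax` and `accuracy` are chosen here for illustration. It compares the argmax before and after softmax, then nudges the weights by a tiny amount to show that the accuracy does not move.

```python
import numpy as np

def softmax(z):
    # Subtracting the row maximum avoids overflow and leaves the result unchanged.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def accuracy(W, b, X, y):
    # Fraction of entries whose highest-scoring class equals the label.
    return float(((X @ W + b).argmax(axis=1) == y).mean())

rng = np.random.default_rng(0)
n, d, c = 8, 784, 10                  # toy batch size; d and c match MNIST
X = rng.normal(size=(n, d))           # random stand-in for flattened images
y = rng.integers(0, c, size=n)        # random stand-in for labels
W = 0.01 * rng.normal(size=(d, c))
b = np.zeros(c)

logits = X @ W + b
pred_from_logits = logits.argmax(axis=1)
pred_from_probs = softmax(logits).argmax(axis=1)

# A tiny parameter change leaves every prediction, hence the accuracy, unchanged.
W_nudged = W + 1e-9 * rng.normal(size=W.shape)
acc, acc_nudged = accuracy(W, b, X, y), accuracy(W_nudged, b, X, y)
print(np.array_equal(pred_from_logits, pred_from_probs), acc == acc_nudged)
```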

Part of the problem is that accuracy is insensitive to the predicted probabilities. So we turn to a function that is sensitive to them. A common choice in statistics is the likelihood: the joint probability of the labels in our dataset under our model, assuming the entries are independent:

$$ \mathbf P(\mathscr D_\text{train}|p_\theta)=\prod_{i=0}^{n_\text{train}-1}p_\theta(\mathbf x_i)_{y_i}. $$

Our optimization objective is to maximize this value. Since the logarithm is strictly increasing, applying it does not change the maximizer, and it turns the product into a sum. We get the Log Likelihood:

$$ \mathrm{LL}(\mathscr D_\text{train};\theta) =\sum_{i=0}^{n_\text{train}-1}\log p_\theta(\mathbf x_i)_{y_i}. $$

As this is a maximization problem and we want a minimization problem, we take its negative and thus get the Negative Log Likelihood:

\[ \mathrm{NLL}(\mathscr D_\text{train};\theta)=-\mathrm{LL}(\mathscr D_\text{train};\theta). \]

Accordingly, our loss term for one entry is:

\[ \ell_\mathrm{NLL}(\mathbf x, y;\theta)=-\log p_\theta(\mathbf x)_{y}. \]

In Machine Learning, it is customary to average the losses across a dataset instead of summing them. So we get:

$$ \ell_\mathrm{NLL}(\mathscr D_\text{train}; \theta) =-\frac{1}{n_\text{train}}\sum_{i=0}^{n_\text{train}-1}\log p_\theta(\mathbf x_i)_{y_i}. $$

In Information Theory, this formula computes the cross-entropy between the one-hot target distributions and the predicted distributions, averaged over the dataset. For a nice explanation of what that means, I can recommend Christopher Olah: Visual Information Theory.
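As a small worked example, the averaged loss can be computed from logits as sketched below. The function names `log_softmax` and `mean_nll` are chosen here for illustration, and the two probability rows are made-up numbers. Working with log-softmax directly, instead of taking the log of a softmax, is a common numerical precaution rather than something required by the formula.

```python
import numpy as np

def log_softmax(z):
    # log(softmax(z)) computed stably: shift by the row maximum first.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_nll(logits, y):
    # Pick the log-probability of the true class in each row, then average.
    log_p = log_softmax(logits)
    return float(-log_p[np.arange(len(y)), y].mean())

# Two entries, three classes; the rows are made-up predicted probabilities.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
y = np.array([0, 1])

# softmax(log p) = p whenever p sums to 1, so log(probs) serves as logits.
loss = mean_nll(np.log(probs), y)

# Per-entry losses are -log 0.7 and -log 0.1; the dataset loss is their mean.
print(loss)
```

The second entry, where the true class only gets probability 0.1, contributes far more to the loss than the first, which is exactly the sensitivity to predicted probabilities that accuracy lacks.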

Hyperparameter Optimization and Train-Valid-Test Split

So, let's get optimizing!

Recall that our training procedure has hyperparameters. So far, we've seen two of these: learning rate and number of training steps. How should we set them?

This is a separate optimization problem, called Hyperparameter Optimization (or Tuning). A first idea would be: we input the hyperparameters, train the model on the train set, and finally evaluate the model on the test set. But then how could we tell whether the chosen hyperparameters are merely tailored to the union of the train and test sets, or whether the model trained with them generalizes to unseen data?

For this, we need a third split of the dataset: a validation set. This is the train set for the hyperparameters, so to speak. That is, in the hyperparameter optimization problem:

The input is a selection of hyperparameters.
With these hyperparameters, we train a model on the train set.
The output is the evaluation of the model on the validation set.

Finally, we take the best hyperparameters, that is, those that produced the model with the best validation results, and evaluate that model on the test set. We're not supposed to touch hyperparameters or model parameters afterwards.
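The procedure can be sketched end to end on toy data. Everything below is an illustrative assumption rather than the course's implementation: the data are three synthetic Gaussian blobs instead of MNIST, `train` is a plain gradient descent on the mean NLL of a softmax-regression model, the candidate values are arbitrary, and validation accuracy is used as the validation result. The selection simply tries every combination, which is the grid search idea in its simplest form.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, lr, steps, c):
    # Plain gradient descent on the mean NLL, starting from zero parameters.
    n, d = X.shape
    W, b = np.zeros((d, c)), np.zeros(c)
    Y = np.eye(c)[y]                        # one-hot targets
    for _ in range(steps):
        G = (softmax(X @ W + b) - Y) / n    # gradient w.r.t. the logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def accuracy(params, X, y):
    W, b = params
    return float(((X @ W + b).argmax(axis=1) == y).mean())

# Synthetic stand-in dataset: three well-separated blobs in the plane.
rng = np.random.default_rng(0)
c = 3
centers = np.array([[0.0, 4.0], [4.0, 0.0], [-4.0, -4.0]])
y_all = rng.integers(0, c, size=600)
X_all = centers[y_all] + rng.normal(size=(600, 2))
X_tr, y_tr = X_all[:400], y_all[:400]       # train set
X_va, y_va = X_all[400:500], y_all[400:500] # validation set
X_te, y_te = X_all[500:], y_all[500:]       # test set

# Input: hyperparameters. Output: validation accuracy of the trained model.
grid = list(itertools.product([0.01, 0.1, 0.5], [10, 100]))  # (lr, steps)
best_lr, best_steps = max(
    grid, key=lambda h: accuracy(train(X_tr, y_tr, h[0], h[1], c), X_va, y_va)
)

# The test set is used exactly once, for the selected hyperparameters.
test_acc = accuracy(train(X_tr, y_tr, best_lr, best_steps, c), X_te, y_te)
print(best_lr, best_steps, test_acc)
```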

Hyperparameter optimization is a very different optimization problem from parameter optimization, that is, model training:

there are far fewer hyperparameters than parameters, but
evaluating a single set of hyperparameters takes an entire model training process, that is, a lot of time.

So, very different optimization methods are called for. In the near future, we'll consider the most basic approaches: grid search and random search. Later on, we might check out more sophisticated methods.

It is customary to make the validation set about the same size as the test set. That is:

If we only have one dataset, we make a 70%-15%-15% or 80%-10%-10% train-valid-test split.
If we have a train and a test set, then we randomly split as many entries off the train set for the validation set as there are test set entries.
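Both cases come down to shuffling indices and cutting them. Here is a minimal sketch, where `split_indices` is a helper name made up for illustration and the dataset size of 1000 in the first case is arbitrary. The second case uses the standard MNIST sizes of 60,000 train and 10,000 test entries.

```python
import numpy as np

def split_indices(n, fractions, seed=0):
    # Shuffle range(n), then cut it into consecutive parts of the given sizes.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    cuts = np.cumsum([int(round(f * n)) for f in fractions[:-1]])
    return np.split(perm, cuts)

# Case 1: a single dataset, split 80%-10%-10%.
n = 1000
train_idx, valid_idx, test_idx = split_indices(n, (0.8, 0.1, 0.1))

# Case 2: train and test sets are given; split as many entries off the
# train set for validation as there are test entries.
n_train, n_test = 60_000, 10_000
perm = np.random.default_rng(0).permutation(n_train)
valid_idx2, train_idx2 = perm[:n_test], perm[n_test:]

print(len(train_idx), len(valid_idx), len(test_idx))
print(len(train_idx2), len(valid_idx2))
```

Shuffling before cutting matters when the dataset is stored in some order, for instance sorted by class, since otherwise the splits would not have similar class distributions.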