2/12 Multinomial Logistic Regression I

Final Project

In the final project, you'll explore and present a Deep Learning topic on your own. See the Syllabus for a list of potential topics. You are very welcome to bring your own topic! In that case, please send me your suggestion.

Topic Selection: due 2/28

You are very welcome to work in groups! In that case, I'll ask you to make it clear who covered which part.

If you are the first to choose a topic, please raise an issue in the class repository about it.
If you want to work on a topic someone else already raised an issue about, please write a comment in the issue.

Please choose a topic by February 28, 6:00pm.

Presentation Proposal: due 3/28

In this phase, I expect you to explore the topic you chose. By March 28, 6:00pm, please send me a presentation proposal of at least 200 words per group member. It should include:

A brief description of the topic you chose.
At least 3 references you'll use to prepare your presentation.
An experiment proposal. Usually, this can be a simple modeling experiment, where you set up a model and train it on a relevant dataset.

In the week of March 31, I'll meet with each group to discuss the presentations.

Written Presentation: First Version due 4/25, Final Version due 5/9

You should send me your written presentation of at least 500 words per group member by April 25, 6:00pm. This should include:

A detailed description of the topic.
At least 3 references.
Discussion of your experiment.

You should also send me the code you used to run your experiment.

In the week of April 28, I'll meet with each group to discuss the written and oral presentations. Based on this discussion, you can change your written presentation and send me an updated version by May 9, 6:00pm. You'll get your written presentation score based on the final version.

If you hand in the final version of your written presentation less than 5 days late, then your score will be multiplied by $$ \cos\left(p\frac{\pi}{2}\right)\text{ where }p=\frac{\text{time since written presentation was due}}{\text{5 days}}. $$

Oral Presentation: Week of 5/5

You will have to present your topics in class during the week of May 5. The talks should be about 10-15 minutes long per group member. You'll get your oral presentation score based on your talk.

Final Score

Your final score will be a weighted sum of your homework score, your written presentation score and your oral presentation score. Of these three, the best score will be weighted 40% and the other two 30% each.

Classification

Recall that in case the target space $\mathscr Y$ is finite, we are dealing with a classification problem. In this case, unless stated otherwise, we let $\mathscr Y=\{0,\dotsc,c-1\}=:[c]$.

Metric: Accuracy

We now set out to create a model that predicts the classes from the feature vectors. First, we need to decide how we'll measure the performance of the model. A straightforward choice is accuracy: the proportion of data points for which the model correctly predicted the label.
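As a minimal sketch of this metric, accuracy can be computed by counting matching positions in two equally long lists. The function name `accuracy` and the example lists are illustrative assumptions, not part of the course code.

```python
def accuracy(predictions, labels):
    """Proportion of positions where the predicted class equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("accuracy is undefined for an empty dataset")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical predictions and labels with c = 3 classes.
example_predictions = [0, 2, 1, 1]
example_labels = [0, 2, 2, 1]
example_accuracy = accuracy(example_predictions, example_labels)  # 3 of 4 match
```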

Why Model Class Distribution?

One may ask: if our actual interest, as codified in our metric, concerns only the predictions themselves, why do we want to model the class distributions $\mathbf P(Y=j\mid X=\mathbf x)$ instead of just picking a class $j\in[c]$ for each feature vector $\mathbf x$?

We already discussed that predicted class probabilities constitute a richer output than just the prediction of the most probable class. There is moreover a technical reason for outputting predicted class probabilities:

The target space $\mathscr Y=[c]$ is a finite and thus discrete object. If our model were a map $\mathscr X\to \mathscr Y=[c]$, then it would be impossible to continuously deform it. This would severely limit the optimization techniques available. In particular, we couldn't use Gradient Descent, the most basic form of the optimization method used in the vast majority of Deep Learning.

Model: Multinomial Logistic Regression

Categorical Distributions and the Probability Simplex

As the target space $\mathscr Y$ is a finite set of $c$ elements, a probability distribution $p\in\mathscr P([c])$ is a categorical distribution. Such a distribution models a situation where there are $c$ distinct events, and exactly one occurs. This is given by the probabilities $p_j: 0\le j<c$, which need to satisfy the following criteria:

$0\le p_j\le 1$. Each value is a probability.
$\sum_{j=0}^{c-1}p_j=1$. Of the $c$ distinct events, exactly one occurs.

Geometrically, the possible $c$-tuples $(p_0,\dotsc,p_{c-1})$ of the values make up the standard simplex with $c$ vertices: $$ \Delta^c=\left\{(p_0,\dotsc,p_{c-1})\in\mathbf R^c_{\ge0}:\sum_{j=0}^{c-1}p_j=1\right\} $$ Therefore, we also call this object $\Delta^c$ the probability $c$-simplex. Note that many texts index simplices by their dimension and would write $\Delta^{c-1}$ for the same set.

Mapping Into the Simplex: Softmax

We want to use an affine transformation as our model. As we're trying to get $c$ probability values, the model should have $c$-dimensional output. Thus, we want to use as parameter set the pair of:

a weight matrix $W\in\mathbf R^{d\times c}$ and
a bias vector $\mathbf b\in\mathbf R^c$.

Now the function $\mathbf x\mapsto\mathbf x^TW + \mathbf b$ maps into $\mathbf R^c$ without further constraints. We should map these arbitrary $c$-dimensional vectors into the probability $c$-simplex $\Delta^c$. The canonical map to do so is the softmax function: $$ \mathrm{softmax}(z_j:0\le j<c) =\left(\frac{\exp(z_j)}{\sum_{j'=0}^{c-1}\exp(z_{j'})}:0\le j<c\right). $$ The values $z_j$ that will get transformed to probabilities are called logits, for reasons that we will explain when discussing non-multinomial, that is, binary, classification.
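The softmax formula can be sketched in plain Python as follows. One detail here is an addition not stated in the notes: the maximum logit is subtracted before exponentiating. This leaves the result unchanged, because numerator and denominator are scaled by the same factor, but it avoids overflow for large logits.

```python
import math


def softmax(logits):
    """Map a list of real-valued logits to a point of the probability simplex."""
    # Assumption (not in the notes): shift by the maximum for numerical
    # stability. exp(z_j - m) / sum_j' exp(z_j' - m) equals the original ratio.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for c = 3 classes.
probs = softmax([2.0, 1.0, 0.1])
```

The entries of `probs` are non-negative and sum to 1 up to floating point error, and larger logits receive larger probabilities.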

In total, our model, the Multinomial Logistic Regression function, is the composite $$ \mathbf R^d\xrightarrow{\mathbf x\mapsto\mathbf x^TW + \mathbf b} \mathbf R^c\xrightarrow{\mathrm{softmax}} \Delta^c, $$ where the parameters are $\theta=(W,\mathbf b)$. To emphasize that the outputs are probabilities, we will denote this model by $p_\theta$.
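To make the composite concrete, here is a small sketch in plain Python with $d=2$ features and $c=3$ classes. The helper names `affine` and `p_theta` and the parameter values are made up for illustration. The last line picks the most probable class, which is how a class prediction would be read off the predicted distribution.

```python
import math


def softmax(logits):
    """Map logits to probabilities; the max shift is only for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def affine(x, W, b):
    """Compute x^T W + b, where x has length d, W has d rows and c columns, b has length c."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def p_theta(x, W, b):
    """Multinomial logistic regression: affine map followed by softmax."""
    return softmax(affine(x, W, b))


# Hypothetical parameters theta = (W, b) and feature vector x.
W = [[1.0, 0.0, -1.0],
     [0.0, 1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [2.0, 1.0]

logits = affine(x, W, b)
probs = p_theta(x, W, b)
predicted_class = max(range(len(probs)), key=lambda j: probs[j])
```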

Optimization: Gradient Descent

As we mentioned last week, for most optimization problems, the best solution is not known in closed form. Thus, we need to approximate it. The family of optimization methods that has become almost the exclusive choice in Deep Learning is that of gradient descent-based methods.

These methods have two requirements:

The collection of parameters should be a vector in some Euclidean space $\theta\in\mathbf R^N$.
As a function of the parameter vector, the loss function $$ \mathbf R^N\xrightarrow{\theta\mapsto\ell(\mathscr D_\text{train}; \theta)}\mathbf R $$ should be piecewise differentiable.

The basic form of gradient descent, full-batch gradient descent, works as follows:

We randomly initialize the parameters $\theta$. The distribution from which we draw the parameters is important: this is the issue of initialization.
We iteratively modify $\theta$. At each step, called a training step:
1. We calculate the gradient $\nabla_\theta\ell(\mathscr D_\text{train};\theta)$. Its opposite points in the direction of greatest decrease of loss at parameter $\theta$.
2. We update $\theta$ by subtracting a scalar multiple of the gradient: $$ \theta\leftarrow\theta-\eta\cdot\nabla_\theta\ell(\mathscr D_\text{train};\theta) $$ The scalar $\eta$ is called the learning rate. It is a hyperparameter: that is, it is not a parameter of the model, but a parameter of the learning process.
We iterate until a stopping condition is met. First, we'll use the most basic stopping condition: we perform a fixed number of training steps. Note that this fixed number is our second hyperparameter.
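The steps above can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the model is a one-dimensional linear fit $y\approx wx+b$, the loss is the mean squared error over a four-point training set, the gradient is derived by hand, and the initialization is a standard normal draw. None of these choices come from the notes.

```python
import random

# Hypothetical training set, generated by y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]


def loss(theta):
    """Mean squared error of the model w * x + b over the whole training set."""
    w, b = theta
    n = len(train_x)
    return sum((w * x + b - y) ** 2 for x, y in zip(train_x, train_y)) / n


def gradient(theta):
    """Hand-derived gradient of the loss with respect to theta = (w, b)."""
    w, b = theta
    n = len(train_x)
    residuals = [w * x + b - y for x, y in zip(train_x, train_y)]
    grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, train_x))
    grad_b = 2.0 / n * sum(residuals)
    return [grad_w, grad_b]


def full_batch_gradient_descent(learning_rate, num_steps, seed=0):
    # Random initialization (assumed here: independent standard normal draws).
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    # Stopping condition: a fixed number of training steps.
    for _ in range(num_steps):
        grad = gradient(theta)  # uses the full training set
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


# The two hyperparameters: learning rate and number of training steps.
theta = full_batch_gradient_descent(learning_rate=0.05, num_steps=2000)
```

With these hyperparameters the parameters approach $w=2$, $b=1$. A learning rate that is too large would make the iteration diverge on this loss, which is one reason the learning rate needs to be chosen with care.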

At this point, you may wonder if you'll have to calculate gradients of loss functions in your code. The good news is that you won't: modern Deep Learning libraries such as pytorch include an automatic differentiation feature. This is what we'll try out in today's lab.