2/7 Linear Regression

Today, we'll go through the "Hello World" of machine learning: linear regression on a small dataset.

Course Logistics: Homework

You will be assigned your first homework today. You will get homework assigments roughly once per week. Each homework will be due 2 weeks after the day it was announced, at 10:00 am. A precise schedule will be provided. You can expect me to answer a question related to a homework if asked at least 24 hours prior to this time.

If you hand in your homework less than a week late, then your score will be multiplied by $$ \cos\left(p\frac{\pi}{2}\right)\text{ where }p = \frac{\text{time since homework was due}}{\text{1 week}}. $$ That is, for example:

If you hand in your homework 1 day late, your score will be multiplied by $\cos\frac{\pi}{14}\approx0.975$.
If you hand in your homework 2.5 days late, your score will be multiplied by $\cos\frac{5\pi}{28}\approx0.847$.
If you hand in your homework 6 days late, your score will be multiplied by $\cos\frac{3\pi}{7}\approx0.223$.

You cannot hand in your homework more than a week late.

Your worst homework score will be dropped.

Please hand in homework via email, in py files. The filename should have the pattern {name}-{homework id}.py. For example, I would submit the third homework as pal-zsamboki-3.py. Please include in your submission any custom modules you import from.

You are strongly encouraged to discuss homework problems with other students in the course, however, you are required to write your own solutions. You are supposed to understand your own solutions in full detail.

If you rely on sources different from the course notes, it is expected that you name these sources and the extent you have made use of them. In particular, if you rely on a virtual assistant (chatbot), please include the conversation with your submission.

Dataset: Abalone

We'll use the Abalone dataset. In this dataset, we're trying to predict the age of abalone through 8 other measurements such as sex or length.

Preprocessing: One-Hot Encoding

Out of the 8 measurements, we have 7 scalar features. On the other hand, sex is a categorical feature with 3 possible values: female, infant and male. Therefore, the feature space is a subset of the product $\mathbf R^7\times\{F, I, M\}$.

To be able to use linear regression, we need to convert this feature space to a subset of a Euclidean space. This is part of preprocessing: that is how we call the transformations we perform on the dataset prior to training.

The canonical way to transform categorical features to feature vectors is one-hot encoding: given a finite set $C=\{c_0,\dotsc,c_{l-1}\}$, we map entry $c_j$ to the $j$-th standard unit vector $e_j\in\mathbf R^l$.

With this, we get an input feature space with $d=10$ dimensions.

Model: Affine Transformation

Our model will be an affine transformation $$ f(\mathbf x;\theta)=\mathbf x^T\cdot\mathbf w_0 + b $$ where $\mathbf w_0\in\mathbf R^{d_0}$ is the weight vector and $b\in\mathbf R$ is the bias. That is, we have $d:=d_0+1$ parameters: $\theta=(\mathbf w_0, b)$.

Preprocessing: Add Contant 1 Feature

To simplify notation, we can augment the feature vectors $\mathbf x$ with a constant 1 feature: for a feature vector $\mathbf x\in\mathbf R^{d_0}$, let $\tilde{\mathbf x}=(\mathbf x,1)\in\mathbf R^{d_0+1}=\mathbf R^d$. Then we can formulate our model as a linear transformation: $$ f(\mathbf x; \theta)=\tilde{\mathbf x}^T\cdot\mathbf w' $$ where the weight vector $\mathbf w'=(\mathbf w, b)$.

Question

In this particular case, do we actually need this constant 1 feature? Or is it redundant (linearly dependent on the other features)?

Preprocessing: Train-Test Split

The entire dataset $\mathscr D$ has 4177 entries. With a 85%-15% train-test split, we'll get a train set $\mathscr D_\text{train}$ of 3550 entries and a test set $\mathscr D_\text{test}$ of 627 entries.

Pseudorandom Generators

Taking this random split is the first time we use a stochastic method. So, let's discuss stochasticity in Machine Learning.

Machine Learning experiments have two opposing requirements:

they should be reproducible and
they should use random sampling.

To reconcile these two, we use pseudorandom generators:

These are stateful, deterministic objects, that is if you know the present state, then you can precisely determine the future outputs.
On the other hand, the outputs approximate probabilistic distributions to the extent that Machine Learning experiments require.

Warning

The starting state of a pseudorandom generator is determined by the seed. Usually, it is an integer. Never forget to

set the seed and
record it

at the start of an experiment. Otherwise, it may get arbitrarily difficult to reproduce results or errors.

For maximum safety, I recommend also recording:

the exact code you got the experiment with, for example with a git commit ID¹ and
the versions of the modules you used, which in turn can be gotten using a mamba list or pip list terminal command.

Note

This and other requirements for secure applications such as in cryptography are checked by statistical tests. For example, see here: BSI evaluation criteria for pseudorandom number generators

Preprocessing: Summary

To summarize, we perform the following operations on the dataset:

Using one-hot encoding, we transform the categorical features to numerical vectors.
We add an additional constant 1 feature so that the model can be written as a linear transformation.
As we have only one dataset, we make a train-test split.
To simplify notation, we stack the feature vectors to a feature matrix and the targets to a target vector.

We ended up with the following data:

a $3550\times10$ train feature matrix $X_\text{train}$,
a 3550-dimensional train target vector $\mathbf y_\text{train}$,
a $627\times10$ test feature matrix $X_\text{test}$ and
a 627-dimensional test target vector $\mathbf y_\text{test}$,

Loss: Mean Squared Error

We'll use Mean Squared Error (MSE) as loss function. That is:

We'll set the weight vector $\mathbf w\in\mathbf R^d$ so that the train loss $$ \ell_\text{MSE}(\mathscr D_\text{train};\theta)=\frac{1}{|\mathscr D_\text{train}|}|X_\text{train}\mathbf w-\mathbf y_\text{train}|^2 $$ is minimal.
We'll evaluate our model $f_\theta$ by calculating the test loss $$ \ell_\text{MSE}(\mathscr D_\text{test}; \theta)=\frac{1}{|\mathscr D_\text{test}|}|X_\text{test}\mathbf w-\mathbf y_\text{test}|^2. $$

Training: Least Squares

After scalar multiplication, we are trying to find the minimum of the cumulated loss $$ \ell(\mathbf w)=|X_\text{train}\mathbf w - \mathbf y_\text{train}|^2 $$ That is, we're solving a least squares problem. As you may have seen this in a course and we don't actually need the solution method, I describe it in an appendix below.

In the coding portion, we just need two things:

Check that the columns of $X$ are independent and if not, discard redundant columns. That is, we want no redundant features.
Use the built-in least squares solver of torch.

Note that this is the only problem in this course where we can actually find the global minimum of the function we're trying to optimize. We'll only be able to approximate the minimum.

On Arrays and Tensors

The basic object we're working with in Deep Learning is the array. This is a multidimensional grid of values of the same data type. Note that by dimension of an array, we mean the dimension of the grid.

That is, one possible way to represent an array is as a function $$ [n_0]\times\dotsb\times[n_{r-1}]\xrightarrow aK $$ where

for a nonnegative integer $n$, we let $[n]=\{0,\dotsc,n-1\}$ and
$K$ is a set of numbers,
1. either Mathematical, such as the integers $\mathbf Z$ or the reals $\mathbf R$
2. or Computer Scientifical, such as 64-bit integers int64 or 32-bit floating point numbers float32.

For a collection of indices $(i_0,\dotsc,i_{r-1})\in[n_0]\times\dotsb\times[n_{r-1}]$, we denote $a((i_0,\dotsc,i_{r-1}))$ by $a[i_0,\dotsc,i_{r-1}]$.

The shape of the array is the tuple $(n_0,\dotsc,n_{r-1})$.

For example:

A vector is a 1-dimensional array of shape $(d,)$. The dimension of the vector is the single value in the shape.
A matrix is a 2-dimensional array of shape $(m, n)$. Usually, the first shape value is the number of rows and the second shape value is the number of columns.
A collection of matrices of the same shape is a 3-dimensional array.

Now popular deep learning libraries such as tensorflow or pytorch, our library of preference, refer to these objects as tensors instead of arrays. This is not 100% correct as Mathematically, a tensor is a multilinear transformation. It can only be described as a multidimensional grid of scalars once you select a basis in each vector space involved.

In applications, usually the vector spaces are the standard ones $K^d$ with the standard bases, this is how you can get around this distinction.

Appendix: Least Squares

We are trying to find the minimum of the cumulated loss $$ \ell(\mathbf w)=|X\mathbf w - \mathbf y|^2 $$ Up to discarding redundant columns, we can assume that the columns of $X$ are independent.

Our treatment of least squares follows the textbook Murphy - Probabilistic Machine Learning: An Introduction, Subsection 11.2.2.

Use Dot Product Rules

Recall that

The squared length $|\mathbf v|^2$ of a vector is the dot product $\mathbf v^T\cdot\mathbf v$.
The dot product $(\mathbf v, \mathbf w)\mapsto\mathbf v^T\cdot\mathbf w$ is a symmetric, bilinear operation.

With this, we can rewrite our cumulated loss as

\[\begin{align*} \ell(\mathbf w)&=|X\mathbf w-\mathbf y|^2 \\ &=(X\mathbf w-\mathbf y)^T\cdot(X\mathbf w-\mathbf y) \\ &=\mathbf w^TX^TX\mathbf w-2\mathbf w^TX^T\mathbf y + \mathbf y^T\mathbf y. \end{align*}\]

Calculate gradient

Our cumulated loss function is a vector-to-scalar function $\mathbf R^d\xrightarrow\ell\mathbf R$. The local extrema (minima and maxima) of such a function are among the vanishing points of its gradient. The gradient is the vector-to-vector function $$ \nabla_\mathbf w\ell=\left(\frac{\partial\ell}{\partial w_1},\dotsc,\frac{\partial\ell}{\partial w_d}\right) $$

To calculate the gradient of our particular function, we can use the identity $$ \nabla_\mathbf x(\mathbf x^TA\mathbf x)=(A+A^T)\mathbf x. $$

With this, we get $$ \nabla_\mathbf w\ell=2X^TX\mathbf w-2X^T\mathbf y. $$

Solve normal equation

That is, a parameter vector $\mathbf w$ is a local extremum of $\ell$ if and only if we have $$ X^TX\mathbf w=X^T\mathbf y. $$ This is called the normal equation of the system: geometrically, the solution $\mathbf w$ is the orthogonal projection of $\mathbf y$ onto the subspace generated by the rows of $X$.

As the columns of $X$ are independent, the matrix $X^TX$ is invertible, thus the normal equation has a unique solution.

Note that Mathematically, we can write the solution as $$ \mathbf w=(X^TX)^{-1}X^T\mathbf y. $$ But on a computer, this would be wasteful. Instead, we use the least squares solver functionality directly.

Calculate Hessian

We found out that there is a unique vanishing point of the gradient $\nabla_\mathbf w\ell$. To see if this is an extremum of $\ell$, we can calculate the Hessian, that is the matrix

\[ H(\ell)=\begin{pmatrix} \frac{\partial^2\ell}{\partial w_1w_1} & \dots & \frac{\partial^2\ell}{\partial w_1w_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2\ell}{\partial w_dw_1} & \dots & \frac{\partial^2\ell}{\partial w_dw_d} \end{pmatrix}. \]

In this particular case, we have $$ H(\ell)=X^TX. $$

This matrix is positive definite: for all $\boldsymbol0\ne\mathbf v\in\mathbf R^d$, as the columns of $X$ are linearly independent, we have $$ \mathbf v^TX^TX\mathbf v=|X\mathbf v|^2>0. $$

This shows that the function $\ell$ is convex, thus the unique vanishing point $\mathbf w$ of $\nabla_\mathbf w\ell$ we found is the global minimum.

git is a version control system you can use to track changes in a codebase such as the course repository. After each unit of recorded change -- a commit --, it generates a unique ID that can be used to revert the codebase to the state at the commit. ↩