2/7 Linear Regression
Today, we'll go through the "Hello World" of machine learning: linear regression on a small dataset.
Course Logistics: Homework
You will be assigned your first homework today. You will get homework assigments roughly once per week. Each homework will be due 2 weeks after the day it was announced, at 10:00 am. A precise schedule will be provided. You can expect me to answer a question related to a homework if asked at least 24 hours prior to this time.
If you hand in your homework less than a week late, then your score will be multiplied by $$ \cos\left(p\frac{\pi}{2}\right)\text{ where }p = \frac{\text{time since homework was due}}{\text{1 week}}. $$ That is, for example:
- If you hand in your homework 1 day late, your score will be multiplied by \(\cos\frac{\pi}{14}\approx0.975\).
- If you hand in your homework 2.5 days late, your score will be multiplied by \(\cos\frac{5\pi}{28}\approx0.847\).
- If you hand in your homework 6 days late, your score will be multiplied by \(\cos\frac{3\pi}{7}\approx0.223\).
You cannot hand in your homework more than a week late.
Your worst homework score will be dropped.
Please hand in homework via email, in py files. The filename should have the pattern {name}-{homework id}.py. For example, I would submit the third homework as pal-zsamboki-3.py. Please include in your submission any custom modules you import from.
You are strongly encouraged to discuss homework problems with other students in the course, however, you are required to write your own solutions. You are supposed to understand your own solutions in full detail.
If you rely on sources different from the course notes, it is expected that you name these sources and the extent you have made use of them. In particular, if you rely on a virtual assistant (chatbot), please include the conversation with your submission.
Dataset: Abalone
We'll use the Abalone dataset. In this dataset, we're trying to predict the age of abalone through 8 other measurements such as sex or length.
Preprocessing: One-Hot Encoding
Out of the 8 measurements, we have 7 scalar features. On the other hand, sex is a categorical feature with 3 possible values: female, infant and male. Therefore, the feature space is a subset of the product \(\mathbf R^7\times\{F, I, M\}\).
To be able to use linear regression, we need to convert this feature space to a subset of a Euclidean space. This is part of preprocessing: that is how we call the transformations we perform on the dataset prior to training.
The canonical way to transform categorical features to feature vectors is one-hot encoding: given a finite set \(C=\{c_0,\dotsc,c_{l-1}\}\), we map entry \(c_j\) to the \(j\)-th standard unit vector \(e_j\in\mathbf R^l\).
With this, we get an input feature space with \(d=10\) dimensions.
Model: Affine Transformation
Our model will be an affine transformation $$ f(\mathbf x;\theta)=\mathbf x^T\cdot\mathbf w_0 + b $$ where \(\mathbf w_0\in\mathbf R^{d_0}\) is the weight vector and \(b\in\mathbf R\) is the bias. That is, we have \(d:=d_0+1\) parameters: \(\theta=(\mathbf w_0, b)\).
Preprocessing: Add Contant 1 Feature
To simplify notation, we can augment the feature vectors \(\mathbf x\) with a constant 1 feature: for a feature vector \(\mathbf x\in\mathbf R^{d_0}\), let \(\tilde{\mathbf x}=(\mathbf x,1)\in\mathbf R^{d_0+1}=\mathbf R^d\). Then we can formulate our model as a linear transformation: $$ f(\mathbf x; \theta)=\tilde{\mathbf x}^T\cdot\mathbf w' $$ where the weight vector \(\mathbf w'=(\mathbf w, b)\).
Question
In this particular case, do we actually need this constant 1 feature? Or is it redundant (linearly dependent on the other features)?
Preprocessing: Train-Test Split
The entire dataset \(\mathscr D\) has 4177 entries. With a 85%-15% train-test split, we'll get a train set \(\mathscr D_\text{train}\) of 3550 entries and a test set \(\mathscr D_\text{test}\) of 627 entries.
Pseudorandom Generators
Taking this random split is the first time we use a stochastic method. So, let's discuss stochasticity in Machine Learning.
Machine Learning experiments have two opposing requirements:
- they should be reproducible and
- they should use random sampling.
To reconcile these two, we use pseudorandom generators:
-
These are stateful, deterministic objects, that is if you know the present state, then you can precisely determine the future outputs.
-
On the other hand, the outputs approximate probabilistic distributions to the extent that Machine Learning experiments require.
Warning
The starting state of a pseudorandom generator is determined by the seed. Usually, it is an integer. Never forget to
- set the seed and
- record it
at the start of an experiment. Otherwise, it may get arbitrarily difficult to reproduce results or errors.
For maximum safety, I recommend also recording:
- the exact code you got the experiment with, for example with a
gitcommit ID1 and - the versions of the modules you used, which in turn can be gotten using a
mamba listorpip listterminal command.
Note
This and other requirements for secure applications such as in cryptography are checked by statistical tests. For example, see here: BSI evaluation criteria for pseudorandom number generators
Preprocessing: Summary
To summarize, we perform the following operations on the dataset:
- Using one-hot encoding, we transform the categorical features to numerical vectors.
- We add an additional constant 1 feature so that the model can be written as a linear transformation.
- As we have only one dataset, we make a train-test split.
- To simplify notation, we stack the feature vectors to a feature matrix and the targets to a target vector.
We ended up with the following data:
- a \(3550\times10\) train feature matrix \(X_\text{train}\),
- a 3550-dimensional train target vector \(\mathbf y_\text{train}\),
- a \(627\times10\) test feature matrix \(X_\text{test}\) and
- a 627-dimensional test target vector \(\mathbf y_\text{test}\),
Loss: Mean Squared Error
We'll use Mean Squared Error (MSE) as loss function. That is:
-
We'll set the weight vector \(\mathbf w\in\mathbf R^d\) so that the train loss $$ \ell_\text{MSE}(\mathscr D_\text{train};\theta)=\frac{1}{|\mathscr D_\text{train}|}|X_\text{train}\mathbf w-\mathbf y_\text{train}|^2 $$ is minimal.
-
We'll evaluate our model \(f_\theta\) by calculating the test loss $$ \ell_\text{MSE}(\mathscr D_\text{test}; \theta)=\frac{1}{|\mathscr D_\text{test}|}|X_\text{test}\mathbf w-\mathbf y_\text{test}|^2. $$
Training: Least Squares
After scalar multiplication, we are trying to find the minimum of the cumulated loss $$ \ell(\mathbf w)=|X_\text{train}\mathbf w - \mathbf y_\text{train}|^2 $$ That is, we're solving a least squares problem. As you may have seen this in a course and we don't actually need the solution method, I describe it in an appendix below.
In the coding portion, we just need two things:
- Check that the columns of \(X\) are independent and if not, discard redundant columns. That is, we want no redundant features.
- Use the built-in least squares solver of
torch.
Note that this is the only problem in this course where we can actually find the global minimum of the function we're trying to optimize. We'll only be able to approximate the minimum.
On Arrays and Tensors
The basic object we're working with in Deep Learning is the array. This is a multidimensional grid of values of the same data type. Note that by dimension of an array, we mean the dimension of the grid.
That is, one possible way to represent an array is as a function $$ [n_0]\times\dotsb\times[n_{r-1}]\xrightarrow aK $$ where
- for a nonnegative integer \(n\), we let \([n]=\{0,\dotsc,n-1\}\) and
- \(K\) is a set of numbers,
- either Mathematical, such as the integers \(\mathbf Z\) or the reals \(\mathbf R\)
- or Computer Scientifical, such as 64-bit integers
int64or 32-bit floating point numbersfloat32.
For a collection of indices \((i_0,\dotsc,i_{r-1})\in[n_0]\times\dotsb\times[n_{r-1}]\), we denote \(a((i_0,\dotsc,i_{r-1}))\) by \(a[i_0,\dotsc,i_{r-1}]\).
The shape of the array is the tuple \((n_0,\dotsc,n_{r-1})\).
For example:
- A vector is a 1-dimensional array of shape \((d,)\). The dimension of the vector is the single value in the shape.
- A matrix is a 2-dimensional array of shape \((m, n)\). Usually, the first shape value is the number of rows and the second shape value is the number of columns.
- A collection of matrices of the same shape is a 3-dimensional array.
Now popular deep learning libraries such as tensorflow or pytorch, our library of preference, refer to these objects as tensors instead of arrays. This is not 100% correct as Mathematically, a tensor is a multilinear transformation. It can only be described as a multidimensional grid of scalars once you select a basis in each vector space involved.
In applications, usually the vector spaces are the standard ones \(K^d\) with the standard bases, this is how you can get around this distinction.
Appendix: Least Squares
We are trying to find the minimum of the cumulated loss $$ \ell(\mathbf w)=|X\mathbf w - \mathbf y|^2 $$ Up to discarding redundant columns, we can assume that the columns of \(X\) are independent.
Our treatment of least squares follows the textbook Murphy - Probabilistic Machine Learning: An Introduction, Subsection 11.2.2.
Use Dot Product Rules
Recall that
- The squared length \(|\mathbf v|^2\) of a vector is the dot product \(\mathbf v^T\cdot\mathbf v\).
- The dot product \((\mathbf v, \mathbf w)\mapsto\mathbf v^T\cdot\mathbf w\) is a symmetric, bilinear operation.
With this, we can rewrite our cumulated loss as
Calculate gradient
Our cumulated loss function is a vector-to-scalar function \(\mathbf R^d\xrightarrow\ell\mathbf R\). The local extrema (minima and maxima) of such a function are among the vanishing points of its gradient. The gradient is the vector-to-vector function $$ \nabla_\mathbf w\ell=\left(\frac{\partial\ell}{\partial w_1},\dotsc,\frac{\partial\ell}{\partial w_d}\right) $$
To calculate the gradient of our particular function, we can use the identity $$ \nabla_\mathbf x(\mathbf x^TA\mathbf x)=(A+A^T)\mathbf x. $$
With this, we get $$ \nabla_\mathbf w\ell=2X^TX\mathbf w-2X^T\mathbf y. $$
Solve normal equation
That is, a parameter vector \(\mathbf w\) is a local extremum of \(\ell\) if and only if we have $$ X^TX\mathbf w=X^T\mathbf y. $$ This is called the normal equation of the system: geometrically, the solution \(\mathbf w\) is the orthogonal projection of \(\mathbf y\) onto the subspace generated by the rows of \(X\).
As the columns of \(X\) are independent, the matrix \(X^TX\) is invertible, thus the normal equation has a unique solution.
Note that Mathematically, we can write the solution as $$ \mathbf w=(X^TX)^{-1}X^T\mathbf y. $$ But on a computer, this would be wasteful. Instead, we use the least squares solver functionality directly.
Calculate Hessian
We found out that there is a unique vanishing point of the gradient \(\nabla_\mathbf w\ell\). To see if this is an extremum of \(\ell\), we can calculate the Hessian, that is the matrix
In this particular case, we have $$ H(\ell)=X^TX. $$
This matrix is positive definite: for all \(\boldsymbol0\ne\mathbf v\in\mathbf R^d\), as the columns of \(X\) are linearly independent, we have $$ \mathbf v^TX^TX\mathbf v=|X\mathbf v|^2>0. $$
This shows that the function \(\ell\) is convex, thus the unique vanishing point \(\mathbf w\) of \(\nabla_\mathbf w\ell\) we found is the global minimum.
-
gitis a version control system you can use to track changes in a codebase such as the course repository. After each unit of recorded change -- a commit --, it generates a unique ID that can be used to revert the codebase to the state at the commit. ↩