2/5 Basics of Supervised Learning

Course Logistics: General Information

Instructor

Office Hours

By appointment, online or in person. Write or tell me if you want to set up a meeting.

Class Website

You can access class content on the class website and the class repository.

To gain access to the latter, please send me an email with your Codeberg username; if you don't have an account, you can create one on the Codeberg website.

Info

I am not hosting the course content on GitHub as I try to avoid giving away my data, especially to big companies such as Microsoft. Follow this link to read more on the issue: Software Freedom Conservancy: Give Up GitHub!

Although all necessary class material will be posted on the website, there are at least 3 good reasons to use the class repository:

I will send out announcements, for example when I publish new homework assignments, on the repository.
Using issues is a good way to discuss course content.
The easiest way to download the latest versions of the lab notebooks is with a git pull.

Classes: Part Lectures, Part Labs

Classes will be part lectures and part labs. You are very welcome to work in groups! Lab notebooks without and with solutions will be posted on the class repository and the class website.

Work Environment: Local or Online

You can work on the lab notebooks and the homeworks either in a local environment or online. For setting up a local environment, see the Installation guide. After installation, you can open the notebooks in Jupyter Lab, which you can start in a browser tab by typing

jupyter lab

in a terminal, or in an IDE (Integrated Development Environment) such as VS Code. Note that the latter may not work in some configurations; in that case, try Jupyter Lab.

For now, if you want to run your code or a notebook online, see Google Colab.

Highly Recommended: Python Learning Exercises on Exercism

To make sure that you know the basic concepts of Python, it is highly recommended to solve at least the learning exercises on Exercism: Python Key Concepts on Exercism.

Coursework: Homework, Written Presentation, Oral Presentation

Your total score will be made up of three portions: homework, written presentation, oral presentation. You can see all rules in the Syllabus. I shall discuss homeworks this Friday and final project presentations next Wednesday.

Using Virtual Assistants (Chatbots)

In classwork such as a homework or the presentation, if you use a virtual assistant, please include in your submission your conversation with the virtual assistant. Usually, you can obtain a shareable link from the conversation interface.

Also, note that you have to write the homeworks, the written presentation, and the text of the oral presentation on your own. Do not copy the output of the virtual assistant.

Info

I recommend using virtual assistants whose privacy policy clearly states that your conversations will not be shared with anyone, in any form.

One such option is HuggingChat. This is a website by Hugging Face where you can converse with a selection of open source virtual assistants. You can see their privacy policy following this link.

As an alternative, certain subscriptions may offer more privacy. For example, if you use ChatGPT with a Team subscription, OpenAI models will not be trained on your usage data: Enterprise privacy at OpenAI. Two people are enough to form a team, so you can ask a friend.

Supervised Learning: Basic Notions

For most of the course, we shall be dealing with supervised learning: We are given two sets $\mathscr X, \mathscr Y$ and a sequence of pairs $\mathscr D=((x_i, y_i)\in\mathscr X\times\mathscr Y: i\in I)$. We want to predict the values $y_i$ from the $x_i$.

We usually refer to $\mathscr D$ as the dataset. Note that it's not actually a subset of $\mathscr X\times\mathscr Y$ as you may have $(x_i,y_i)=(x_{i'},y_{i'})$ for some $i\ne i'$.
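As a minimal sketch of this distinction, the snippet below stores a hypothetical toy dataset as a Python list of pairs (the values are made up for illustration). A list keeps repeated pairs, while converting it to a set collapses them.

```python
# Hypothetical toy dataset: a list of (x, y) pairs, so order and repeats are kept.
dataset = [((1.0, 2.0), 0), ((0.5, 1.5), 1), ((1.0, 2.0), 0)]

# A set would collapse the repeated pair and lose one observation.
distinct_pairs = set(dataset)

print(len(dataset))         # number of data points |I|: 3
print(len(distinct_pairs))  # number of distinct pairs: 2
```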

Input Features

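The set $\mathscr X$ is called the (input) feature space. Usually, we have $\mathscr X\subseteq\mathbf R^d$ for some $d$. The elements $x\in\mathscr X$ are (input) feature vectors. The components of a feature vector are called features; do not confuse them with the data points $x_i$ of the dataset, which are entire feature vectors.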

Note

We will see later that the power of Deep Neural Networks comes from their ability to learn progressively more refined internal (hidden) representations of the data. These internal representations are also vectors, and their components are also called features. This is why it will be important later on to specify whether a feature is an input feature or a hidden feature.

Targets

The set $\mathscr Y$ is called the target space. The entries $y\in\mathscr Y$ are targets. Usually, we have one of the following two options:

Regression: We have $\mathscr Y\subseteq\mathbf R^k$ for some $k$. In this case, we say that we are dealing with a regression task. For example, in the Abalone dataset that we'll use for introduction, we are trying to predict the age of abalone (a type of mollusc) from physical measurements.
Classification: We have $|\mathscr Y|<\infty$. Then we are dealing with a classification task. The entries of $\mathscr Y$ are called classes or labels. For example, in the MNIST dataset, which will recur often during the course, we are trying to identify a handwritten digit from the pixels of its image.

Datasets as Samples from a Distribution

In the language of Probability Theory, the sets $\mathscr X$ and $\mathscr Y$ are the sets of outcomes of random variables $X$ and $Y$, and the dataset $\mathscr D$ is a collection of samples from the joint distribution $(X, Y)$ of $X$ and $Y$.

Prediction

By predicting the $y_i$ from the $x_i$, we can mean at least the following two options:

Predict target values: Given $x_i$, one can try to directly approximate the corresponding $y_i$. That is, we want to create a function $\mathscr X\xrightarrow f\mathscr Y$ such that $f(x_i)$ is close to $y_i$ for all $i\in I$. This is the approach we'll take for regression.
Predict target conditional distributions: Given $x_i$, one can also try to approximate the conditional distribution $Y|X=x_i$. This is what we'll do in the classification case. That is, for each $x\in\mathscr X$ and $y\in\mathscr Y$, we'll try to approximate the probability $\mathbf P(Y=y|X=x)$. Or put in a fancier way, we'll want to create a function $\mathscr X\xrightarrow f\mathscr P(\mathscr Y)$ where $\mathscr P(\mathscr Y)$ denotes the set of probability distributions on the set $\mathscr Y$.
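To make the second option concrete, here is a minimal sketch that approximates $\mathbf P(Y=y|X=x)$ by relative frequencies. The toy data and the function name `fit_conditional_frequencies` are hypothetical, and counting only works when the feature takes few distinct values; it is an illustration of a function $\mathscr X\to\mathscr P(\mathscr Y)$, not the method used later in the course.

```python
from collections import Counter, defaultdict

# Hypothetical toy dataset with one discrete feature and a class label.
dataset = [
    ("sunny", "walk"), ("sunny", "walk"), ("sunny", "bus"),
    ("rainy", "bus"), ("rainy", "bus"), ("rainy", "walk"), ("rainy", "bus"),
]

def fit_conditional_frequencies(pairs):
    """Estimate P(Y = y | X = x) by relative frequencies among pairs sharing x."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    return {
        x: {y: n / sum(c.values()) for y, n in c.items()}
        for x, c in counts.items()
    }

# f maps each feature value x to a probability distribution over the classes.
f = fit_conditional_frequencies(dataset)
print(f["sunny"])
print(f["rainy"])
```

For each $x$, the values of `f[x]` are nonnegative and sum to 1, so `f[x]` is an element of $\mathscr P(\mathscr Y)$.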

Parametric Models

In machine learning, the function $\mathscr X\xrightarrow f\mathscr Y\text{ or }\mathscr P(\mathscr Y)$ will be a function with many parameters $\theta$. Therefore, we'll usually write $f(x)=f(x;\theta)\text{ or }f_\theta(x)$.

For example, in our first experiment, we'll let $f_\theta$ be an affine transformation: $$ f_\theta(\mathbf x)=\mathbf x^TW+\mathbf b^T\text{ where }W\in\mathbf R^{d\times k}\text{ and }\mathbf b\in\mathbf R^k. $$ We call $W$ the weight matrix and $\mathbf b$ the bias vector.
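The affine map can be sketched in a few lines of plain Python. The helper name `affine` and the parameter values are made up for illustration, with $d=3$ and $k=2$; in practice one would typically use an array library instead of nested lists.

```python
def affine(x, W, b):
    """Compute x^T W + b^T for x of length d, W of shape d x k, b of length k."""
    d, k = len(W), len(b)
    assert len(x) == d and all(len(row) == k for row in W)
    return [sum(x[i] * W[i][j] for i in range(d)) + b[j] for j in range(k)]

# Hypothetical parameters theta = (W, b) with d = 3 input features and k = 2 targets.
W = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
b = [0.5, -1.0]
x = [1.0, 2.0, 3.0]

y_hat = affine(x, W, b)
print(y_hat)  # [4.5, 6.0]
```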

Loss Functions

We determine the parameter values $\theta$ algorithmically. This is called training the model.

We train the model $f_\theta$ by trying to find the parameter set $\theta$ that fits the dataset $\mathscr D$ as well as possible from a given perspective. This perspective is formalized by the loss function $\ell(x, y;\theta)$, which measures by how much $f(x; \theta)$ misses $y$.

For example, for regression one can use the Squared Error (SE), which is the squared Euclidean distance between $f(x;\theta)$ and $y$: $$ \ell_{SE}(x, y;\theta)=|y-f(x;\theta)|^2 $$

To evaluate a model $f_\theta$ on a dataset, we usually take the average of the losses on the data points. That is, we let $$ \ell(\mathscr D; \theta)=\frac{1}{|I|}\sum_{i\in I}\ell(x_i, y_i; \theta). $$ For regression, we get the Mean Squared Error (MSE): $$ \ell_{MSE}(\mathscr D; \theta)=\frac{1}{|I|}\sum_{i\in I}\ell_{SE}(x_i, y_i; \theta) $$
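The two formulas translate directly into code. The following is a minimal sketch in plain Python; the model `f` (with its parameters already fixed) and the three data points are hypothetical.

```python
def squared_error(y, y_hat):
    """Squared Euclidean distance between a target y and a prediction y_hat."""
    return sum((a - p) ** 2 for a, p in zip(y, y_hat))

def mean_loss(dataset, f, loss):
    """Average of the per-point losses over the dataset."""
    return sum(loss(y, f(x)) for x, y in dataset) / len(dataset)

def f(x):
    """Hypothetical model with fixed parameters: f(x) = 2x + 1, with d = k = 1."""
    return [2.0 * x[0] + 1.0]

# Hypothetical dataset of (x, y) pairs; the predictions are 1, 3 and 5.
dataset = [([0.0], [1.0]), ([1.0], [2.0]), ([2.0], [7.0])]

mse = mean_loss(dataset, f, squared_error)
print(mse)  # squared errors are 0, 1, 4, so the MSE is 5/3
```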

Train-Test Split

Recall that we think of the dataset $\mathscr D$ as a sample from the joint distribution of $X$ and $Y$. Therefore, we need $f$ to generalize to unseen data. That is, what we actually want to minimize is the expected loss $$ \mathbf E\ell(X, Y;\theta). $$ We can't compute this directly, since we don't know the distribution; that's why we estimate it with the average loss on a dataset.

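Since $\theta$ is fitted to a dataset, the average loss on that same dataset tends to underestimate the expected loss. This means that we need two datasets, sampled independently: a train dataset $\mathscr D_\text{train}\sim(X, Y)$ and a test dataset $\mathscr D_\text{test}\sim(X, Y)$. With this, we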

Set $\theta$ so that $\ell(\mathscr D_\text{train}; \theta)$ is small (we can't necessarily find the global minimum, but we try).
Calculate $\ell(\mathscr D_\text{test}; \theta)$ with the same $\theta$. This estimates how well our model $f_\theta$ generalizes to unseen data.
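The two steps can be sketched as follows. Everything here is an assumption made for illustration: the joint distribution ($y=3x+2$ plus Gaussian noise), the sample sizes, and the one-feature least-squares fit used as the training step.

```python
import random

def sample_dataset(n, rng):
    """Draw n pairs from a hypothetical joint distribution: y = 3x + 2 + noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + 2.0 + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data

def fit_line(data):
    """Least-squares fit of y ~ w*x + b (closed form for a single feature)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in data)
    var = sum((x - mean_x) ** 2 for x, _ in data)
    w = cov / var
    return w, mean_y - w * mean_x

def mse(data, w, b):
    """Mean squared error of the line w*x + b on the given dataset."""
    return sum((y - (w * x + b)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)           # fixed seed so the run is reproducible
train = sample_dataset(200, rng)
test = sample_dataset(50, rng)   # sampled independently of the train set

w, b = fit_line(train)           # step 1: make the train loss small
train_mse = mse(train, w, b)
test_mse = mse(test, w, b)       # step 2: same parameters, unseen data
print(train_mse, test_mse)
```

Only `train` is used to choose `w` and `b`, so `test_mse` estimates the expected loss on unseen data.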

Often we only have access to a single dataset $\mathscr D=((x_i, y_i): i\in I)$. In this case, we perform a train-test split:

We decide on a train-test proportion. Usually this is 85%-15% or 90%-10%. Let's denote these by $p_\text{train}$ and $p_\text{test}$.
We randomly split the index set $I$ into a $p_\text{train}$ portion $I_\text{train}$ and a $p_\text{test}$ portion $I_\text{test}$ (with rounding).
We let $\mathscr D_\text{train}=((x_i, y_i): i\in I_\text{train})$ and $\mathscr D_\text{test}=((x_i, y_i): i\in I_\text{test})$.
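The three steps above can be sketched in plain Python. The function name `split_train_test`, the default proportion, and the toy dataset are hypothetical; libraries such as scikit-learn offer ready-made helpers for the same task.

```python
import random

def split_train_test(dataset, p_test=0.15, seed=0):
    """Randomly split a list of (x, y) pairs into a train part and a test part."""
    n = len(dataset)
    indices = list(range(n))                 # the index set I
    random.Random(seed).shuffle(indices)     # random order, reproducible via seed
    n_test = round(p_test * n)               # size of I_test, with rounding
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    d_train = [dataset[i] for i in train_idx]
    d_test = [dataset[i] for i in test_idx]
    return d_train, d_test

# Hypothetical dataset with 20 points and an 85%-15% split.
dataset = [(float(i), 2.0 * i) for i in range(20)]
d_train, d_test = split_train_test(dataset, p_test=0.15)
print(len(d_train), len(d_test))  # 17 3
```

Shuffling the indices rather than the data itself mirrors the notation above: $I_\text{train}$ and $I_\text{test}$ are disjoint and together cover $I$.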