3/14 Nonlinear Function Approximation with Multilayer Perceptrons

Note that in all supervised learning tasks so far, we were optimizing affine transformations

\[ \mathbf R^n\xrightarrow{\mathbf x\mapsto \mathbf x^T\cdot W + \mathbf b}\mathbf R^k:W\in\mathbf R^{n\times k},\,\mathbf b\in\mathbf R^k. \]

Synthetic Dataset: Hypersphere Classification

Let us introduce a synthetic classification dataset that logistic regression may not be able to handle: the feature vectors of a given class $c\in[C]$ are sampled from the distribution

$$ r_c\left(\mathscr U(S^{n-1})+\mathscr N(\mathbf0,s)\right)+\mathbf x_c, $$ that is:

1. We sample points from the uniform distribution on the $(n-1)$-dimensional unit hypersphere

\[ S^{n-1}=\{(x_1,\dotsc,x_n)\in\mathbf R^n:x_1^2+\dotsb+x_n^2=1\}, \]

2. we add noise to this, sampled from the normal distribution with mean $\mathbf 0$ and std $s$,
3. we multiply the sum by the radius $r_c\in\mathbf R_{>0}$,
4. and finally we translate the product by the offset $\mathbf x_c\in\mathbf R^n$.
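The sampling steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper names `sample_unit_sphere` and `sample_class` are made up for this sketch, the noise is taken to be isotropic (independent $\mathscr N(0, s^2)$ per coordinate), and the radii are assigned as 0.8 for the negative class and 1.2 for the positive class.

```python
import math
import random


def sample_unit_sphere(n, rng):
    """Draw a point uniformly from S^{n-1} by normalizing a standard Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(t * t for t in v))
        if norm > 0.0:  # guard against the (probability zero) zero vector
            return [t / norm for t in v]


def sample_class(num_samples, n, radius, offset, noise_std, seed=0):
    """Sample r_c * (U(S^{n-1}) + N(0, s)) + x_c for a single class c."""
    rng = random.Random(seed)
    points = []
    for _ in range(num_samples):
        u = sample_unit_sphere(n, rng)
        # Assumption: isotropic noise, i.e. independent N(0, s^2) per coordinate.
        noisy = [t + rng.gauss(0.0, noise_std) for t in u]
        points.append([radius * t + o for t, o in zip(noisy, offset)])
    return points


# Assumed class assignment: radius 0.8 for negatives, 1.2 for positives.
negatives = sample_class(1000, n=2, radius=0.8, offset=[0.0, 0.0], noise_std=0.1, seed=0)
positives = sample_class(1000, n=2, radius=1.2, offset=[4.0, -2.0], noise_std=0.1, seed=1)
```

Normalizing a Gaussian vector is a standard way to sample uniformly from a sphere in any dimension, which is why the sketch does not special-case $n=2$.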

For example, here's how 10 000 samples from the datasets with

the same ambient dimension $n=2$,
the same radii $0.8, 1.2$,
the same additive noise std $0.1$,
the same negative offset $(0, 0)$ and
varying positive offsets $(0, 0), (1, 1), (4, -2)$

look like:

2d toy datasets with varying positive offsets: synthetic classification datasets with $n=2$ feature dimensions. The colors of the points signify the classes of the dataset entries.

Recall that for a binary logistic regression classifier

$$ \mathbf P_\theta(Y=1|X=\mathbf x)=\sigma(\mathbf x^T\mathbf w+b), $$ the model predicts that $\mathbf x$ is positive if this probability is larger than 50%, which is equivalent to

$$ \mathbf x^T\mathbf w+b>0. $$ Therefore, the entries that are predicted positive and negative are separated by the hyperplane

$$ \mathbf x^T\mathbf w+b=0, $$ the so-called decision boundary.

We'll train logistic regression classifiers by SGD on the synthetic datasets. For now, I'll write the accuracies, and draw the decision boundaries:
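As a rough illustration of what such a training run could look like, here is a minimal sketch of logistic regression trained by plain SGD on the dataset with positive offset $(4,-2)$. It is not the code used to produce the figures: the learning rate, the number of epochs, the sample sizes, the assignment of radius 0.8 to the negative class and the helper names are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_2d(num_samples, radius, offset, noise_std, rng):
    """Noisy circle samples: radius * (unit vector + Gaussian noise) + offset."""
    points = []
    for _ in range(num_samples):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x = radius * (math.cos(angle) + rng.gauss(0.0, noise_std)) + offset[0]
        y = radius * (math.sin(angle) + rng.gauss(0.0, noise_std)) + offset[1]
        points.append((x, y))
    return points


def train_logistic_sgd(data, lr=0.1, epochs=5, seed=0):
    """Plain SGD on the binary cross-entropy loss, one sample per step."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for (x, y), label in data:
            p = sigmoid(x * w[0] + y * w[1] + b)
            grad = p - label  # derivative of the loss with respect to the logit
            w[0] -= lr * grad * x
            w[1] -= lr * grad * y
            b -= lr * grad
    return w, b


def accuracy(data, w, b):
    """Fraction of entries on the correct side of the line x^T w + b = 0."""
    hits = sum((x * w[0] + y * w[1] + b > 0) == (label == 1) for (x, y), label in data)
    return hits / len(data)


rng = random.Random(0)
data = [(p, 0) for p in sample_2d(1000, 0.8, (0.0, 0.0), 0.1, rng)]
data += [(p, 1) for p in sample_2d(1000, 1.2, (4.0, -2.0), 0.1, rng)]
w, b = train_logistic_sgd(data)
train_accuracy = accuracy(data, w, b)
```

With this offset the two classes are far apart, so a separating line exists and the training accuracy should end up close to 1. The learned `w` and `b` define the decision boundary $\mathbf x^T\mathbf w+b=0$.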

2d toy datasets with linear decision boundaries — Logistic regression on the synthetic datasets. The lines are the decision boundaries.

Question

Why can't we see the decision boundary in the left-hand case of no positive class offset?

Perceptrons

Given a binary logistic regression model $$ \mathbf x\mapsto\mathbf P_\theta(Y=1|X=\mathbf x)=\sigma(\mathbf x^T\mathbf w+b), $$ we can get the predicted class model as $$ \mathbf x\mapsto H(\mathbf x^T\mathbf w+b), $$ where $H$ is the Heaviside step (threshold) function¹

\[ H(z)=\begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases} \]

We call the predicted class model a perceptron.
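A perceptron is short enough to write out directly. The sketch below uses the convention $H(0)=0$ from the definition above; the weights and inputs are hypothetical values chosen only to show that thresholding the logistic probability at 50% and applying $H$ to the logit give the same class.

```python
import math


def heaviside(z):
    """Heaviside step function with the convention H(0) = 0."""
    return 1 if z > 0 else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(x, w, b):
    """Predicted class H(x^T w + b) for a feature vector x."""
    logit = sum(xi * wi for xi, wi in zip(x, w)) + b
    return heaviside(logit)


# Hypothetical parameters and inputs, chosen only for illustration.
w, b = [1.0, -2.0], 0.5
inputs = [[3.0, 1.0], [0.0, 1.0], [-1.0, -1.0]]
predictions = [perceptron(x, w, b) for x in inputs]

# The same classes via the logistic regression probability and a 50% threshold.
probabilities = [sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b) for x in inputs]
thresholded = [int(p > 0.5) for p in probabilities]
```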

Piecewise linear decision boundary with ReLU

Consider the 1D hypersphere dataset with no offset:

1D toy dataset — Synthetic dataset with $n=1$ feature dimension and no positive class offset.

Just like in the 2D case, there is no linear classifier that can successfully separate the two classes. On the other hand, if we are allowed piecewise linear decision boundaries, we can do this:

Graph of the piecewise linear decision boundary.

It turns out that we can get a decision boundary like this with an Artificial Neural Network. We'll see in the universal approximation theorem that we only need to use one non-polynomial scalar-to-scalar function. A common choice is the Rectified Linear Unit (ReLU), that sets negative values to 0 elementwise:

\[ \mathrm{ReLU}(x)=\max(0, x). \]

With this, we can form our binary classifier as

\[ f(x)=\mathrm{ReLU}\left(x\cdot\begin{pmatrix} 1 & -1 \end{pmatrix}\right) \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} - 1 \]

Let's see what this function does one operation at a time!

Here's the dataset again:

First of all, it applies the linear transformation $x \mapsto x\cdot\begin{pmatrix} 1 & -1 \end{pmatrix}$

1d toy dataset linear transformation — The dataset transformed by $x \mapsto x\cdot\begin{pmatrix} 1 & -1 \end{pmatrix}$

Then it applies ReLU elementwise

Finally, it applies the linear transformation $\mathbf h\mapsto \mathbf h^T\cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} - 1$. Instead of plotting the 1D image, we plot the previous image together with the decision boundary we get with this.

We get an accuracy rate of about 97.3%. Note that the function $x \mapsto \mathrm{ReLU}\left(x\cdot\begin{pmatrix} 1 & -1 \end{pmatrix}\right)$ transformed the feature vectors in a way that using an affine classifier $\mathbf h\mapsto \mathbf h^T\cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} - 1$ became feasible.
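Since $\mathrm{ReLU}(x)+\mathrm{ReLU}(-x)=|x|$, the classifier simplifies to $f(x)=|x|-1$, so it predicts positive exactly when $|x|>1$. The following sketch evaluates $f$ step by step and estimates its accuracy on freshly sampled data. The setup is an assumption made for this example: radii 0.8 (negative) and 1.2 (positive), noise std 0.1, no offsets, and 5 000 samples per class.

```python
import random


def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, t) for t in v]


def f(x):
    """f(x) = ReLU(x * (1, -1)) . (1, 1)^T - 1, which simplifies to |x| - 1."""
    hidden = relu([x * 1.0, x * -1.0])  # linear map R -> R^2, then ReLU
    return hidden[0] * 1.0 + hidden[1] * 1.0 - 1.0  # affine head R^2 -> R


def sample_1d(num_samples, radius, noise_std, rng):
    """1D samples: S^0 = {-1, +1}, so draw a random sign, add noise, scale by the radius."""
    return [radius * (rng.choice([-1.0, 1.0]) + rng.gauss(0.0, noise_std))
            for _ in range(num_samples)]


rng = random.Random(0)
# Assumed setup: radii 0.8 (negative) and 1.2 (positive), noise std 0.1, no offsets.
negatives = sample_1d(5000, 0.8, 0.1, rng)
positives = sample_1d(5000, 1.2, 0.1, rng)
correct = sum(f(x) <= 0 for x in negatives) + sum(f(x) > 0 for x in positives)
accuracy = correct / 10000
```

Under these assumptions the estimate should land near the accuracy quoted above, with small fluctuations from the random sample.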

Multilayer Perceptrons

Construction

A Multilayer Perceptron (MLP) is a composite function of the form

\[ \mathbf R^{n_0}\xrightarrow{A_0}\mathbf R^{n_1}\xrightarrow a\mathbf R^{n_1}\xrightarrow{A_1} \dotsb\mathbf R^{n_{d-1}}\xrightarrow{A_{d-1}}\mathbf R^{n_d}\xrightarrow a\mathbf R^{n_d} \xrightarrow{A_d}\mathbf R^{n_{d+1}}\xrightarrow g\mathbf R^k \]

where:

For each $l=0,\dotsc,d$: the map $\mathbf R^{n_l}\xrightarrow{A_l}\mathbf R^{n_{l+1}}$ is an affine transformation $\mathbf x\mapsto \mathbf x^T\cdot W_l+\mathbf b_l$ given by
1. a weight matrix $W_l\in\mathbf R^{n_l\times n_{l+1}}$ and
2. a bias vector $\mathbf b_l\in\mathbf R^{n_{l+1}}$.
The map $\mathbf R\xrightarrow a\mathbf R$ is the activation function, an almost everywhere differentiable function that is applied elementwise to the vectors.
1. We call the composites $\mathbf R^{n_l}\xrightarrow{A_l}\mathbf R^{n_{l+1}}\xrightarrow a\mathbf R^{n_{l+1}}$ hidden layers.
2. The composite $\mathbf R^{n_d}\xrightarrow{A_d}\mathbf R^{n_{d+1}}\xrightarrow g\mathbf R^k$ is called the head. The optional function $g$ is a domain transformation function such as a softmax or a sigmoid.

An MLP with 1 hidden layer is called a Shallow Neural Network and an MLP with more than 1 hidden layer is called a Deep Neural Network. These functions are special cases of Feedforward Neural Networks (FNN or FFN) and Artificial Neural Networks (ANN).
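The construction can be sketched as a short forward pass. This is an illustrative sketch with plain Python lists rather than a tensor library; the function names and the `layers` format (a list of `(W_l, b_l)` pairs for $l=0,\dotsc,d$, with each $W_l$ stored as a list of rows) are choices made for this example.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, t) for t in v]


def affine(x, W, b):
    """Compute x^T W + b, where W has shape (len(x), len(b)) as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def mlp_forward(x, layers, activation=relu, head=None):
    """Apply A_0, a, A_1, a, ..., A_{d-1}, a, A_d and then the optional head function g.

    `layers` is a list of (W_l, b_l) pairs for l = 0, ..., d.
    """
    *hidden, last = layers
    for W, b in hidden:
        x = activation(affine(x, W, b))  # hidden layer: affine map, then activation
    x = affine(x, *last)  # final affine map A_d, no activation
    return head(x) if head is not None else x


# The 1D ReLU classifier from the previous section, written as an MLP with one hidden layer.
layers = [
    ([[1.0, -1.0]], [0.0, 0.0]),  # A_0: R^1 -> R^2
    ([[1.0], [1.0]], [-1.0]),     # A_1: R^2 -> R^1
]
outputs = [mlp_forward([x], layers)[0] for x in (-2.0, 0.5, 1.5)]
```

Passing `head=` a softmax or sigmoid would give the full head; it is left out here because the classifier only needs the sign of the output.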

Activation Functions

Common activation functions include:

1. The ReLU function.
2. The tanh function.
3. The leaky ReLU function

\[ a(x) = \begin{cases} x & x \ge 0 \\ \alpha x & x < 0 \end{cases} \]

where $\alpha>0$ is a hyperparameter.
4. The Exponential Linear Unit (ELU) function [1]

\[ a(x) = \begin{cases} x & x \ge 0 \\ \alpha(e^x - 1) & x < 0 \end{cases} \]

where $\alpha>0$ is a hyperparameter.
5. The Gaussian Error Linear Unit (GELU) [2]

\[ a(x) = x\cdot\Phi(x) \]

where $\Phi$ is the cumulative distribution function of the standard normal distribution.
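For reference, the activation functions listed above can be written as scalar functions using only the standard library. The default values `alpha=0.01` for leaky ReLU and `alpha=1.0` for ELU are assumptions (they are common choices, not values fixed by the text), and $\Phi$ is expressed through the error function.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):  # alpha is an assumed default
    return x if x >= 0 else alpha * x


def elu(x, alpha=1.0):  # alpha is an assumed default
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def gelu(x):
    """x * Phi(x), with the standard normal CDF written via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


activations = {"relu": relu, "tanh": math.tanh, "leaky_relu": leaky_relu,
               "elu": elu, "gelu": gelu}
# Values of each activation at a few sample points.
table = {name: [a(x) for x in (-2.0, 0.0, 2.0)] for name, a in activations.items()}
```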

Parameters and Hyperparameters

The following are hyperparameters of an MLP:

The number of hidden layers $d$.
The number of dimensions of the inner representations $n_1,\dotsc,n_d$.
The initialization distribution of the weight matrices $W_l$ and bias vectors $\mathbf b_l$.
The choice of activation functions.

On the other hand, the entries of the weight matrices and bias vectors are the parameters of the MLP that we want to train. That is, in total, we have $$ \sum_{l=0}^d(n_l + 1)n_{l+1} $$ parameters.
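The parameter count is easy to evaluate for a concrete architecture. In the sketch below, `dims` is the list $[n_0, n_1, \dotsc, n_{d+1}]$; the function name and the example architecture are hypothetical.

```python
def count_parameters(dims):
    """Number of trainable parameters for layer dimensions dims = [n_0, n_1, ..., n_{d+1}]."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(dims, dims[1:]))


# Hypothetical architecture: 2 inputs, two hidden layers of width 16, 1 output.
total = count_parameters([2, 16, 16, 1])  # (2+1)*16 + (16+1)*16 + (16+1)*1 = 337

# The 1D ReLU classifier from earlier: 1 input, one hidden layer of width 2, 1 output.
small = count_parameters([1, 2, 1])  # (1+1)*2 + (2+1)*1 = 7
```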

Universal Approximation Theorems

In what follows, we give a sampling of theoretical results about the expressive power of MLPs.

Definition 1. Let $C(\mathbf R^n, \mathbf R^m)$ denote the collection of continuous functions $\mathbf R^n\to\mathbf R^m$. We let $C(\mathbf R^n,\mathbf R)=C(\mathbf R^n)$.

Definition 2. Consider a subset $\mathscr M\subseteq C(\mathbf R^n, \mathbf R^m)$. Then we say that $\mathscr M\subseteq C(\mathbf R^n, \mathbf R^m)$ is dense in the topology of uniform convergence on compact subsets if for all continuous maps $f\in C(\mathbf R^n, \mathbf R^m)$, compact subsets $K\subseteq\mathbf R^n$ and positive numbers $\varepsilon>0$, there exists a map $g\in\mathscr M$ such that we have $$ \max_{\mathbf x\in K}|f(\mathbf x)-g(\mathbf x)|<\varepsilon. $$

Shallow Neural Networks of Arbitrary Width

Theorem 3. [3, Theorem 3.1] Let $a\in C(\mathbf R)$ be a continuous map. Let $\mathrm{SNN}(a,n)\subseteq C(\mathbf R^n)$ denote the collection of all shallow neural networks with input dimension $n$, final dimension 1, and $a$ as activation function. Then the following statements are equivalent:

The subset $\mathrm{SNN}(a,n)\subseteq C(\mathbf R^n)$ is dense in the topology of uniform convergence on compact subsets.
The map $a\in C(\mathbf R)$ is not a polynomial.
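To get a feeling for the theorem in the simplest case $n=1$ with $a=\mathrm{ReLU}$, note that any piecewise linear interpolant with knots $t_0<\dotsb<t_{N-1}$ can be written as $g(x)=\beta+\sum_i c_i\,\mathrm{ReLU}(x-t_i)$, which is a shallow network with $N$ hidden units. The sketch below builds such a network for $\sin$ on $[0,\pi]$ and measures the maximal error on a grid. It illustrates the statement and is not the proof from [3]; the target function, the interval and the helper names are assumptions made for this example.

```python
import math


def relu(x):
    return max(0.0, x)


def fit_shallow_relu(f, lo, hi, num_units):
    """Build a shallow ReLU network that linearly interpolates f at equispaced knots on [lo, hi].

    Returns (knots, coeffs, bias) such that g(x) = bias + sum_i coeffs[i] * relu(x - knots[i]).
    """
    step = (hi - lo) / num_units
    knots = [lo + i * step for i in range(num_units)]
    values = [f(t) for t in knots] + [f(hi)]
    slopes = [(values[i + 1] - values[i]) / step for i in range(num_units)]
    # Each hidden unit contributes the change of slope at its knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, num_units)]
    return knots, coeffs, values[0]


def shallow_relu(x, knots, coeffs, bias):
    return bias + sum(c * relu(x - t) for t, c in zip(knots, coeffs))


def max_error(f, num_units, lo=0.0, hi=math.pi, grid=2000):
    """Maximal absolute error between f and the network on an evaluation grid."""
    params = fit_shallow_relu(f, lo, hi, num_units)
    xs = [lo + (hi - lo) * i / grid for i in range(grid + 1)]
    return max(abs(f(x) - shallow_relu(x, *params)) for x in xs)


errors = {n: max_error(math.sin, n) for n in (5, 20, 80)}
```

The error shrinks as the number of hidden units grows, which is the behavior that density in the topology of uniform convergence on compact subsets asks for.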

Deep Neural Networks of Limited Width

Theorem 4. [4, Theorem 3.2] Let $a\in C(\mathbf R)$ be a nonaffine continuous map. Suppose that there exists $x\in\mathbf R$ such that $a$ is continuously differentiable at $x$ with nonzero derivative. Let $\mathrm{DNN}(a,n,k)\subseteq C(\mathbf R^n, \mathbf R^m)$ denote the collection of all deep neural networks with input dimension $n$, an arbitrary number of hidden layers with hidden dimension $k$, final dimension $m$, and $a$ as activation function. Then the subset $\mathrm{DNN}(a,n,n+m+2)\subseteq C(\mathbf R^n, \mathbf R^m)$ is dense in the topology of uniform convergence on compact subsets.

Representation Capacity of ReLU MLPs as Function of Depth and Width

Definition 5. Let $\mathbf R^{n_0}\xrightarrow F\mathbf R^{n_1}$ be a piecewise linear function. Then a linear region is a maximal connected subset $X\subseteq\mathbf R^{n_0}$ such that $F|_X$ is linear.

Theorem 6. [5, Corollary 6] A ReLU MLP with input dimension $n_0$ and $L$ hidden layers of width $n\ge n_0$ can compute functions that have $\Omega\left((n/n_0)^{(L-1)n_0}n^{n_0}\right)$ linear regions.
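A small example shows how depth can multiply the number of linear regions. The tent map $T(z)=2\,\mathrm{ReLU}(z)-4\,\mathrm{ReLU}(z-0.5)$ folds $[0,1]$ onto itself, and composing it $L$ times gives a ReLU MLP with $L$ hidden layers of width 2 and input dimension 1. The sketch below counts its linear regions on $[0,1]$ numerically. This sawtooth construction is a standard illustration chosen for this example, not the construction used in [5], and the function names are hypothetical.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, t) for t in v]


def sawtooth_mlp(x, num_hidden_layers):
    """ReLU MLP with `num_hidden_layers` hidden layers of width 2 that computes the
    tent map T(z) = 2*relu(z) - 4*relu(z - 0.5) composed with itself that many times."""
    z = x
    for _ in range(num_hidden_layers):
        hidden = relu([z, z - 0.5])            # affine map into R^2, then ReLU
        z = 2.0 * hidden[0] - 4.0 * hidden[1]  # linear read-out, part of the next affine map
    return z


def count_linear_regions(func, resolution):
    """Count maximal intervals of constant slope on [0, 1], sampling on a dyadic grid."""
    xs = [i / resolution for i in range(resolution + 1)]
    ys = [func(x) for x in xs]
    slopes = [(ys[i + 1] - ys[i]) * resolution for i in range(resolution)]
    return 1 + sum(slopes[i] != slopes[i - 1] for i in range(1, resolution))


# The grid has spacing 2^-(L+2), so every breakpoint k / 2^L lies on the grid.
regions = {L: count_linear_regions(lambda x, L=L: sawtooth_mlp(x, L), 2 ** (L + 2))
           for L in range(1, 6)}
```

For these small depths the count doubles with every hidden layer, that is, $2^L$ regions on $[0,1]$, which agrees with the growth rate $(n/n_0)^{(L-1)n_0}n^{n_0}=2^L$ for $n_0=1$ and $n=2$.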

References

[1] Djork-Arné Clevert, Thomas Unterthiner and Sepp Hochreiter: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), ICLR 2016, arXiv:1511.07289

[2] Dan Hendrycks and Kevin Gimpel: Gaussian Error Linear Units (GELUs), arXiv:1606.08415

[3] Allan Pinkus: Approximation Theory of the MLP Model in Neural Networks, Acta Numerica 8 (1999): 143–195.

[4] Patrick Kidger, Terry Lyons: Universal Approximation with Deep Narrow Networks, Proceedings of Thirty Third Conference on Learning Theory, PMLR 125 (2020): 2306–2327.

[5] Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio: On the Number of Linear Regions of Deep Neural Networks, NIPS 2014, Proceedings of the 28th International Conference on Neural Information Processing Systems 2: 2924–2932.

¹ The value of $H(0)$ varies by convention.