3/19 Initialization

Last time, we defined Multilayer Perceptrons (MLPs) as a family of parametric functions of the form $$ \mathbf R^{n_0}\xrightarrow{A^{(0)}}\mathbf R^{n_1}\xrightarrow a \mathbf R^{n_1}\xrightarrow{A^{(1)}}\mathbf R^{n_2}\xrightarrow a \dotsb\xrightarrow a\mathbf R^{n_L}\xrightarrow{A^{(L)}}\mathbf R^{n_{L+1}} $$ where for each $l=0,\dotsc,L$ we have an affine transformation $$ A^{(l)}(\mathbf x)=\mathbf x^TW^{(l)} + \mathbf b^{(l)} $$ parametrized by the weight matrix $W^{(l)}\in\mathbf R^{n_l\times n_{l+1}}$ and the bias vector $\mathbf b^{(l)}\in\mathbf R^{n_{l+1}}$.

Chain Rule for Jacobians

Since an MLP is a composite function, gradients propagate through the model via the chain rule. Let $$ \mathbf R^l\xrightarrow{f=(f_1,\dotsc,f_{m})}\mathbf R^m $$ be a differentiable vector-to-vector function, and $\mathbf x\in\mathbf R^l$ a vector. Then the Jacobian of $f$ at $\mathbf x$ is the matrix

\[ J_f(\mathbf x)= \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_l} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_l} \end{pmatrix} \]

Theorem (Chain Rule for Vector-To-Vector functions) Let $\mathbf R^l\xrightarrow f\mathbf R^m\xrightarrow g\mathbf R^n$ be a composable pair of differentiable functions. Then for any $\mathbf x\in\mathbf R^l$, we have $$ J_{g\circ f}(\mathbf x)=J_g(f(\mathbf x))J_f(\mathbf x). $$
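The theorem can be checked numerically. The following is a minimal sketch, assuming NumPy; the maps `f` and `g`, the matrix `M` and the helper `numerical_jacobian` are hypothetical examples chosen for illustration, and the Jacobians are only estimated by central finite differences.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference estimate of the Jacobian of func at x (rows: outputs)."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(func(x))
    jac = np.zeros((fx.size, x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        jac[:, i] = (func(x + step) - func(x - step)) / (2 * eps)
    return jac

# Hypothetical example maps f: R^3 -> R^2 and g: R^2 -> R^2.
M = np.array([[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]])

def f(x):
    return np.tanh(M @ x)

def g(y):
    return np.array([y[0] * y[1], np.sin(y[0]) + y[1] ** 2])

x = np.array([0.3, -0.2, 0.1])
lhs = numerical_jacobian(lambda v: g(f(v)), x)                 # J_{g o f}(x)
rhs = numerical_jacobian(g, f(x)) @ numerical_jacobian(f, x)   # J_g(f(x)) J_f(x)
chain_rule_gap = np.max(np.abs(lhs - rhs))
print(chain_rule_gap)   # should be tiny, limited only by finite-difference error
```

Note the order of the product: the Jacobian of the outer function $g$ comes first and is evaluated at $f(\mathbf x)$, not at $\mathbf x$.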

Recall that in GD, given a parametric model

\[ \mathbf R^n\times\mathbf R^N\xrightarrow{f=f(\mathbf x,\theta)}\mathbf R^k, \]

we take its gradient $\nabla_\theta f$ with respect to the parameters $\theta$. That is, we fix the input feature vector $\mathbf x$ and differentiate with respect to the parameters.

Example (Jacobian of an affine transformation) Consider the case of an affine transformation $f(\mathbf x,W,\mathbf b)=\mathbf x^T W+\mathbf b$ with weight matrix $W\in\mathbf R^{n\times k}$ and bias vector $\mathbf b\in\mathbf R^k$. Then the Jacobian with respect to the parameters is made up of the following partial derivatives, where $j'=1,\dotsc,k$ indexes the output component:

\[ \left(\frac{\partial f}{\partial w_{ij}}\right)_{j'}=\begin{cases} x_i & j=j',\\ 0 & \text{else,} \end{cases} \]

\[ \left(\frac{\partial f}{\partial b_j}\right)_{j'}=\begin{cases} 1 & j=j',\\ 0 & \text{else.} \end{cases} \]
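As a sanity check of these partial derivatives, here is a small sketch, assuming NumPy, that compares them with central finite differences. The sizes `n` and `k`, the random values and the array layouts `dW[j', i, j]` and `db[j', j]` are choices made only for this illustration.

```python
import numpy as np

def affine(x, W, b):
    """f(x, W, b) = x^T W + b."""
    return x @ W + b

n, k = 3, 2                      # hypothetical sizes
rng = np.random.default_rng(0)
x = rng.normal(size=n)
W = rng.normal(size=(n, k))
b = rng.normal(size=k)

# Analytic partials, stored as dW[j', i, j] and db[j', j].
dW = np.zeros((k, n, k))
for j in range(k):
    dW[j, :, j] = x              # (df/dw_ij)_{j'} = x_i when j == j', else 0
db = np.eye(k)                   # (df/db_j)_{j'} = 1 when j == j', else 0

# Central finite differences for comparison.
eps = 1e-6
dW_num = np.zeros_like(dW)
for i in range(n):
    for j in range(k):
        step = np.zeros_like(W)
        step[i, j] = eps
        dW_num[:, i, j] = (affine(x, W + step, b) - affine(x, W - step, b)) / (2 * eps)

db_num = np.zeros_like(db)
for j in range(k):
    step = np.zeros_like(b)
    step[j] = eps
    db_num[:, j] = (affine(x, W, b + step) - affine(x, W, b - step)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6), np.allclose(db, db_num, atol=1e-6))
```

In particular, the weight $w_{ij}$ only influences the $j$-th output component, and it does so with slope $x_i$.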

Vanishing and Explosion of Values

Now we shall investigate the variance of the input and hidden feature vectors as data progresses through an MLP. Note that, by the chain rule, these variances directly affect the variances of the gradients.

Recall that $X$ denotes the input feature vector random variable. Let's also denote this $H^{(0)}$, and introduce additional notation for the random variables we get as the data passes through the MLP:

\[\begin{align*} Z^{(1)} &= XW^{(0)}+\mathbf b^{(0)} \\ H^{(1)} &= a(Z^{(1)}) \\ &\vdots \\ Z^{(l+1)} &= H^{(l)}W^{(l)}+\mathbf b^{(l)} \\ H^{(l+1)} &= a(Z^{(l+1)}) \\ &\vdots \\ Z^{(L+1)} &= H^{(L)}W^{(L)}+\mathbf b^{(L)} \end{align*}\]

The affine images $Z^{(l)}$ are called pre-activations, and their images $H^{(l)}=a(Z^{(l)})$ under the activation function are called hidden units.
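To make the notation concrete, here is a minimal sketch of a forward pass, assuming NumPy, that keeps every pre-activation and hidden unit. The function name `forward`, the layer sizes and the choice of $\tanh$ as activation are assumptions made for illustration only.

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Run an MLP forward pass, keeping every intermediate value.

    Returns (pre_activations, hidden), where pre_activations[l] holds Z^(l+1)
    and hidden[l] holds H^(l), with H^(0) = x.
    """
    hidden = [x]
    pre_activations = []
    last = len(weights) - 1
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = hidden[-1] @ W + b            # Z^(l+1) = H^(l) W^(l) + b^(l)
        pre_activations.append(z)
        if l < last:                      # no activation after the final affine map
            hidden.append(activation(z))  # H^(l+1) = a(Z^(l+1))
    return pre_activations, hidden

# Hypothetical layer sizes n_0, ..., n_{L+1} with L = 2.
layer_sizes = [4, 5, 3, 2]
rng = np.random.default_rng(0)
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.normal(size=(8, layer_sizes[0]))  # a batch of 8 input vectors as rows
pre_activations, hidden = forward(x, weights, biases)
print([z.shape for z in pre_activations])
print([h.shape for h in hidden])
```

With $L=2$ there are three pre-activations $Z^{(1)},Z^{(2)},Z^{(3)}$ and three hidden units $H^{(0)},H^{(1)},H^{(2)}$, matching the index ranges above.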

Let's see how the variance of the $i$-th feature vector component changes as the data passes through the $l$-th layer. We use the following general results on expected value:

Proposition [1, Appendix C.2.1]. Let $X$ and $Y$ be random variables. Then we have the following results:

Expected value is linear: for any scalars $a,b\in\mathbf R$, we have $$ \mathbf E(aX+bY)=a\mathbf EX+b\mathbf EY. $$
Suppose that $X$ and $Y$ are independent. Then expected value commutes with product: we have $$ \mathbf E(XY)=\mathbf E(X)\mathbf E(Y). $$

Theorem. Suppose that

for $l=0,\dotsc,L$, the initial weights $w^{(l)}_{ij}$ are drawn independently from an identical distribution with mean 0 and variance $\sigma_l^2$,
the biases $\mathbf b^{(l)}$ are initialized to zero, and
the input feature vector random variable $X$ has mean¹ $\mathbf EX=0$ and variance $\mathbf VX=1$.

Then the following statements hold:

The means of the pre-activations are zero: $\mathbf EZ^{(l)}=0$ for all $l=1,\dotsc,L+1$.
The second moment $\mathbf E((H^{(l)}_j)^2)$ of the $j$-th component of the $l$-th hidden unit does not depend on the component index $j=1,\dotsc,n_l$. Let's denote this value by $\mathbf E((H^{(l)})^2)$.
The variance of the $j$-th component of the $l$-th pre-activation does not depend on $j$ either. Moreover it has the following formula:

\[ \mathbf VZ^{(l+1)}:=\mathbf VZ^{(l+1)}_j=n_l\mathbf E((H^{(l)})^2)\sigma_l^2. \]

Remark. In practice, we can enforce $\mathbf EX=0$ and $\mathbf VX=1$ on the empirical distribution by normalizing the feature vector set $\{\mathbf x_k:k=1,\dotsc,N\}$:

\[\begin{align*} \mathbf x_k &\leftarrow \mathbf x_k - \frac{1}{N}\sum_{k'=1}^N\mathbf x_{k'} \\ \mathbf x_k &\leftarrow \frac{\mathbf x_k}{\sqrt{\frac{1}{N}\sum_{k'=1}^N\mathbf x^{\odot 2}_{k'}}} \end{align*}\]

where by $\mathbf x^{\odot 2}_{k'}$ we denote elementwise squares, and moreover, the square root and division operations are performed componentwise.
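The two normalization steps can be sketched as follows, assuming NumPy and a data matrix whose rows are the feature vectors $\mathbf x_k$. The data here is synthetic and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data set: N = 1000 feature vectors with 4 components each.
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))

# Step 1: subtract the componentwise mean.
X_centered = X - X.mean(axis=0)
# Step 2: divide by the componentwise root mean square of the centered data.
X_normalized = X_centered / np.sqrt((X_centered ** 2).mean(axis=0))

print(X_normalized.mean(axis=0))         # approximately 0 in every component
print((X_normalized ** 2).mean(axis=0))  # approximately 1 in every component
```

In a training setup one would typically compute the mean and the scale on the training set only and reuse them for validation and test data.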

Proof of Theorem. That $\mathbf E((H^{(l)}_j)^2)$ and $\mathbf E((Z^{(l)}_j)^2)$ do not depend on $j$ follows from the fact that all weights within a layer are sampled independently from the same distribution.

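Let's first calculate the expected value of the $j$-th component of the $(l+1)$-st pre-activation. Since $H^{(l)}$ depends only on the input and on the weights of the earlier layers, it is independent of $W^{(l)}$. As the biases are zero, we get:

\[\begin{align*} \mathbf EZ^{(l+1)}_j &= \mathbf E\left(\sum_{i=1}^{n_{l}}H^{(l)}_iW^{(l)}_{ij}\right) \\ &= \sum_{i=1}^{n_{l}}\mathbf E(H^{(l)}_i)\mathbf E(W^{(l)}_{ij}) \\ &= 0. \end{align*}\]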

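For the variance formula, note that since $\mathbf EZ^{(l+1)}_j=0$, the variance equals the second moment. When we expand the square, the cross terms $\mathbf E(H^{(l)}_iH^{(l)}_{i'}W^{(l)}_{ij}W^{(l)}_{i'j})$ with $i\ne i'$ vanish, because the weights are independent of each other and of the hidden units, and they have mean zero. Then we have: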

\[\begin{align*} \mathbf VZ^{(l+1)}_j &= \mathbf E\left(\left(Z^{(l+1)}_j\right)^2\right) \\ &= \mathbf E\left(\left(\sum_{i=1}^{n_{l}}H^{(l)}_iW^{(l)}_{ij}\right)^2\right) \\ &= \sum_{i=1}^{n_{l}}\mathbf E((H^{(l)}_i)^2)\mathbf E((W^{(l)}_{ij})^2) \\ &= n_l\mathbf E((H^{(l)})^2)\sigma_l^2. \end{align*}\]

$\square$
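The variance formula can also be checked by simulation. Below is a minimal Monte Carlo sketch, assuming NumPy, for a single layer. The widths `n` and `k`, the value of `sigma2`, and the choice of hidden units distributed as the ReLU of standard normals (so that $\mathbf E((H^{(l)})^2)=\frac12$) are assumptions made only for this demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3              # hypothetical layer widths n_l and n_{l+1}
sigma2 = 0.04             # weight variance sigma_l^2
trials = 20_000

# Hidden units: ReLU of standard normals, so E(H^2) = 1/2 (demo assumption).
H = np.maximum(rng.normal(size=(trials, n)), 0.0)
# Fresh weights in every trial, with mean 0 and variance sigma2.
W = rng.normal(scale=np.sqrt(sigma2), size=(trials, n, k))

# Z_j = sum_i H_i W_ij, with zero biases.
Z = np.einsum("ti,tij->tj", H, W)

empirical = Z.var(axis=0)                 # one variance estimate per component j
predicted = n * np.mean(H ** 2) * sigma2  # n_l E((H^(l))^2) sigma_l^2
print(empirical, predicted)
```

With these numbers the formula predicts a variance of about $50\cdot\frac12\cdot0.04=1$ for each of the three components, and the empirical estimates should agree up to sampling noise.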

Now note the following:

If the $\sigma_l$ are very big, then $\lim_{l\to\infty}n_l\mathbf E((H^{(l)})^2)\sigma_l^2=\infty$. Thus, the function becomes chaotic.
On the other hand, if the $\sigma_l$ are very small, then $\lim_{l\to\infty}n_l\mathbf E((H^{(l)})^2)\sigma_l^2=0$. Thus, the function becomes constant.
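Both regimes are easy to reproduce. The following sketch, assuming NumPy, pushes a batch through a deep ReLU network with zero biases and Gaussian weights of variance `scale / width`. The helper name, the width, the depth and the ReLU activation are hypothetical choices; for ReLU with symmetric pre-activations, the variance formula predicts a factor of `scale / 2` per layer.

```python
import numpy as np

def final_preactivation_variance(scale, width=256, depth=20, batch=512, seed=0):
    """Variance of the last pre-activation in a ReLU MLP with weight variance scale / width."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))        # inputs with mean 0 and variance 1
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(scale / width), size=(width, width))
        z = h @ W                              # biases are zero
        h = np.maximum(z, 0.0)                 # ReLU
    return z.var()

var_large = final_preactivation_variance(scale=4.0)   # sigma^2 = 4 / n: doubles per layer
var_small = final_preactivation_variance(scale=1.0)   # sigma^2 = 1 / n: halves per layer
print(var_large, var_small)
```

After 20 layers the two variances differ by many orders of magnitude, even though the weight variances differ only by a factor of 4.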

How can we prevent these phenomena from happening?

Glorot Initialization [2]: Approximately Linear Activation Around 0

If $a$ is approximately linear around 0 with $a(0)=0$, e.g. $a=\tanh$, then we can approximate $\mathbf E((H^{(l)})^2)=\mathbf E(a(Z^{(l)})^2)$ by $a'(0)^2\mathbf E((Z^{(l)})^2)$ for $l>0$. Therefore, if we let $\sigma_l^2=\frac{1}{n_la'(0)^2}$ for $l>0$ and $\sigma_0^2=\frac{1}{n_0}$, then we get $\mathbf V(Z^{(l)})\approx1$ for all $l$.
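Spelled out, the argument is the following short derivation. It is only a sketch: it assumes $a(0)=0$, it uses the first-order approximation $a(z)\approx a'(0)z$, which is reasonable only while the pre-activations stay close to 0, and it uses that $\mathbf EZ^{(l)}=0$, so that the second moment of $Z^{(l)}$ equals its variance.

\[\begin{align*} \mathbf VZ^{(l+1)} &= n_l\,\mathbf E((H^{(l)})^2)\,\sigma_l^2 \\ &\approx n_l\,a'(0)^2\,\mathbf E((Z^{(l)})^2)\,\sigma_l^2 \\ &= n_l\,a'(0)^2\,\sigma_l^2\,\mathbf VZ^{(l)} \\ &= \mathbf VZ^{(l)} \quad\text{if } \sigma_l^2=\frac{1}{n_la'(0)^2}. \end{align*}\]

For example, for $a=\tanh$ we have $a'(0)=1$, so a layer with $n_l=256$ inputs gets $\sigma_l^2=\frac1{256}$, that is, $\sigma_l=\frac1{16}=0.0625$.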

He Initialization [3]: Using ReLU

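If $a=\mathrm{ReLU}$ and the distribution of $Z^{(l)}$ is symmetric around 0, then we get $\mathbf E(a(Z^{(l)})^2)=\frac12\mathbf E((Z^{(l)})^2)$. Therefore, if we let $\sigma_l^2=\frac{2}{n_l}$, then we get $\mathbf VZ^{(l+1)}=\mathbf VZ^{(l)}$ for all $l\ge1$, so the variance of the pre-activations is preserved from layer to layer. It equals 1 throughout if the first layer uses $\sigma_0^2=\frac{1}{n_0}$, since $\mathbf E(X^2)=1$.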
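Both claims can be illustrated with a short simulation. This is a sketch, assuming NumPy, Gaussian weights and hypothetical values for the width, depth and batch size. It first estimates the ratio $\mathbf E(\mathrm{ReLU}(Z)^2)/\mathbf E(Z^2)$ for a standard normal $Z$, and then records the pre-activation variance layer by layer with $\sigma_l^2=\frac2{n_l}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Part 1: for a symmetric Z, ReLU keeps half of the second moment.
z = rng.normal(size=200_000)
ratio = np.mean(np.maximum(z, 0.0) ** 2) / np.mean(z ** 2)

# Part 2: forward pass through a ReLU MLP with sigma_l^2 = 2 / n_l and zero biases.
width, depth, batch = 512, 10, 256
h = rng.normal(size=(batch, width))   # inputs with mean 0 and variance 1
variances = []
for _ in range(depth):
    W = rng.normal(scale=np.sqrt(2.0 / width), size=(width, width))
    pre = h @ W                       # pre-activation of the next layer
    variances.append(pre.var())
    h = np.maximum(pre, 0.0)          # ReLU

print(ratio)
print(variances)
```

In this sketch every layer, including the first one, uses $\sigma_l^2=\frac2{n_l}$, so the variances hover around 2 rather than 1: the input has second moment 1 instead of $\frac12$. What matters is that they neither grow nor shrink systematically with depth.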

SELU [4]: Set parameters in ELU

One can draw the entries of $W^{(l)}$ from a normal distribution with mean 0 and variance $n_l^{-1}$ (that is, standard deviation $n_l^{-0.5}$), and set the parameters in the activation

\[ a(z)=\lambda\begin{cases} z & z > 0 \\ \alpha(e^z - 1) & z \le0 \end{cases} \]

so that we have $\mathbf E(H^{(l)})\approx0$ and $\mathbf V(H^{(l)})\approx1$ for all $l$. The parameters are $\lambda\approx1.0507$ and $\alpha\approx1.6733$.
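The effect of these parameter values can be seen on a single layer. The sketch below, assuming NumPy, applies SELU with the rounded constants quoted above to standard normal pre-activations and estimates the mean and variance of the result. The function name `selu` and the sample size are choices made for illustration.

```python
import numpy as np

LAMBDA = 1.0507   # rounded values quoted in the text
ALPHA = 1.6733

def selu(z):
    """SELU: lambda * z for z > 0, and lambda * alpha * (exp(z) - 1) for z <= 0."""
    # np.minimum keeps expm1 from overflowing on large positive inputs.
    negative_branch = ALPHA * np.expm1(np.minimum(z, 0.0))
    return LAMBDA * np.where(z > 0, z, negative_branch)

rng = np.random.default_rng(0)
z = rng.normal(size=200_000)   # pre-activations with mean 0 and variance 1
h = selu(z)
print(h.mean(), h.var())       # both should be close to 0 and 1, respectively
```

So if the pre-activations are approximately standard normal, the hidden units again have mean close to 0 and variance close to 1, and weights of variance $n_l^{-1}$ carry this over to the next layer's pre-activations.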

Layer-Sequential Unit Variance (LSUV) Initialization [5]: Estimate it on a minibatch

If all else fails, one can still take randomly initialized weights and then set the variance to about 1 layer by layer: using just one minibatch, one repeatedly calculates the layer's output and divides the weights by the standard deviation of that output, until the variance gets close enough to 1. The authors of [5] initialize the weights as random orthogonal matrices, since there are results indicating that networks can be trained more quickly that way [6].
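The procedure can be sketched as follows, assuming NumPy, zero biases and fully connected layers. This is a simplified illustration of the idea rather than the reference implementation of [5]: the function names, the tolerance, the iteration cap and the QR-based construction of matrices with orthonormal columns or rows are all assumptions made here.

```python
import numpy as np

def orthogonal(rng, n_in, n_out):
    """Random (n_in, n_out) matrix with orthonormal columns or rows, via QR."""
    a = rng.normal(size=(max(n_in, n_out), min(n_in, n_out)))
    q, _ = np.linalg.qr(a)               # q has orthonormal columns
    return q if n_in >= n_out else q.T

def lsuv_init(rng, layer_sizes, batch, activation=np.tanh, tol=0.01, max_iter=10):
    """Rescale each layer's weights until its output variance on `batch` is about 1."""
    weights = []
    h = batch
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = orthogonal(rng, n_in, n_out)
        for _ in range(max_iter):
            var = (h @ W).var()
            if abs(var - 1.0) < tol:
                break
            W = W / np.sqrt(var)         # divide the weights by the output std
        weights.append(W)
        h = activation(h @ W)            # input of the next layer
    return weights

rng = np.random.default_rng(0)
layer_sizes = [20, 30, 30, 10]           # hypothetical n_0, ..., n_{L+1}
batch = rng.normal(size=(256, layer_sizes[0]))
weights = lsuv_init(rng, layer_sizes, batch)

# Recompute the pre-activation variances on the same minibatch.
layer_variances = []
h = batch
for W in weights:
    pre = h @ W
    layer_variances.append(pre.var())
    h = np.tanh(pre)
print(layer_variances)
```

Because a fully connected layer with zero bias is linear in its weights, a single rescaling already brings the variance to 1 here; the loop is kept to mirror the iterative description above.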

References

[1] Simon J.D. Prince, Understanding Deep Learning, MIT Press, 2023, http://udlbook.com

[2] Xavier Glorot, Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010, http://proceedings.mlr.press/v9/glorot10a.html

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 1026-1034, doi: 10.1109/ICCV.2015.123. https://www.cv-foundation.org/openaccess/content_iccv_2015/html/He_Delving_Deep_into_ICCV_2015_paper.html

[4] Günter Klambauer, Thomas Unterthiner, Andreas Mayr and Sepp Hochreiter, Self-Normalizing Neural Networks, Advances in Neural Information Processing Systems (NIPS), 2017, https://papers.nips.cc/paper_files/paper/2017/hash/5d44ee6f2c3f71b73125876103c8f6c4-Abstract.html

[5] Dmytro Mishkin and Jiri Matas, All you Need is a Good Init, https://arxiv.org/abs/1511.06422

[6] Andrew M. Saxe, James L. McClelland and Surya Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Proceedings of ICLR, 2014, http://arxiv.org/abs/1312.6120

¹ By a vector being equal to a scalar, we mean that every component of the vector equals that scalar.