4/25 Dropout, Normalization, and Residual Layers

Today, we study various improvements to ANNs.

Dropout [1]

In an Artificial Neural Network (ANN), we call a neuron the part of the network producing a single hidden feature, that is, a column \((w^{(l)}_{ij}:i=0,\dotsc,n_l-1)\) of a weight matrix together with a bias vector component \(b^{(l)}_j\), given a layer index \(l\) and a hidden feature index \(0\le j< n_l\).

Dropout is a regularization technique based on the observation that neural networks are complex systems built up of many neurons: in each training step, we randomly mask a given proportion \(p\) of the neurons. This means temporarily setting the parameters of the chosen neurons to zero and excluding them from backpropagation and the optimization step. In practice, this amounts to setting random entries of the output tensor to zero.

This training procedure makes the network learn features that depend less on complex combinations of all features in the previous layer and are thus more robust. To keep the scale of the activations unchanged in expectation, we multiply the surviving outputs by \(\frac{1}{1-p}\). During inference, we again use all neurons.

Dropout makes the trained network function similarly to an ensemble of subnetworks.
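The masking and rescaling described above can be sketched in a few lines of NumPy. This is the inverted-dropout variant from the notes, not a library implementation; the function name and signature are illustrative.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero a proportion p of entries, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x  # at inference, all neurons are used and no rescaling is needed
    mask = rng.random(x.shape) >= p  # keep each entry with probability 1 - p
    return x * mask / (1.0 - p)      # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = dropout(x, p=0.5, rng=rng)  # surviving entries are rescaled from 1.0 to 2.0
```

Note that with `training=False` the input passes through untouched, matching the behaviour at inference.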

Normalization

In Notebook 0319, we discussed how controlling the data variance helps training and inference. With initialization methods, we saw how to set up the network before training. But as discussed there, there is no guarantee that the variance of the data flowing through the network remains controlled as the parameters change. To keep the data normalized, we can use normalization layers, inserted inside the ANN.

Batch Normalization [2]

In batch normalization, we want to normalize single feature components. Let \(n_1,\dotsc,n_s\) be sequence dimensions and \(d\) the feature dimension. Then, given a hidden representation random variable \(X^{(\ell)}\in\mathbf R^{n_1\times\dotsb\times n_s\times d}\) in layer \(\ell\), we want

\[ \mathbf EX^{(\ell)}_\mathbf i=0\text{ and }\mathbf VX^{(\ell)}_\mathbf i=1 \]

for all index tuples \(\mathbf i\in[n_1]\times\dotsb\times[n_s]\times[d]\).

Previously, we estimated the mean and variance of data on the entire dataset. As hidden representation distributions change during training, it would be prohibitively expensive to do that at each training step. Thus, at each training step, we estimate these values from minibatch statistics; hence the name.

Learned affine transformation

An additional requirement is for the network to retain the capacity to represent the identity transformation. To this end, we introduce learnable scale \(\gamma^{(\ell)}_\mathbf i\) and offset \(\beta^{(\ell)}_\mathbf i\) parameters. This makes the mapping

\[ x^{(\ell)}_\mathbf i\mapsto\frac{x^{(\ell)}_\mathbf i-\hat{\mathbf E} x^{(\ell)}_\mathbf i}{\sqrt{\hat{\mathbf V} x^{(\ell)}_\mathbf i + \epsilon}}\gamma^{(\ell)}_\mathbf i + \beta^{(\ell)}_\mathbf i, \]

where the hyperparameter \(\epsilon\) is there for numerical stability.

Rolling Averages at Inference

At inference time, such as during evaluation, we estimate the mean and variance by rolling averages

\[ y\leftarrow(1-\mu) y + \mu\, y_\mathrm{new} \]

of the minibatch statistics, where \(\mu\) is a momentum hyperparameter.
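Putting the pieces together, the following is a minimal NumPy sketch of batch normalization for inputs of shape (batch, features); the class name, the `momentum` default, and the shape convention are illustrative assumptions, not taken from the notes.

```python
import numpy as np

class BatchNorm:
    """Batch normalization over the batch axis for inputs of shape (batch, d)."""
    def __init__(self, d, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(d)    # learnable scale
        self.beta = np.zeros(d)    # learnable offset
        self.eps = eps             # numerical-stability hyperparameter
        self.momentum = momentum   # mu in the rolling-average update
        self.running_mean = np.zeros(d)
        self.running_var = np.ones(d)

    def __call__(self, x, training=True):
        if training:
            mean = x.mean(axis=0)  # minibatch statistics
            var = x.var(axis=0)
            # rolling averages used at inference: y <- (1 - mu) y + mu y_new
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return x_hat * self.gamma + self.beta

rng = np.random.default_rng(1)
x = rng.normal(5.0, 3.0, size=(64, 3))
bn = BatchNorm(3)
out = bn(x, training=True)  # each feature component now has mean ~0, variance ~1
```

In training mode the minibatch statistics are used directly; in evaluation mode the stored rolling averages take their place.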

Layer Normalization [3]

Note that batch normalization requires fixing the sequence dimensions \(n_1,\dotsc,n_s\) beforehand. Thus, it cannot be used with data of varying sequence dimensions, such as:

  1. text
  2. graphs
  3. videos

Therefore, in layer normalization, we instead normalize across the feature dimensions within a single data entry. Note in particular that this means layer normalization works the same way in training and evaluation.

This too includes learnable scale and offset parameters.
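A minimal NumPy sketch of normalizing within a single data entry, here across the last (feature) dimension; the function name and the choice to normalize only over the last axis are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each data entry across its last (feature) dimension.
    No batch statistics are involved, so training and evaluation coincide."""
    mean = x.mean(axis=-1, keepdims=True)  # per-entry statistics
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

d = 8
gamma, beta = np.ones(d), np.zeros(d)  # learnable scale and offset
rng = np.random.default_rng(2)
# unlike batch normalization, varying sequence lengths pose no problem:
short = layer_norm(rng.normal(size=(3, d)), gamma, beta)
long = layer_norm(rng.normal(size=(11, d)), gamma, beta)
```

Each row is normalized independently, which is why sequences of different lengths can be handled uniformly.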

ResNet [4]

When discussing batch normalization, we mentioned that networks benefit from it being easy to represent an identity function. As a further step towards this end, one can include skip connections in a network: one or multiple layers learn not a function \(F\) itself, but its difference from the identity. That is, we make a function \(\Delta F\) learnable and take the module output to be \(F(X) = X + \Delta F(X)\).

ResNet is a CNN with such structure that won the 2015 ImageNet Large Scale Visual Recognition Challenge.

Note that if the sequence dimensions of a module's input and output differ, or the numbers of input and output channels differ, then we cannot add the input back via an identity map. The next best option is to use an appropriate projection (such as a \(1\times 1\) convolution) or repetition map.
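The skip-connection structure can be sketched as follows; the function names and the repetition map used for the shape mismatch are hypothetical illustrations, not the specific maps used in ResNet.

```python
import numpy as np

def residual_block(x, delta_f, project=None):
    """Skip connection: output = shortcut + delta_F(x).
    `project` (hypothetical helper) maps x to the output shape when the
    input and output dimensions differ; delta_f must produce that shape."""
    shortcut = x if project is None else project(x)
    return shortcut + delta_f(x)

x = np.arange(6.0).reshape(2, 3)

# with delta_F = 0, the block is exactly the identity, which is what makes
# identity functions easy to represent:
identity_out = residual_block(x, lambda v: np.zeros_like(v))

# when the number of channels changes, a repetition map stands in for the identity:
wider = residual_block(x, lambda v: np.zeros((2, 6)),
                       project=lambda v: np.repeat(v, 2, axis=1))
```

Setting \(\Delta F = 0\) recovers the input unchanged, so the optimizer only has to learn the deviation from the identity.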

References

[1] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever and Ruslan R. Salakhutdinov: Improving neural networks by preventing co-adaptation of feature detectors. 2012. link

[2] Sergey Ioffe and Christian Szegedy: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, vol. 37, pp. 448--456. link

[3] Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E. Hinton: Layer Normalization. 2016. https://arxiv.org/abs/1607.06450

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun: Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770--778. doi: 10.1109/CVPR.2016.90. link