4/23 Convolutional Neural Networks

History

Convolutional neural networks (CNNs) are among the earliest artificial neural networks (ANNs) to have been studied.

In the 1950s and 1960s, Hubel and Wiesel discovered that the cat visual cortex contains neurons that respond to small regions of the visual field. This partial information is then aggregated by other neurons. [1]

1979: Based on this idea, Kunihiko Fukushima constructed a proto-CNN called the neocognitron. [2]

1989: Using backpropagation, LeCun et al. trained LeNet, a CNN for handwritten ZIP code recognition. It was successfully used by the U.S. Postal Service (USPS). [3]

2004: Oh and Jung wrote a GPU implementation for MLPs, achieving a 20-fold speedup. [4]

2011-12: Cireşan et al. wrote a GPU implementation for CNNs, achieving a 60-fold speedup. [5] With it, they built a CNN ensemble [6] that won a German traffic sign recognition competition [7], surpassing the accuracy of human annotators.

2012: Krizhevsky et al. created AlexNet [8], a similar GPU-based CNN that won the 2012 ImageNet competition [9] by a large margin. This event is widely viewed as a milestone in the rise of deep learning.

Convolution: a Translation-Equivariant Operation

Recall that in Notebook 0418, we described the linear maps \(\mathbf R^{m\times d}\to\mathbf R^{m\times d'}\) that are equivariant under the \(S_m\)-action given by permutation of matrix rows.

This time, we are interested in a less restrictive condition: equivariance under cyclic shifts of the row indices, that is, under the cyclic subgroup of \(S_m\) generated by the map \(\pi(i) = i + 1 \bmod m\).

The interest comes from pattern recognition: a 1D sequence can be, for example,

text represented as a sequence of word vectors, or
sound represented as a sequence of wave components.

One can check by hand the following:

Theorem. A linear map

\[ \mathbf R^{m\times d}\xrightarrow L\mathbf R^{m\times d'} \]

is equivariant under cyclical shifts of the matrix rows if and only if there exist matrices \(W_1,\dotsc,W_m\in\mathbf R^{d\times d'}\) such that for \(X\in\mathbf R^{m\times d}\), we have

\[ L(X)_i=\sum_{k=1}^m X_{(i + k) \bmod m}\,W_k. \]

We can similarly form linear maps \(\mathbf R^{m\times n\times d}\to\mathbf R ^{m\times n\times d'}\) that are equivariant under horizontal and vertical cyclical shifts for image pattern recognition, etc.
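
The "if" direction of the 1D theorem can be checked numerically. The following is a minimal sketch in plain Python, not part of the original notes: the helper names `vec_mat`, `circular_conv` and `shift_rows` are made up for illustration, and rows and kernel matrices are indexed from 0 instead of 1, which only relabels the \(W_k\).

```python
import random

def vec_mat(x, W):
    """Row vector x (length d) times matrix W (d x d') -> list of length d'."""
    return [sum(x[a] * W[a][b] for a in range(len(x))) for b in range(len(W[0]))]

def circular_conv(X, Ws):
    """L(X)_i = sum_k X_{(i + k) mod m} W_k, with 0-based i and k."""
    m = len(X)
    d_out = len(Ws[0][0])
    out = []
    for i in range(m):
        row = [0.0] * d_out
        for k, W in enumerate(Ws):
            term = vec_mat(X[(i + k) % m], W)
            row = [r + t for r, t in zip(row, term)]
        out.append(row)
    return out

def shift_rows(X, t=1):
    """Cyclic shift: row i of the result is row (i + t) mod m of X."""
    m = len(X)
    return [X[(i + t) % m] for i in range(m)]

random.seed(0)
m, d, d_out = 5, 3, 2
X = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(m)]
# One d x d' weight matrix per displacement k = 0, ..., m - 1.
Ws = [[[random.uniform(-1, 1) for _ in range(d_out)] for _ in range(d)]
      for _ in range(m)]

lhs = circular_conv(shift_rows(X), Ws)   # shift first, then apply L
rhs = shift_rows(circular_conv(X, Ws))   # apply L first, then shift
max_diff = max(abs(a - b) for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
print(max_diff < 1e-12)
```

Both orders of composition produce the same matrix, which is exactly the equivariance condition \(L(\pi X)=\pi L(X)\) for the generator \(\pi\).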

Locality and Kernels

Note that the above construction requires fixing the sequence shape (m,), (m,n,), etc. Moreover, inspired by the cat visual cortex, we want our model to consider only local connections. This means fixing a kernel, that is, selecting a collection of index displacements that receive nonzero weights, and no longer identifying indices modulo the sequence length.

To help understand convolution kernels, let's write out the exact formula:

(Hyper)parameters:
1. Sequence size \(m\)
2. Kernel size \(k\)
3. Dilation \(l\)
4. Padding \(p_1,p_2\)
5. Stride \(s\)
Displacements: \(I=\{0, l, \dotsc, l(k-1)\}\).
Base indices: \(J=\{-p_1,-p_1 + s,\dotsc,-p_1+s(m'-1)\}\), that is, the indices \(-p_1+sn\) that do not exceed \(m-1+p_2\), so \(m'=\lfloor\frac{m-1+p_1+p_2}{s}\rfloor+1\).
Then the matrices \(\{W_i\in\mathbf R^{d\times d'}:i=0,\dotsc,k-1\}\) determine a map \(\mathbf R^{m\times d}\xrightarrow L\mathbf R^{m'\times d'}\) where

\[ L(X)_j=\sum\{X_{li+j}W_i:i=0,\dotsc,k-1, 0\le li+j<m\}. \]
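
The formula above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function name `conv1d` and its keyword arguments are made up here, and the base indices are assumed to run from \(-p_1\) in steps of \(s\) up to at most \(m-1+p_2\), which gives \(m'=\lfloor(m-1+p_1+p_2)/s\rfloor+1\) output rows. Deep learning libraries typically also subtract the kernel extent \(l(k-1)\) when computing the output length, so that windows never overhang the padded sequence; this sketch does not.

```python
def conv1d(X, Ws, dilation=1, padding=(0, 0), stride=1):
    """L(X)_j = sum of X_{l*i + j} W_i over i with 0 <= l*i + j < m.

    X:  list of m rows, each a list of d numbers.
    Ws: list of k matrices, each d x d' (list of d rows of length d').
    Assumption: base indices are j = -p1, -p1 + s, ... with j <= m - 1 + p2.
    """
    m, d, d_out = len(X), len(Ws[0]), len(Ws[0][0])
    p1, p2 = padding
    m_out = (m - 1 + p1 + p2) // stride + 1
    out = []
    for n in range(m_out):
        j = -p1 + stride * n                # base index
        row = [0.0] * d_out
        for i, W in enumerate(Ws):
            idx = dilation * i + j          # displaced index l*i + j
            if 0 <= idx < m:                # out-of-range rows contribute nothing
                for b in range(d_out):
                    row[b] += sum(X[idx][a] * W[a][b] for a in range(d))
        out.append(row)
    return out

X = [[1.0], [2.0], [3.0], [4.0], [5.0]]     # m = 5, d = 1
Ws = [[[1.0]], [[0.0]], [[-1.0]]]           # k = 3, d' = 1: X_j - X_{j + 2l}
plain = conv1d(X, Ws)
strided = conv1d(X, Ws, padding=(1, 0), stride=2)
dilated = conv1d(X, Ws, dilation=2)
print(plain)
print(strided)
print(dilated[0])
```

With this kernel each output row is a difference \(X_j - X_{j+2l}\) wherever both rows exist; near the right end, where \(j+2l\ge m\), only the in-range terms survive.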

Pooling

Local data can be aggregated by pooling operations. Their formulas are similar to those of the kernel-based convolutions above, but:

instead of summation, we usually use mean or maximum and
the linear maps \(W_i\) are identities.
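
A minimal sketch of 1D pooling in plain Python, mirroring the convolution formula with the \(W_i\) taken as identities and the aggregation applied per feature. The names `pool1d` and `mean` are hypothetical, the base-index convention (from \(-p_1\) in steps of \(s\) up to at most \(m-1+p_2\)) is an assumption, and the sketch further assumes that every window contains at least one in-range row. The mean is taken over in-range rows only, which is one of several possible conventions.

```python
def mean(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def pool1d(X, kernel_size, reduce=max, dilation=1, padding=(0, 0), stride=1):
    """Aggregate each window of rows feature-wise with `reduce` (max or mean).

    X: list of m rows, each a list of d numbers; the output rows also have length d.
    """
    m, d = len(X), len(X[0])
    p1, p2 = padding
    m_out = (m - 1 + p1 + p2) // stride + 1
    out = []
    for n in range(m_out):
        j = -p1 + stride * n                                   # base index
        idxs = [dilation * i + j for i in range(kernel_size)]  # displaced indices
        window = [X[t] for t in idxs if 0 <= t < m]            # keep in-range rows
        out.append([reduce(row[c] for row in window) for c in range(d)])
    return out

X = [[1.0, 8.0], [3.0, 6.0], [5.0, 4.0], [7.0, 2.0]]   # m = 4, d = 2
max_pooled = pool1d(X, kernel_size=2, reduce=max, stride=2)
mean_pooled = pool1d(X, kernel_size=2, reduce=mean, stride=2)
print(max_pooled)
print(mean_pooled)
```

With kernel size 2 and stride 2 the windows are the disjoint row pairs (0, 1) and (2, 3), so the sequence length is halved while the feature dimension is unchanged.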

References

[1] David Hunter Hubel and Torsten Nils Wiesel: Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology, 1959, vol. 148 (3), pp. 574--591. doi: 10.1113/jphysiol.1959.sp006308

[2] Kunihiko Fukushima: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 1980, vol. 36, pp. 193--202. doi: 10.1007/BF00344251

[3] Yann LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1989, vol. 1 (4), pp. 541--551. doi: 10.1162/neco.1989.1.4.541

[4] Kyoung-Su Oh, Keechul Jung: GPU implementation of neural networks. Pattern Recognition, 2004, vol. 37 (6), pp. 1311--1314. doi: 10.1016/j.patcog.2004.01.013

[5] Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella and Jürgen Schmidhuber: Flexible, High Performance Convolutional Neural Networks for Image Classification. Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), 2011, vol. 2, pp. 1237--1242.

[6] Dan Cireşan, Ueli Meier, Jonathan Masci and Jürgen Schmidhuber: Multi-column deep neural network for traffic sign classification. Neural Networks, 2012, vol. 32, pp. 333--338. doi: 10.1016/j.neunet.2012.02.023

[8] Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton: ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (NeurIPS 2012).

Dataset References

[7] Johannes Stallkamp, Marc Schlipsing, Jan Salmen and Christian Igel: The German Traffic Sign Recognition Benchmark: A multi-class classification competition. The 2011 International Joint Conference on Neural Networks. doi: 10.1109/IJCNN.2011.6033395

[9] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015, vol. 115, pp. 211--252. doi: 10.1007/s11263-015-0816-y