Representing Local Context
Suppose we are given an input data entry or a hidden representation \(X^{(l)}\in\mathbf R^{s\times d}\), where \(s\) is the sequence dimension and \(d\) is the feature dimension. When producing the next hidden representation \(X^{(l+1)}\), we would like to have a representation of the local context at each sequence position \(1\le i\le s\). To this end, we have seen the following options so far:
- In a permutation-equivariant layer, we can balance element-level and sequence-level information:
\[ X^{(l+1)} \leftarrow X^{(l)}W_1 + \frac{1}{s}(\mathbf 1\mathbf 1^T)X^{(l)}W_2 \]
- In a translation-equivariant, that is, convolutional layer, we can aggregate context at different sequence elements with the same kernel:
\[ X^{(l+1)}_i \leftarrow \sum_{j=1}^sX^{(l)}_{\overline{i + j}}K_j \]
We saw that to get local context, we can make \(K_j=0\) for \(|j|>C\), for some constant \(C\).
One may wonder: can we use the features \(X^{(l)}_i\) at element \(i\) to decide how much the features \(X^{(l)}_j\) at the other positions \(j\) matter for the local context features \(X^{(l+1)}_i\) at the next hidden representation level?
Dot-Product Attention
We want
\[ X^{(l+1)}_i \leftarrow \sum_{j=1}^s w_{ij}X^{(l)}_jW_V^{(l)}, \]
where:
- the value weight matrix \(W_V^{(l)}\in\mathbf R^{d\times d}\) processes element-level features and
- for element indices \(1\le i,j\le s\), the weight \(w_{ij}\) determines how much the level \(l\) representation \(X^{(l)}_j\) of element \(j\) matters for the level-(\(l+1\)) representation of element \(i\).
For stability, we want \(\sum_{j=1}^sw_{ij}=1\). Thus, we take the Data Scientist's default choice, the softmax of logits \(w'_{ij}\):
\[ w_{ij}=\frac{\exp(w'_{ij})}{\sum_{k=1}^s\exp(w'_{ik})}. \]
The idea was first proposed in [1], for machine translation. We follow [2], whence the success of the method stems.
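The weighting step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the normalization is the softmax; the logits and value rows below are made-up numbers, and the helper names `softmax` and `aggregate` are hypothetical.

```python
import math

def softmax(logits):
    """Turn a list of logits into positive weights that sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(weights, values):
    """Weighted sum of value rows: sum_j weights[j] * values[j]."""
    d = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d)]

# Made-up logits w'_{ij} for one fixed position i over s = 3 positions.
logits_i = [2.0, 0.0, -1.0]
# Made-up value rows X_j W_V for the same 3 positions (d = 2).
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

w_i = softmax(logits_i)              # attention weights w_{ij}, j = 1..s
x_next_i = aggregate(w_i, values)    # next-level features of element i
print(w_i, x_next_i)
```

Larger logits receive larger weights, so the element with logit 2.0 dominates the aggregated features.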
Self-Attention
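We want to get the logits \(w'_{ij}\) from the same data entry \(X^{(l)}\in\mathbf R^{s\times d}\). One can get them as scalar products
\[ w'_{ij}=\langle\mathbf q_i,\mathbf k_j\rangle,\quad\text{where}\quad \mathbf q_i=X^{(l)}_iW^{(l)}_Q,\quad \mathbf k_j=X^{(l)}_jW^{(l)}_K. \]
We call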
- \(W^{(l)}_Q,W^{(l)}_K\in\mathbf R^{d\times d}\) the query and key weight matrices and
- \(\mathbf q_i,\mathbf k_j\in\mathbf R^d\) query and key vectors.
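As a rough sketch of how queries, keys and values fit together, the following plain-Python code computes single-head self-attention without scaling. Matrices are lists of rows, the softmax normalization is assumed, and the function names and toy input are hypothetical; a real implementation would use an array library.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V):
    """Return the aggregated features and the attention weights."""
    Q = matmul(X, W_Q)  # row i is the query vector q_i
    K = matmul(X, W_K)  # row j is the key vector k_j
    V = matmul(X, W_V)  # row j is the value vector of element j
    K_T = [list(col) for col in zip(*K)]
    logits = matmul(Q, K_T)  # logits[i][j] = <q_i, k_j>
    weights = [softmax(row) for row in logits]
    return matmul(weights, V), weights

# Toy input with s = 3 elements and d = 2 features; identity weight matrices.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(X, I, I, I)
print(weights[0])
```

With identity weight matrices the logits are just the scalar products of the input rows, so element 1 attends equally to elements 1 and 3 and less to element 2.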
Scaled Dot-Product Attention
Suppose that we initialize our parameters by the book, so that the entries of \(\mathbf q_i\) and \(\mathbf k_j\) are independent, with mean 0 and variance 1. Then the logit \(w_{ij}'\) is a sum of \(d\) uncorrelated products, each of mean 0 and variance 1, so it has mean 0 and variance \(d\). To counteract this effect, we divide the logits by \(\sqrt d\), which brings their variance back to 1.
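The variance claim can be checked empirically. The sketch below assumes standard normal entries (one distribution with mean 0 and variance 1; the argument does not depend on this choice) and picks \(d=64\) and the sample size arbitrarily.

```python
import math
import random

random.seed(0)
d = 64            # feature dimension (arbitrary choice for the experiment)
n_samples = 5000  # number of simulated (query, key) pairs

def logit():
    """Scalar product of two independent standard normal vectors in R^d."""
    q = [random.gauss(0.0, 1.0) for _ in range(d)]
    k = [random.gauss(0.0, 1.0) for _ in range(d)]
    return sum(a * b for a, b in zip(q, k))

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

raw = [logit() for _ in range(n_samples)]
scaled = [x / math.sqrt(d) for x in raw]

var_raw = variance(raw)        # should be close to d
var_scaled = variance(scaled)  # should be close to 1
print(var_raw, var_scaled)
```

The first printed value should be near 64 and the second near 1, up to sampling noise.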
Multi-Head Attention
It may help training if we aggregate contextual data along not one, but several attention patterns. To this end:
- We introduce a new hyperparameter: the number of attention heads \(h\). For simplicity, we assume \(h\mid d\). We let \(d_k=\frac{d}{h}\).
- We reshape the query, key and value matrices
\[ Q=X^{(l)}W^{(l)}_Q,\quad K=X^{(l)}W^{(l)}_K,\quad V=X^{(l)}W^{(l)}_V, \]
each of which lies in \(\mathbf R^{s\times d}\), as 3-tensors in \(\mathbf R^{h\times s\times d_k}\), thus effectively introducing a new batch dimension \(h\).
- In each head, we take scalar products of queries and keys along the dimension \(d_k\) and aggregate values along the sequence dimension:
\[ \mathop{\mathrm{Attention}}(Q,K,V)=\mathop{\mathrm{softmax}} \left(\frac{QK^T}{\sqrt{d_k}}\right)V, \]
where the softmax is applied to each row.
- We reshape the result back to \(\mathbf R^{s\times d}\). Then we multiply it by yet another matrix \(W_O^{(l)}\in\mathbf R^{d\times d}\), the output weight matrix. We denote the result of this by \(\mathop{\mathrm{MultiHead}}(Q,K,V)\) or \(\mathop{\mathrm{MHSA}}^{(l)}(X^{(l)})\).
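The steps above can be sketched in plain Python as follows. This is an illustration under assumptions, not a reference implementation: matrices are lists of rows, each head is taken to be a contiguous slice of \(d_k\) feature columns, the weights are random, and all function names are hypothetical.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V have shape (s, d_k)."""
    d_k = len(Q[0])
    logits = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([x / math.sqrt(d_k) for x in row]) for row in logits]
    return matmul(weights, V)

def split_heads(M, h):
    """Cut an (s, d) matrix into h matrices of shape (s, d // h)."""
    d_k = len(M[0]) // h
    return [[row[a * d_k:(a + 1) * d_k] for row in M] for a in range(h)]

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h):
    Q, K, V = (split_heads(matmul(X, W), h) for W in (W_Q, W_K, W_V))
    heads = [attention(q, k, v) for q, k, v in zip(Q, K, V)]
    # Concatenate the head outputs along the feature axis, back to shape (s, d).
    concat = [sum((head[i] for head in heads), []) for i in range(len(X))]
    return matmul(concat, W_O)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
s, d, h = 3, 4, 2
X = rand_matrix(s, d)
W_Q, W_K, W_V, W_O = (rand_matrix(d, d) for _ in range(4))
out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h)
print(len(out), len(out[0]))
```

The output has the same shape as the input, so the layer can be stacked or wrapped in a residual connection.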
Multi-Head Self-Attention Blocks
We use the tools we learned about last time (layer normalization, dropout and residual connections) to improve multi-head attention.
- The original paper [2] uses the following formula:
\[ Y^{(l)}=\mathop{\mathrm{LayerNorm}}\left(X^{(l)} + \mathop{\mathrm{dropout}}\left(\mathop{\mathrm{MHSA}}^{(l)}\left( X^{(l)} \right)\right)\right). \]
- The above is called Post-LN in [3], to distinguish it from the variant the authors call Pre-LN,
\[ Y^{(l)}=X^{(l)} + \mathop{\mathrm{dropout}}\left(\mathop{\mathrm{MHSA}}^{(l)}\left( \mathop{\mathrm{LayerNorm}}\left(X^{(l)}\right) \right)\right), \]
which they found usually gives better results.
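The difference between the two orderings is easiest to see in code. The sketch below makes several simplifying assumptions: layer normalization has no learnable gain or bias, dropout is omitted (as it would be at evaluation time), and `toy_sublayer` is a hypothetical stand-in for the attention layer.

```python
import math

def layer_norm(X, eps=1e-5):
    """Normalize each row to mean 0 and variance 1 (no learnable gain or bias)."""
    out = []
    for row in X:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out

def add(A, B):
    """Elementwise sum, used for the residual connection."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def post_ln(X, sublayer):
    # Normalize after the residual connection.
    return layer_norm(add(X, sublayer(X)))

def pre_ln(X, sublayer):
    # Normalize only the input of the sublayer; the residual path stays untouched.
    return add(X, sublayer(layer_norm(X)))

def toy_sublayer(X):
    """Hypothetical stand-in for MHSA: doubles every entry."""
    return [[2.0 * x for x in row] for row in X]

X = [[1.0, 2.0, 3.0, 6.0], [0.0, -1.0, 4.0, 1.0]]
Y_post = post_ln(X, toy_sublayer)
Y_pre = pre_ln(X, toy_sublayer)
print(Y_post[0], Y_pre[0])
```

In the Post-LN output every row is normalized, whereas in the Pre-LN output the unnormalized input passes through the residual path unchanged.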
Feedforward Blocks
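We also need to use a nonlinear transformation along the feature dimension. To this end, we use a so-called feedforward block. Let
\[ \mathop{\mathrm{MLP}}^{(l)}\colon\mathbf R^{d}\to\mathbf R^{d} \]
denote an MLP with 1 hidden layer, applied to each sequence element separately.
Then the Post-LN and Pre-LN versions of the feedforward block are
\[ X^{(l+1)}=\mathop{\mathrm{LayerNorm}}\left(Y^{(l)} + \mathop{\mathrm{dropout}}\left(\mathop{\mathrm{MLP}}^{(l)}\left( Y^{(l)} \right)\right)\right) \]
and
\[ X^{(l+1)}=Y^{(l)} + \mathop{\mathrm{dropout}}\left(\mathop{\mathrm{MLP}}^{(l)}\left( \mathop{\mathrm{LayerNorm}}\left(Y^{(l)}\right) \right)\right). \]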
Transformer Blocks
The composite of a multi-head self-attention block and a feedforward block is a transformer block. These are the basic building blocks of transformers.
Note that, so far, our architecture is permutation-equivariant. Therefore, in the lab today, we'll see how much it improves the Travelling SalesAgent. Next time, we'll discuss injecting positional data.
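The permutation equivariance can be verified numerically on a small example. The sketch below builds one transformer block in Pre-LN form under simplifying assumptions that are not from the lecture: a single attention head, a ReLU MLP without biases, layer normalization without learnable parameters, no dropout, and random weights. All names are hypothetical.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(X, eps=1e-5):
    out = []
    for row in X:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out

def attention_sublayer(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention (one head for brevity)."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    scale = math.sqrt(len(Q[0]))
    logits = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([x / scale for x in row]) for row in logits]
    return matmul(weights, V)

def mlp(Y, W_1, W_2):
    """MLP with one hidden layer, applied row by row (ReLU, no biases: assumptions)."""
    hidden = [[max(0.0, x) for x in row] for row in matmul(Y, W_1)]
    return matmul(hidden, W_2)

def transformer_block(X, p):
    # Self-attention block, then feedforward block, both in Pre-LN form.
    Y = add(X, attention_sublayer(layer_norm(X), p["W_Q"], p["W_K"], p["W_V"]))
    return add(Y, mlp(layer_norm(Y), p["W_1"], p["W_2"]))

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
s, d, d_hidden = 3, 4, 8
params = {"W_Q": rand_matrix(d, d), "W_K": rand_matrix(d, d), "W_V": rand_matrix(d, d),
          "W_1": rand_matrix(d, d_hidden), "W_2": rand_matrix(d_hidden, d)}
X = rand_matrix(s, d)

perm = [2, 0, 1]  # reorder the sequence elements
out = transformer_block(X, params)
out_of_permuted = transformer_block([X[p] for p in perm], params)
permuted_out = [out[p] for p in perm]
max_diff = max(abs(a - b) for ra, rb in zip(out_of_permuted, permuted_out)
               for a, b in zip(ra, rb))
print(max_diff)
```

Permuting the input rows and then applying the block gives the same result as applying the block and then permuting the output rows, up to floating-point rounding.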
References
[1] Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. International Conference on Learning Representations (ICLR), 2015.
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin: Attention Is All You Need.
[3] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang and Tieyan Liu: On Layer Normalization in the Transformer Architecture. Proceedings of the 37th International Conference on Machine Learning, PMLR, vol. 119, pp. 10524--10533, 2020.