Representing Local Context

Suppose we are given an input data entry or a hidden representation \(X^{(l)}\in\mathbf R^{s\times d}\), where \(s\) is the sequence dimension. When producing the next hidden representation \(X^{(l+1)}\) at sequence position \(1\le i\le s\), we would like to have a representation of the local context. To this end, we have seen the following options so far:

In a permutation-equivariant layer, we can balance element-level and sequence-level information:

\[ X^{(l+1)} \leftarrow X^{(l)}W_1 + \frac{1}{s}(\mathbf 1\mathbf 1^T)X^{(l)}W_2 \]
In a translation-equivariant layer, that is, a convolution layer, we can aggregate context at different sequence elements with the same kernel:

\[ X^{(l+1)}_i \leftarrow \sum_{j=1}^sX^{(l)}_{\overline{i + j}}K_j \]

We saw that to get local context, we can make \(K_j=0\) for \(|j|>C\), for some constant \(C\).
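The two options above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names, the sizes, the kernel radius `C = 1`, the storage of the kernel as a dictionary of offsets \(-C,\dots,C\), and the reading of \(\overline{i+j}\) as an index taken modulo \(s\) are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
s, d, C = 6, 4, 1  # sequence length, feature width, kernel radius (illustrative values)
X = rng.normal(size=(s, d))

def perm_equivariant_layer(X, W1, W2):
    # X @ W1 acts on each element separately; the mean over the sequence
    # axis equals (1/s)(1 1^T) X and is broadcast to every row.
    return X @ W1 + X.mean(axis=0, keepdims=True) @ W2

def circular_conv_layer(X, K):
    # K maps an offset j in {-C, ..., C} to a (d, d) matrix; all other
    # offsets are treated as zero. Indices wrap around modulo s (assumption).
    s = X.shape[0]
    out = np.zeros_like(X)
    for i in range(s):
        for j, Kj in K.items():
            out[i] += X[(i + j) % s] @ Kj
    return out

W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
K = {j: rng.normal(size=(d, d)) for j in range(-C, C + 1)}

# Permuting the input rows permutes the output rows in the same way.
perm = rng.permutation(s)
perm_ok = np.allclose(perm_equivariant_layer(X[perm], W1, W2),
                      perm_equivariant_layer(X, W1, W2)[perm])

# Cyclically shifting the input shifts the output of the convolution.
shift_ok = np.allclose(circular_conv_layer(np.roll(X, 1, axis=0), K),
                       np.roll(circular_conv_layer(X, K), 1, axis=0))
```

Both checks compare "transform the input, then apply the layer" with "apply the layer, then transform the output", which is what equivariance means.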

One may wonder: can we use the features \(X^{(l)}_i\) at element \(i\) to decide how much the features \(X^{(l)}_j\) at the other element positions \(j\) matter for the local context features \(X^{(l+1)}_i\) at the next hidden representation level?

Dot-Product Attention

We want

\[ X^{(l+1)}_i=\sum_{j=1}^sw_{ij}X^{(l)}_jW_V^{(l)}, \]

where:

the value weight matrix \(W_V^{(l)}\in\mathbf R^{d\times d}\) processes element-level features and
for element indices \(1\le i,j\le s\), the weight \(w_{ij}\) determines how much the level \(l\) representation \(X^{(l)}_j\) of element \(j\) matters for the level-(\(l+1\)) representation of element \(i\).

For stability, we want \(\sum_{j=1}^sw_{ij}=1\). Thus, we take the Data Scientist's default choice:

\[ w_{ij}=\mathop{\mathrm{softmax}}(w'_{ij'}:1\le j'\le s)_j. \]
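As a rough sketch of the weighted sum above, assuming random placeholder logits \(w'_{ij}\) and illustrative sizes (none of these values come from the text), one could write:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Subtracting the maximum leaves the result unchanged and avoids overflow.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
s, d = 5, 3                         # illustrative sizes
X = rng.normal(size=(s, d))
W_V = rng.normal(size=(d, d))
logits = rng.normal(size=(s, s))    # placeholder values for w'_{ij}

w = softmax(logits, axis=-1)        # row i holds the weights w_{i1}, ..., w_{is}
X_next = w @ (X @ W_V)              # row i is sum_j w_{ij} X_j W_V
```

Taking the softmax along each row makes every row of `w` a probability vector, so each new representation is a convex combination of the transformed values.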

The idea was first proposed in [1], for machine translation. We follow [2], whence the success of the method stems.

Self-Attention

We want to get the logits \(w'_{ij}\) from the same data entry \(X^{(l)}\in\mathbf R^{s\times d}\). One can get them as scalar products

\[ w'_{ij}\leftarrow\left(X^{(l)}_iW^{(l)}_Q\right)^T\left(X^{(l)}_jW^{(l)}_K\right)=:\mathbf q_i^T\mathbf k_j. \]

We call

\(W^{(l)}_Q,W^{(l)}_K\in\mathbf R^{d\times d}\) the query and key weight matrices and
\(\mathbf q_i,\mathbf k_j\in\mathbf R^d\) query and key vectors.
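In matrix form, all logits can be computed at once. The snippet below is a minimal sketch with randomly drawn weight matrices and illustrative sizes; it stores the query and key vectors as the rows of `Q` and `K`.

```python
import numpy as np

rng = np.random.default_rng(0)
s, d = 5, 8                       # illustrative sizes
X = rng.normal(size=(s, d))
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

Q = X @ W_Q          # row i is the query vector q_i
K = X @ W_K          # row j is the key vector k_j
logits = Q @ K.T     # entry (i, j) is the scalar product of q_i and k_j
```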

Scaled Dot-Product Attention

Suppose that we initialize our parameters by the book, and thus the entries of \(\mathbf q_i\) and \(\mathbf k_j\) are independent, with mean 0 and variance 1. Then the logit \(w'_{ij}\), being a sum of \(d\) independent products that each have mean 0 and variance 1, has mean 0 and variance \(d\). To counteract this effect, we divide the logits by \(\sqrt d\), which brings the variance back to 1.
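The variance claim can be checked empirically. The following sketch assumes standard normal entries (one possible distribution with mean 0 and variance 1) and illustrative values for `d` and the sample count `n`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 20_000                  # feature width and number of sampled pairs
q = rng.normal(size=(n, d))        # n independent query vectors
k = rng.normal(size=(n, d))        # n independent key vectors

logits = np.einsum("nd,nd->n", q, k)         # one scalar product per pair
raw_var = logits.var()                       # should be close to d
scaled_var = (logits / np.sqrt(d)).var()     # should be close to 1
```

Since dividing by \(\sqrt d\) divides the variance by \(d\), the scaled logits keep roughly unit variance no matter how wide the layer is.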

Multi-Head Attention

It may help training if we aggregate contextual data along not one, but several attention patterns. To this end:

We introduce a new hyperparameter: the number of attention heads \(h\). For simplicity, we assume \(h\mid d\). We let \(d_k=\frac{d}{h}\).
We reshape the query, key and value matrices:

\[ Q=X^{(l)}W^{(l)}_Q,\quad K=X^{(l)}W^{(l)}_K,\quad V=X^{(l)}W^{(l)}_V \]

as 3-tensors in \(\mathbf R^{h\times s\times d_k}\), thus effectively introducing a new batch dimension \(h\).
Within each head, we take scalar products along the dimension \(d_k\) and aggregate values along the sequence dimension \(s\):

\[ \mathop{\mathrm{Attention}}(Q,K,V)=\mathop{\mathrm{softmax}} \left(\frac{QK^T}{\sqrt{d_k}}\right)V, \]

where \(Q,K,V\in\mathbf R^{s\times d_k}\) denote the slices belonging to one head and the softmax is taken along each row.
We concatenate the heads and reshape the result back to \(\mathbf R^{s\times d}\). Then we multiply it by yet another matrix \(W_O^{(l)}\in\mathbf R^{d\times d}\), the output weight matrix. We denote the result of this by \(\mathop{\mathrm{MultiHead}}(Q,K,V)\) or \(\mathop{\mathrm{MHSA}}^{(l)}(X^{(l)})\).
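The steps above can be sketched in NumPy as follows. This is one possible arrangement, not a reference implementation: the helper names `split_heads` and `mhsa` are made up for this example, and it assumes that head number `a` receives the feature columns `a*d_k` to `(a+1)*d_k - 1`.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(M, h):
    # (s, d) -> (h, s, d_k): each head gets a contiguous slice of the features.
    s, d = M.shape
    return M.reshape(s, h, d // h).transpose(1, 0, 2)

def mhsa(X, W_Q, W_K, W_V, W_O, h):
    s, d = X.shape
    assert d % h == 0, "the number of heads must divide d"
    d_k = d // h
    Q, K, V = (split_heads(X @ W, h) for W in (W_Q, W_K, W_V))
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, s, s)
    weights = softmax(logits, axis=-1)                 # each row sums to 1
    heads = weights @ V                                # (h, s, d_k)
    merged = heads.transpose(1, 0, 2).reshape(s, d)    # concatenate the heads
    return merged @ W_O, weights

rng = np.random.default_rng(0)
s, d, h = 5, 8, 2                                      # illustrative sizes
X = rng.normal(size=(s, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
out, weights = mhsa(X, W_Q, W_K, W_V, W_O, h)
```

The matrix product `@` treats the leading axis of size `h` as a batch axis, so all heads are processed in one call without an explicit loop.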

Multi-Head Self-Attention Blocks

To improve multi-head attention, we use the tools we learned about last time: layer normalization, dropout and residual connections.

The original paper [2] uses the following formula:

\[ Y^{(l)}=\mathop{\mathrm{LayerNorm}}\left(X^{(l)} + \mathop{\mathrm{dropout}}\left(\mathop{\mathrm{MHSA}}^{(l)}\left( X^{(l)} \right)\right)\right). \]
The above is called Post-LN in [3], in contrast to what they call Pre-LN:

\[ Y^{(l)}=X^{(l)} + \mathop{\mathrm{dropout}}\left(\mathop{\mathrm{MHSA}}^{(l)}\left( \mathop{\mathrm{LayerNorm}}\left(X^{(l)}\right) \right)\right), \]

which, as they discovered, usually gives better results.
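The difference between the two variants is only where the layer normalization sits relative to the residual connection. Here is a hedged sketch with a generic `sublayer` argument standing in for \(\mathop{\mathrm{MHSA}}^{(l)}\). The names `post_ln` and `pre_ln`, the omission of the learnable gain and bias in the layer normalization, and the use of inverted dropout are assumptions of this example.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Normalize each row over the feature axis; learnable gain and bias omitted.
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def dropout(X, p, rng, training):
    # Inverted dropout: rescale the kept entries so the expectation is unchanged.
    if not training or p == 0.0:
        return X
    mask = rng.random(X.shape) >= p
    return X * mask / (1.0 - p)

def post_ln(X, sublayer, p, rng, training=True):
    # Normalize after adding the residual.
    return layer_norm(X + dropout(sublayer(X), p, rng, training))

def pre_ln(X, sublayer, p, rng, training=True):
    # Normalize only the sublayer input; the residual path stays untouched.
    return X + dropout(sublayer(layer_norm(X)), p, rng, training)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 8)) / np.sqrt(8)

def sublayer(Z):
    # Stand-in for the attention sublayer, for illustration only.
    return Z @ W

Y_post = post_ln(X, sublayer, p=0.1, rng=rng)
Y_pre = pre_ln(X, sublayer, p=0.1, rng=rng)
```

In the Pre-LN variant the input \(X^{(l)}\) reaches the output through an identity path, whereas in Post-LN every row of the output is renormalized.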

Feedforward Blocks

We also need to use a nonlinear transformation along the feature dimension. To this end, we use a so-called feedforward block. Let

\[ \mathop{\mathrm{MLP}}^{(l)}(Y)=a(YW_1^{(l)} + \mathbf b_1^{(l)}) W_2^{(l)} + \mathbf b_2^{(l)} \]

denote an MLP with one hidden layer, where \(a\) is an activation function applied elementwise.

Then the Post-LN and Pre-LN versions of the feedforward block are

\[ X^{(l+1)}=\mathop{\mathrm{LayerNorm}}\left(Y^{(l)} + \mathop{\mathrm{dropout}}\left(\mathop{\mathrm{MLP}}^{(l)}\left( Y^{(l)} \right)\right)\right) \]

and

\[ X^{(l+1)}=Y^{(l)} + \mathop{\mathrm{dropout}}\left(\mathop{\mathrm{MLP}}^{(l)}\left( \mathop{\mathrm{LayerNorm}}\left(Y^{(l)}\right) \right)\right). \]
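A minimal sketch of the Pre-LN feedforward block follows. It assumes ReLU for the activation \(a\), a hidden width `d_ff` chosen for illustration, no learnable gain or bias in the layer normalization, and dropout switched off as it would be at inference time.

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0.0)

def layer_norm(X, eps=1e-5):
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def mlp(Y, W1, b1, W2, b2, a=relu):
    # One hidden layer, applied to every row (sequence element) independently.
    return a(Y @ W1 + b1) @ W2 + b2

def feedforward_block_pre_ln(Y, W1, b1, W2, b2):
    # Pre-LN variant with dropout omitted.
    return Y + mlp(layer_norm(Y), W1, b1, W2, b2)

rng = np.random.default_rng(0)
s, d, d_ff = 5, 8, 32              # d_ff is an assumed hidden width
Y = rng.normal(size=(s, d))
W1, b1 = rng.normal(size=(d, d_ff)) / np.sqrt(d), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) / np.sqrt(d_ff), rng.normal(size=d)

X_next = feedforward_block_pre_ln(Y, W1, b1, W2, b2)
```

Because every operation here acts along the feature axis only, applying the block to a single row gives the same result as taking that row from the full output.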

Transformer Blocks

The composite of a multi-head self-attention block and a feedforward block is a transformer block. Transformer blocks are the basic building blocks of transformers.

Note that, so far, our architecture is permutation-equivariant. Therefore, in the lab today, we'll see how much it improves the Travelling SalesAgent. Next time, we'll discuss injecting positional data.
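The permutation-equivariance claim can be checked numerically. The sketch below assembles a Pre-LN transformer block under several assumptions: ReLU activation, no dropout, no learnable layer-normalization parameters, randomly drawn weights, and parameter names in the `params` dictionary that are chosen for this example only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(X.var(axis=-1, keepdims=True) + eps)

def mhsa(X, params, h):
    s, d = X.shape
    d_k = d // h

    def split(M):
        # (s, d) -> (h, s, d_k)
        return M.reshape(s, h, d_k).transpose(1, 0, 2)

    Q = split(X @ params["W_Q"])
    K = split(X @ params["W_K"])
    V = split(X @ params["W_V"])
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))
    return (A @ V).transpose(1, 0, 2).reshape(s, d) @ params["W_O"]

def mlp(Y, params):
    hidden = np.maximum(Y @ params["W_1"] + params["b_1"], 0.0)
    return hidden @ params["W_2"] + params["b_2"]

def transformer_block(X, params, h):
    Y = X + mhsa(layer_norm(X), params, h)      # self-attention block
    return Y + mlp(layer_norm(Y), params)       # feedforward block

rng = np.random.default_rng(0)
s, d, h, d_ff = 6, 8, 2, 16                     # illustrative sizes
params = {
    "W_Q": rng.normal(size=(d, d)) / np.sqrt(d),
    "W_K": rng.normal(size=(d, d)) / np.sqrt(d),
    "W_V": rng.normal(size=(d, d)) / np.sqrt(d),
    "W_O": rng.normal(size=(d, d)) / np.sqrt(d),
    "W_1": rng.normal(size=(d, d_ff)) / np.sqrt(d),
    "b_1": rng.normal(size=d_ff),
    "W_2": rng.normal(size=(d_ff, d)) / np.sqrt(d_ff),
    "b_2": rng.normal(size=d),
}
X = rng.normal(size=(s, d))

perm = rng.permutation(s)
equivariant = np.allclose(transformer_block(X[perm], params, h),
                          transformer_block(X, params, h)[perm])
```

Reordering the input rows reorders the output rows in the same way, so the block by itself has no notion of where an element sits in the sequence.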

References

[1] Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. International Conference on Learning Representations (ICLR), 2015.

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin: Attention Is All You Need.

[3] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang and Tieyan Liu: On Layer Normalization in the Transformer Architecture. Proceedings of the 37th International Conference on Machine Learning, PMLR, vol. 119, pp. 10524--10533, 2020.