5/7 Tokenization and Positional Encodings
Tokenization
Let's recall the issue of tokenization: splitting text into fragments that receive unique identifiers. Such a fragment is called a token. Thus, we convert text to a sequence of integers.
What should be the level of fragmentation?
- If we tokenize characters, the sequences become too long. With Unicode, we can also get too many distinct tokens; there are approaches that split Unicode characters into bytes, see e.g. [1], which of course further increases the sequence length.
- If we tokenize words, we get too many tokens, many of which are so rare that they receive hardly any training signal.
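To make the trade-off concrete, the following sketch compares the sequence length of one string under character-level, byte-level and word-level splitting. The sample string and the whitespace-based word splitting are illustrative assumptions, not the behaviour of any particular tokenizer.

```python
# Compare sequence lengths under three naive fragmentation levels.
text = "naïve tokenization of Erdős–Rényi graphs"

chars = list(text)                     # one token per Unicode code point
byte_ids = list(text.encode("utf-8"))  # one token per UTF-8 byte, ids in 0..255
words = text.split()                   # one token per whitespace-separated word

print(len(chars), len(byte_ids), len(words))
```

Byte-level splitting caps the vocabulary at 256 symbols but gives the longest sequence, while word-level splitting gives the shortest sequence at the price of an open-ended vocabulary.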
Therefore, most of the time, subword tokenization methods are applied. There are data-driven methods that, given a text corpus, try to split it into subwords such that
- The number of distinct subwords, the vocabulary size, is given as a hyperparameter, and
- The distribution of subword counts is as balanced as possible.
If interested, see [2] for a summary of tokenization algorithms.
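As an illustration of such a data-driven method, here is a minimal sketch of the merge-learning step of byte-pair encoding (BPE). BPE is named here only as one example; the text above does not commit to a specific algorithm, and the toy corpus and the function name `learn_bpe_merges` are hypothetical. The final vocabulary consists of the base characters plus one new subword per merge, so the number of merges plays the role of the vocabulary-size hyperparameter.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair, first on ties
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = learn_bpe_merges(["low", "low", "lower", "lowest", "newest", "newest"], 3)
print(merges)
```

Frequent character sequences are merged into single subwords first, while rare words stay split into smaller, more frequently seen pieces.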
Let \(\mathscr V=\{w_i:i\in I\}\) denote the vocabulary and \(\{\mathbf v_i\in\mathbf R^d:i\in I\}\) the trainable token embedding vectors we get.
Special Tokens
We discussed before that to make minibatches of sequences of varying length, we need to make use of padding tokens. In decoded sequences, they are usually denoted by [PAD].
We also add a special token [BOS] at the beginning of a sequence. This is used to collect sequence-level information in supervised models and acts as a start signal for generative models. In generative models, we also use an end-of-sequence token [EOS].
The end of a subsequence (e.g. in a question-answer pair) is signalled by [SEP].
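The role of the special tokens can be sketched as follows. The toy vocabulary, its ids, the layout `[BOS] question [SEP] answer [EOS]`, and the helper names `encode_pair` and `pad_batch` are assumptions made for illustration; real tokenizers define their own conventions.

```python
# Hypothetical toy vocabulary; the ids are arbitrary illustrative choices.
vocab = {"[PAD]": 0, "[BOS]": 1, "[EOS]": 2, "[SEP]": 3,
         "how": 4, "are": 5, "you": 6, "fine": 7, "thanks": 8}

def encode_pair(question, answer):
    """Encode a question-answer pair as [BOS] question [SEP] answer [EOS]."""
    tokens = ["[BOS]", *question, "[SEP]", *answer, "[EOS]"]
    return [vocab[t] for t in tokens]

def pad_batch(sequences, pad_id=vocab["[PAD]"]):
    """Right-pad to the longest sequence; the mask is 1 for real tokens."""
    max_len = max(len(s) for s in sequences)
    ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return ids, mask

batch = [encode_pair(["how", "are", "you"], ["fine", "thanks"]),
         encode_pair(["how"], ["fine"])]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The mask lets later layers ignore the padded positions, so sequences of different lengths can share one rectangular minibatch.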
Position Embedding
Note that a transformer layer is permutation equivariant in the sequence dimension: permuting the input sequence permutes the output in the same way, so up to this point the model is a particular Deep Set architecture that ignores token order. To add token ordering signals, instead of hard-coded rules, we use position embedding vectors.
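The lack of ordering information can be checked numerically. The sketch below uses a stripped-down single-head self-attention with identity projections (queries, keys and values all equal to the input), which is a simplifying assumption rather than a full transformer layer. Reordering the input rows only reorders the output rows in the same way.

```python
import math

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product logits of this query against every row.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(e - m) for e in logits]
        total = sum(weights)
        out.append([sum(w / total * v[c] for w, v in zip(weights, X))
                    for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
perm = [2, 0, 1]

Y = self_attention(X)
Y_perm = self_attention([X[p] for p in perm])

# Row r of the permuted output matches row perm[r] of the original output.
same = all(abs(a - b) < 1e-9
           for r, p in enumerate(perm)
           for a, b in zip(Y_perm[r], Y[p]))
print(same)
```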
Absolute Position Embedding
For each sequential position index \(j\), we take a position embedding vector \(\mathbf p_j\in\mathbf R^d\). Then, if the input text has token \(i\) at index \(j\), the embedding vector \(X^{(0)}_j\in\mathbf R^d\) at position \(j\) is the sum \(\mathbf v_i+\mathbf p_j\).
Learned Absolute Position Embeddings
One possible way to define the vectors \(\mathbf p_j\) is to make them trainable vectors, just like the token embeddings \(\mathbf v_i\). We refer to this as learned position embedding. They are used, for example, in BERT [3], a widely used architecture for text and token classification tasks.
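A minimal sketch of the input embedding \(\mathbf v_i+\mathbf p_j\) with two lookup tables is given below. The sizes, the random placeholder values (which would be trained parameters in a real model) and the function name `embed` are assumptions for illustration.

```python
import random

random.seed(0)
d, vocab_size, max_len = 4, 10, 8

# Trainable tables in a real model; random placeholders here.
token_emb = [[random.gauss(0, 1) for _ in range(d)] for _ in range(vocab_size)]
pos_emb = [[random.gauss(0, 1) for _ in range(d)] for _ in range(max_len)]

def embed(token_ids):
    """Row j of the result is v_i + p_j for the token i found at position j."""
    return [[v + p for v, p in zip(token_emb[i], pos_emb[j])]
            for j, i in enumerate(token_ids)]

# Token 3 occurs at positions 0 and 2 and gets two different input vectors.
X0 = embed([3, 7, 3])
print(X0[0] != X0[2])
```

Since the position table has a fixed number of rows, a learned absolute embedding cannot represent positions beyond `max_len` without further training.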
Fixed Absolute Position Embeddings
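Another possibility is to define the \(\mathbf p_j\) with a formula. We refer to this as fixed position embedding. For example, the original transformer [4] uses trigonometric functions: indexing the components from \(0\), the \(k\)-th component of the \(j\)-th position vector is
\[(\mathbf p_j)_k=\begin{cases}\sin\left(j/10000^{k/d}\right),&k\text{ even},\\ \cos\left(j/10000^{(k-1)/d}\right),&k\text{ odd}.\end{cases}\]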
The authors of [4] report that this performs nearly identically to learned position embeddings.
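The trigonometric position vectors of [4] can be sketched as follows, with components indexed from 0 so that components \(2m\) and \(2m+1\) hold the sine and cosine of the same angle \(j/10000^{2m/d}\). The function name `sinusoidal_position` and the `base` parameter are illustrative choices.

```python
import math

def sinusoidal_position(j, d, base=10000.0):
    """Position vector p_j in R^d with interleaved sine and cosine entries."""
    p = []
    for k in range(d):
        # Components 2m and 2m+1 share the frequency base**(-2m/d).
        angle = j / base ** ((k - k % 2) / d)
        p.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return p

print(sinusoidal_position(0, 4))
print(sinusoidal_position(1, 4))
```

Because the vectors come from a formula, they can be evaluated at any position index, including ones longer than the sequences seen in training.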
Relative Position Embedding
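Recall that the sequence entry representations interact through the self-attention operation
\[Z_i=\sum_{j=1}^s\alpha_{ij}V_j,\]
where
\[\alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{l=1}^s\exp(e_{il})},\qquad e_{ij}=\frac{\langle Q_i,K_j\rangle}{\sqrt d},\]
\(Q_i,K_j,V_j\in\mathbf R^d\) denote the query, key and value vectors at positions \(i\) and \(j\), and \(s\) is the sequence length.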
The idea in relative position embedding is to inject positional information during this quadratic operation.
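A general early form [5] is to replace the above formula by
\[Z_i=\sum_{j=1}^s\alpha_{ij}\bigl(V_j+(A_V)_{ij}\bigr),\qquad e_{ij}=\frac{\langle Q_i,K_j+(A_K)_{ij}\rangle}{\sqrt d},\]
where \(Q_i,K_j,V_j\in\mathbf R^d\) are the query, key and value vectors, \(\alpha_{ij}\) is the softmax of the logits \(e_{ij}\) over \(j\), and the 3-tensors \(A_K,A_V\in\mathbf R^{s\times s\times d}\) encode positional information.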
An important gain for text processing is that this way we can encode relative distances between tokens, which is often a more natural signal for language than absolute position. Thus, [5] proposes to form the 3-tensors with entries \((A_K)_{ij}=\mathbf p_{i-j}^K\) and \((A_V)_{ij}=\mathbf p_{i-j}^V\), using relative position vectors \(\mathbf p_{i-j}^K,\mathbf p_{i-j}^V\in\mathbf R^d\), and to make these vectors learnable.
For an efficient implementation that exploits the repetitions in the tensors \(A_K\) and \(A_V\), see [5, Subsection 3.3].
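The repetition in \(A_K\) can be seen in a small sketch: only \(2s-1\) distinct vectors are stored, one per offset \(i-j\), instead of \(s^2\). The logit formula \(e_{ij}=\langle Q_i,K_j+\mathbf p^K_{i-j}\rangle/\sqrt d\) used below follows [5] as far as I can tell and should be read as an assumption, as should the random placeholder values and the name `relative_logits`. This is the naive quadratic form, not the efficient implementation from [5].

```python
import math
import random

random.seed(0)
s, d = 4, 2

# One learnable vector per relative offset i - j in {-(s-1), ..., s-1};
# random placeholders stand in for trained values.
rel_k = {off: [random.gauss(0, 1) for _ in range(d)]
         for off in range(-(s - 1), s)}

def relative_logits(Q, K):
    """Attention logits e_ij = <Q_i, K_j + p^K_{i-j}> / sqrt(d)."""
    return [[sum(q * (k + r) for q, k, r in zip(Q[i], K[j], rel_k[i - j]))
             / math.sqrt(d)
             for j in range(s)] for i in range(s)]

# With identical queries and zero keys, the logits depend only on i - j.
Q = [[1.0, -0.5] for _ in range(s)]
K = [[0.0, 0.0] for _ in range(s)]
E = relative_logits(Q, K)
print(E[0][1], E[1][2], E[2][3])
```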
Rotary Position Embedding
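Another idea is to make a multiplicative change
\[e_{ij}=\frac{Q_i^\top R_{ij}K_j}{\sqrt d}\]
in the attention logits, using some 4-tensor \(R\in\mathbf R^{s\times s\times d\times d}\), where \(Q_i,K_j\in\mathbf R^d\) are the query and key vectors. In Rotary Position Embedding (RoPE) [6], they use the fixed tensors given by the block-diagonal matrices
\[R_{ij}=\operatorname{diag}\bigl(\rho((j-i)\theta_1),\dots,\rho((j-i)\theta_{d/2})\bigr),\qquad \rho(\varphi)=\begin{pmatrix}\cos\varphi&-\sin\varphi\\ \sin\varphi&\cos\varphi\end{pmatrix},\]
where
\[\theta_k=10000^{-2(k-1)/d},\qquad k=1,\dots,d/2\]
(we assume \(d\) is even).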
For an efficient implementation that exploits the sparsity of the tensor \(R\), see [6, Subsubsection 3.4.2].
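The sparsity can be used by never forming \(R\) at all: each query and key vector is rotated pairwise by an angle proportional to its own position, and the dot product of the rotated vectors then depends only on the position offset. The sketch below assumes the frequencies \(\theta_k=10000^{-2k/d}\) for \(k=0,\dots,d/2-1\) (zero-based indexing) and a pairing of consecutive components, following [6] as far as I can tell; the function names are illustrative.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2k], x[2k+1]) by the angle pos * theta_k."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for k in range(d // 2):
        theta = base ** (-2 * k / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * k], x[2 * k + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# The same offset (3) at two different absolute positions gives the same score.
score_a = dot(rope(query, 2), rope(key, 5))
score_b = dot(rope(query, 10), rope(key, 13))
print(abs(score_a - score_b) < 1e-9)
```

Each vector is processed in time linear in \(d\), instead of multiplying by a \(d\times d\) matrix for every pair of positions.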
References
[1] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10:291–306.
[2] Hugging Face, Summary of the tokenizers. Last opened on May 6, 2025.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NeurIPS 2017).
[5] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.
[6] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, vol. 568. doi: 10.1016/j.neucom.2023.127063