Extra Homework 2

Let's train a transformer model on the dair-ai/emotion dataset, using fixed positional encoding.

Note that in Notebook 0430, we used a transformer model with continuous input. To adapt it to text input, we need to put an embedding layer in front of it. You can start from the Embedding layer you created in Notebook 0409. Make the following changes:

1. Use dictionary inputs and outputs.
2. Implement the option to include fixed absolute positional encoding.

import torch


class Embedding(torch.nn.Module):
    """
    Ensemble-ready embedding.

    Arguments
    ---------
    config : `dict`
        Configuration dictionary. Required key-value pairs:
        `"device"` : `str`
            The device to store parameters on.
        `"ensemble_shape"` : `tuple[int]`
            The shape of the ensemble of embeddings
            the model represents.
    embedding_dim : `int`
        The number of embedding dimensions.
    vocabulary_size : `int`
        The number of vocabulary entries.
    positional_encoding : `bool`
        Whether to add positional encoding to the embedding,
        as described in the paper "Attention is All You Need".
        Default: `False`.

    Calling
    -------
    Instance calls require one positional argument:
    batch : `dict`
        The input data dictionary. Required key:
        `"token_ids"` : `torch.Tensor`
            Tensor of token IDs. It must have one of the following shapes:
            1. `ensemble_shape + batch_shape`
            2. `batch_shape`

        On a call, the model assumes the first case
        if the first `len(ensemble_shape)` entries of the
        input tensor's shape equal `ensemble_shape`.
    """

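As a starting point for the positional encoding option, here is a minimal sketch of the fixed sinusoidal encoding from "Attention is All You Need". The function name `sinusoidal_positional_encoding` and its arguments are illustrative assumptions, not part of the course code, and the sketch assumes an even embedding dimension.

```python
import math

import torch


def sinusoidal_positional_encoding(sequence_length, embedding_dim, device="cpu"):
    """
    Fixed absolute positional encoding (hypothetical helper):
    PE[pos, 2i]     = sin(pos / 10000^(2i / d))
    PE[pos, 2i + 1] = cos(pos / 10000^(2i / d))
    Assumes `embedding_dim` (d) is even.
    """
    positions = torch.arange(sequence_length, device=device, dtype=torch.float32)
    # Inverse frequencies 1 / 10000^(2i / d) for i = 0, ..., d / 2 - 1
    inverse_frequencies = torch.exp(
        torch.arange(0, embedding_dim, 2, device=device, dtype=torch.float32)
        * (-math.log(10000.0) / embedding_dim)
    )
    angles = positions[:, None] * inverse_frequencies  # (sequence_length, d / 2)
    encoding = torch.zeros(sequence_length, embedding_dim, device=device)
    encoding[:, 0::2] = angles.sin()  # even feature indices
    encoding[:, 1::2] = angles.cos()  # odd feature indices
    return encoding


# Toy usage: embeddings of shape batch_shape + (sequence_length, embedding_dim)
embedded = torch.zeros(3, 10, 32)
encoded = embedded + sinusoidal_positional_encoding(10, 32)
```

Because the encoding has shape `(sequence_length, embedding_dim)`, the addition broadcasts over any leading batch or ensemble dimensions, so the same tensor can serve both input shapes described in the docstring.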
Load the dataset, just like in Homework 10.
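If you do not have Homework 10 at hand, the following is a rough sketch of one way to get from raw text to a `"token_ids"` tensor. It assumes the Hugging Face `datasets` package and whitespace tokenization; the helpers `load_emotion`, `build_vocabulary`, and `encode` are hypothetical names and may differ from what Homework 10 used.

```python
import torch


def load_emotion():
    """Download dair-ai/emotion; needs the `datasets` package and network access."""
    from datasets import load_dataset  # lazy import, so the helpers below work offline
    return load_dataset("dair-ai/emotion")


def build_vocabulary(texts):
    """Map each whitespace-separated token to an ID; 0 is padding, 1 is unknown."""
    vocabulary = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in text.split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(texts, vocabulary, max_length):
    """Return a (len(texts), max_length) tensor of token IDs, padded with 0."""
    token_ids = torch.zeros(len(texts), max_length, dtype=torch.long)
    for row, text in enumerate(texts):
        ids = [vocabulary.get(token, 1) for token in text.split()][:max_length]
        token_ids[row, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    return token_ids


# Toy sentences standing in for the dataset's "text" column
toy_texts = ["i feel fine", "i feel so very tired today"]
toy_vocabulary = build_vocabulary(toy_texts)
toy_batch = {"token_ids": encode(toy_texts, toy_vocabulary, max_length=5)}
```

With the real data, you would call `load_emotion()` and pass the `"text"` column of the training split to `build_vocabulary` and `encode`; the `"label"` column holds the class indices.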
Build a transformer as a sequential model consisting of:
1. an embedding, with embedding dimension 32, and positional encoding
2. dropout
3. two transformer blocks (each a pair of a multi-head attention block and a feedforward block), with 4 attention heads each
4. mean pooling
5. a linear layer, to map the 32 embedding dimensions to the number of classes in the dataset.
Train the model on the dataset, using train_supervised. Remember to set the keyword arguments out_features and target_key.
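To check shapes before wiring up the ensemble-ready blocks, the five components listed above can be sketched with stock PyTorch layers. This is only an illustration: `TinyTransformerClassifier`, the `"logits"` output key, the dropout rate, and the feedforward width are assumptions, and the course's own blocks and `train_supervised` are not used here. The mean is taken over all sequence positions, padding included.

```python
import math

import torch


def positional_encoding(sequence_length, embedding_dim):
    """Sinusoidal encoding of shape (sequence_length, embedding_dim), even dim assumed."""
    positions = torch.arange(sequence_length, dtype=torch.float32)[:, None]
    frequencies = torch.exp(
        torch.arange(0, embedding_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embedding_dim)
    )
    encoding = torch.zeros(sequence_length, embedding_dim)
    encoding[:, 0::2] = torch.sin(positions * frequencies)
    encoding[:, 1::2] = torch.cos(positions * frequencies)
    return encoding


class TinyTransformerClassifier(torch.nn.Module):
    """Hypothetical stand-in built from stock PyTorch layers, not the course's blocks."""

    def __init__(self, vocabulary_size, class_num, embedding_dim=32,
                 head_num=4, block_num=2, dropout_p=0.1):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocabulary_size, embedding_dim)
        self.dropout = torch.nn.Dropout(dropout_p)
        layer = torch.nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=head_num,
            dim_feedforward=4 * embedding_dim,  # assumed width
            dropout=dropout_p, batch_first=True,
        )
        self.blocks = torch.nn.TransformerEncoder(layer, num_layers=block_num)
        self.linear = torch.nn.Linear(embedding_dim, class_num)

    def forward(self, batch):
        token_ids = batch["token_ids"]               # (batch, sequence)
        features = self.embedding(token_ids)         # (batch, sequence, 32)
        encoding = positional_encoding(token_ids.shape[-1], features.shape[-1])
        features = features + encoding.to(features.device)
        features = self.blocks(self.dropout(features))
        pooled = features.mean(dim=-2)               # mean over sequence positions
        # Dictionary output: keep the inputs and add the class scores
        return {**batch, "logits": self.linear(pooled)}


# Toy forward pass with random token IDs
model = TinyTransformerClassifier(vocabulary_size=100, class_num=6)
output = model({"token_ids": torch.randint(0, 100, (8, 12))})
```

The output dictionary carries the inputs through and adds one row of class scores per example, which is the kind of key you would point `out_features` style arguments at in your own training setup.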