{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d92520d0",
   "metadata": {},
   "source": [
    "# Convex Hull Volume with Transformers\n",
    "\n",
    "Note that, as we defined it, a transformer is a permutation-equivariant model. We will discuss next time how to imbue it with positional information. For now, we shall train a transformer on our permutation-invariant task of choice: convex hull volume."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20d8b89a",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Imports\n",
    "\n",
    "Import `defaultdict`, `matplotlib.pyplot` as `plt`, `ConvexHull` from `scipy.spatial`, `torch`, and `torch.nn.functional` as `F`.\n",
    "\n",
    "Moreover, import the following:\n",
    "1. The function `get_mlp`, that you wrote in Notebook 0319.\n",
    "1. The class `AdamW`, that you wrote in Notebook 0326.\n",
    "2. The function `pbt_init`, that you wrote in Notebook 0328.\n",
    "3. The functions `get_dataloader_random_reshuffle` and `to_ensembled`, that you wrote in Notebook 0416.\n",
    "4. The classes `DictReLU` and `Linear`, and the functions `evaluate_model` and `get_output_by_batches`, that you wrote in Notebook 0423.\n",
    "5. The classes `Dropout` and `LayerNorm`, and the function `train_supervised`, that you wrote in Notebook 0425."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88dc971c",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e4e30f1",
   "metadata": {},
   "source": [
    "### Configuration\n",
    "\n",
    "Create a configuration dictionary with the following keys:\n",
    "- `\"device\"`: `torch.device | int | str`  \n",
    "    The device identifier, explained in notebook 091224.\n",
    "- `\"ensemble_shape\"`: `tuple[int]`  \n",
    "    Make this `(16,)`.\n",
    "- `\"hyperparameter_raw_init_distributions\"`, `\"hyperparameter_raw_perturb\"`, `\"hyperparameter_transforms\"` : `dict`  \n",
    "    These three dictionaries are going to determine how the hyperparameters are tuned. We'll tune the following hyperparameters:\n",
    "    1. Epsilon $\\epsilon$.\n",
    "    2. Learning rate $\\eta$.\n",
    "    3. Weight decay $\\lambda$.\n",
    "    4. First moment moving average decay rate $\\beta_1$.\n",
    "    5. Second moment moving average decay rate $\\beta_2$.\n",
    "    1. Dropout probability $p$.\n",
    "\n",
    "    Of these, we don't know the required order of magnitude of the first three. Thus it may be good to make them distributed along $10^\\mathscr D$ where $\\mathscr D$ is a normal or uniform distribution. You can try to center the distributions at the recommended values.\n",
    "\n",
    "    We know that the recommended values of the fourth and fifth are $0.9$ and $0.999$. So it may be best to give them a distribution of the form $1-10^\\mathscr D$.\n",
    "\n",
    "    We know that the dropout probability should be in the unit interval $[0,1]$. Moreover, it may not help if we zero more than half of the neurons. Thus, let's make its raw initial distribution the uniform distribution on $[0, 0.5]$. For raw perturb, maybe we can use a normal distribution with center $0$ and std $0.1$. For transform function, I recommend clipping the values at $0$ and $1$ as they should be probabilities.\n",
    "- `\"improvement_threshold:`: `float`  \n",
    "    Make this `1e-4`.\n",
    "- `\"minibatch_size\"`: `int`  \n",
    "    Make this `32`.\n",
    "- `\"minibatch_size_eval\"`: `int`  \n",
    "    On my home computer, I can make this `128`.\n",
    "- `\"pbt\"` : `bool`  \n",
    "    Make this `True`.\n",
    "- `\"seed\"`: `int`  \n",
    "    This is for reproducible experiments. Insert any integer.\n",
    "- `\"steps_num\"`: `int`  \n",
    "    Make this `10_001`.\n",
    "- `\"steps_without_improvement`: `int`  \n",
    "    Make this `1000`.\n",
    "- `\"valid_interval\"`: `int`  \n",
    "    Make this `100`.\n",
    "- `\"welch_confidence_level\"`: `float`  \n",
    "    We will exploit based on a one-sided Welch $t$-test with this confidence level. Based on my experiments in the setting of Homework 9, maybe you can try `.8`. Feel free to try out various values here!\n",
    "- `\"welch_sample_size\"`: `int`  \n",
    "    We will exploit based on a one-sided Welch $t$-test on the last this many validation metrics of the population members. To follow the PBT paper, make this `10`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fb6f44d",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73ad9d6e",
   "metadata": {},
   "source": [
    "Set the `torch` pseudo-random number generation seed as per the configuration dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5274e83c",
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.manual_seed(config[\"seed\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "381c2d3a",
   "metadata": {},
   "source": [
    "### Dataset Generation\n",
    "\n",
    "Change `generate_convex_hull_dataset` so that the vertices are at key `features`. Then, generate train, validation and test convex hull volume datasets, with `80_000`, `10_000`, and `10_000` dataset entries. In each, point clouds should have between `10` and `100` entries.\n",
    "\n",
    "Afterwards, create a train dataloader, and get a minibatch out of it. Print keys and value shapes of the minibatch dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab610367",
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_convex_hull_dataset(\n",
    "    ambient_dim: int,\n",
    "    dataset_size: int,\n",
    "    device: int | str | torch.device,\n",
    "    subset_size_max: int,\n",
    "    subset_size_min: int,\n",
    "    std=1.\n",
    ") -> dict:\n",
    "    \"\"\"\n",
    "    Generate a supervised dataset mapping point clouds\n",
    "    to the volumes of their convex hulls.\n",
    "\n",
    "    The number of points in a point cloud\n",
    "    is sampled with uniform distribution from the closed interval\n",
    "    `[subset_size_min, subset_size_max]`\n",
    "    and the coordinates of the points are sampled from\n",
    "    the normal distribution with center `0.` and std `std`.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    ambient_dim : `int`  \n",
    "        The dimension of the Euclidean space\n",
    "        the point clouds are to be finite subsets of.\n",
    "    dataset_size : `int`  \n",
    "        The number of dataset entries to generate.\n",
    "    device : `int | str | torch.device`\n",
    "        Device to store the dataset on.\n",
    "    subset_size_max : `int`  \n",
    "        The maximum number of points in a point cloud.\n",
    "    subset_size_min : `int`  \n",
    "        The minimum number of points in a point cloud.\n",
    "    std : `float`, optional  \n",
    "        The std of the normal distribution\n",
    "        the point coordinates are sampled from.\n",
    "        Default: `1.`\n",
    "\n",
    "    Returns \n",
    "    -------\n",
    "    The dataset, in the form of a dictionary with\n",
    "    `torch.Tensor`-valued keys `\"indptr\"`, `\"features\"` and `\"volume\"`,\n",
    "    which store the dataset as follows:\n",
    "    The `i`-th dataset entry has vertices `features[indptr[i]:indptr[i+1]]`\n",
    "    and volume `volume[i]`.\n",
    "\n",
    "    For ease of use with supervised learning algorithms,\n",
    "    the tensor `volume` is unsqueezed and has shape `(dataset_size, 1)`.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb56c3db",
   "metadata": {},
   "source": [
    "## Assembling a Transformer\n",
    "\n",
    "To approximate convex hull volumes, we'll create a small transformer model with embedding dimension 32, and 4 attention heads. Note that this will make the key dimension 8."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f89bf3c",
   "metadata": {},
   "source": [
    "### Affine Embeddings\n",
    "\n",
    "Note that the input has 2-dimensional features. Thus, we first need to map the features to 32-dimensional vectors. We can use a linear layer for this. You can call it `embedding`. When created, apply it to the input minibatch, and print keys and value shapes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc0d500d",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b0746e5",
   "metadata": {},
   "source": [
    "### `MultiHeadSelfAttentionBlock`\n",
    "\n",
    "Let's implement a Pre-LN multi-head self-attention (MHSA) block.\n",
    "\n",
    "1. In initialization, you should create the following layers:\n",
    "    1. `LayerNorm`\n",
    "    2. Four `Linear` layers, with input and output dimensions `embedding_dim`, to give the key, query, value, and output weight matrices. Note that you'll not need biases.\n",
    "    3. `Dropout`.\n",
    "\n",
    "2. Now when implementing the `forward` method, you can follow the formula:\n",
    "    1. Get an extra reference to the input features, as we'll need them for the skip connetion.\n",
    "    2. Use layer norm.\n",
    "    3. Get key, query and value tensors via their respective weight matrices.\n",
    "        1. Reshape and transpose them to get them in shape `(population_size, minibatch_size, attention_head_num, sequence_length, key_dim)`.\n",
    "        2. Now you can use `F.scaled_dot_product_attention` to calculate the values aggregated by attention. I recommend using this function as it will select the best possible attention implementation, that is available.\n",
    "            1. Recall that the minibatches include a `\"mask\"` entry of shape `(..., sequence_length)`, that say which entries are not padding entries. This information can be fed to the function via the `attn_mask` keyword argument. The attention mask should have shape `(..., sequence_length, sequence_length)`. An entry `[..., i, j]` should be nonzero if and only if either\n",
    "                1. we have `mask[..., i]` and `mask[..., j]` or\n",
    "                2. we have `i == j` (we need this to make sure that softmax doesn't divide by zero).\n",
    "\n",
    "            To create the attention mask, you can use the bitwise and `&` and or `|` operations and broadcasting.\n",
    "\n",
    "    4. Apply the output weight matrix, and then dropout, to the aggregated value vectors.\n",
    "    5. Add the result to the skip connetion. The sum should make the new features entry in the minibatch that is output by the `forward` method.\n",
    "\n",
    "When you're done, apply `pbt_init` to the configuration dictionary, so that the dropout probabilities get initialized , then create a MHSA block, apply it to the output of the embedding layer, and finally print keys and value shapes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "374fe913",
   "metadata": {},
   "outputs": [],
   "source": [
    "class MultiHeadSelfAttentionBlock(torch.nn.Module):\n",
    "    \"\"\"\n",
    "    Pre-LN multi-head self-attention block.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    attention_head_num : `int`\n",
    "        The number of attention heads.\n",
    "    config : `int`\n",
    "        Configuration dictionary. Required key-value pairs:\n",
    "        `\"device\"` : `str`\n",
    "            The device to store parameters on.\n",
    "        `\"dropout_p\"` : `torch.Tensor`\n",
    "            Dropout probability tensor, of shape `ensemble_shape`.\n",
    "        `\"ensemble_shape\"` : `tuple[int]`\n",
    "            Ensemble shape.      \n",
    "    embedding_dim : `int`\n",
    "        The feature dimension of internal representations.\n",
    "\n",
    "    Calling\n",
    "    -------\n",
    "    Instance calls require one positional argument:\n",
    "    batch : `dict`\n",
    "        The input data dictionary. Required keys:\n",
    "        `\"features\"` : `torch.Tensor`\n",
    "            Tensor of element-level features, of shape\n",
    "            `batch_shape + (sequence_dim, embedding_dim)` or\n",
    "            `ensemble_shape + batch_shape + (sequence_dim, embedding_dim)`\n",
    "        `\"mask\"` : `torch.Tensor`\n",
    "            Mask showing which entries are not padding, of shape\n",
    "            `batch_shape + (sequence_dim,)` or\n",
    "            `ensemble_shape + batch_shape + (sequence_dim,)`\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f48314e",
   "metadata": {},
   "source": [
    "### `FeedForwardBlock`\n",
    "\n",
    "One more block to go! This one is pretty straightforward. Once again, test output shape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eeeca17a",
   "metadata": {},
   "outputs": [],
   "source": [
    "class FeedForwardBlock(torch.nn.Module):\n",
    "    \"\"\"\n",
    "    Pre-LN feedforward block.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    config : `int`\n",
    "        Configuration dictionary. Required key-value pairs:\n",
    "        `\"device\"` : `str`\n",
    "            The device to store parameters on.\n",
    "        `\"dropout_p\"` : `torch.Tensor`\n",
    "            Dropout probability tensor, of shape `ensemble_shape`.\n",
    "        `\"ensemble_shape\"` : `tuple[int]`\n",
    "            Ensemble shape.  \n",
    "    embedding_dim : `int`\n",
    "        The feature dimension of internal representations.\n",
    "\n",
    "    Calling\n",
    "    -------\n",
    "    Instance calls require one positional argument:\n",
    "    batch : `dict`\n",
    "        The input data dictionary. Required key:\n",
    "        `\"features\"` : `torch.Tensor`\n",
    "            Tensor of element-level features, of shape\n",
    "            `batch_shape + (sequence_dim, embedding_dim)` or\n",
    "            `ensemble_shape + batch_shape + (sequence_dim, embedding_dim)`\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79bd78c9",
   "metadata": {},
   "source": [
    "### `MeanPool`\n",
    "\n",
    "Write a layer that aggregates its input along the sequential dimension. Try it out on the output of the feedforward block."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "147685b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "class MeanPool(torch.nn.Module):\n",
    "    \"\"\"\n",
    "    This module performs mean pool of sequential data\n",
    "    along the sequential dimension.\n",
    "\n",
    "    Arguments\n",
    "    ---------\n",
    "    config : `dict`\n",
    "        Configuration dictionary. Required key-value pair:\n",
    "        `\"ensemble_shape\"` : `tuple[int]`\n",
    "            The shape of the ensemble of models\n",
    "            that process the data.\n",
    "\n",
    "    Calling\n",
    "    -------\n",
    "    The expected input is a dictionary of the following\n",
    "    key-tensor pairs:\n",
    "    `\"features\"`\n",
    "        The feature tensor, of shape\n",
    "        `(..., sequence_dim, feature_dim)`\n",
    "    `\"mask\"`\n",
    "        This mask signals which entry is not a padding element.\n",
    "        It should have size `(..., sequence_dim)`.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b7b3028",
   "metadata": {},
   "source": [
    "### Stack the Blocks\n",
    "\n",
    "Now, you are ready to create the full model, using `torch.nn.Sequential`. It should include:\n",
    "1. An embedding layer,\n",
    "1. A dropout layer,\n",
    "2. A couple of pairs of MHSA and Feedforward blocks, collectively referred to as *transformer blocks*, and\n",
    "3. A mean pool layer.\n",
    "4. An affine layer, to map the 32 embedding dimensions to the 1 target dimension.\n",
    "\n",
    "Make the full model, apply it to a new minibatch, and check output tensor shapes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21fbe760",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca8918ae",
   "metadata": {},
   "source": [
    "## Training\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf72b5d8",
   "metadata": {},
   "source": [
    "Make an `AdamW` optimizer for the model, and train it on the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "352ee337",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "937a8267",
   "metadata": {},
   "source": [
    "Test the best ensemble entry on the test split, with plotting the true and predicted convex hull areas, as we did in Notebook 0416."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13f4a1c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfb2c10f",
   "metadata": {},
   "source": [
    "## License\n",
    "\n",
    "This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2cbb5895",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dml",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
