{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "661d2ea8",
   "metadata": {},
   "source": [
    "# Implementing Dropout and Layer Normalization\n",
    "\n",
    "## Setup\n",
    "\n",
    "### Imports\n",
    "\n",
    "Import `ABC`, `abstractmethod`, `defaultdict`, `Callable`, `Iterable`, `datasets`, `itertools`, `Optional`, `os`, `torch` and `tqdm`.\n",
    "\n",
    "Moreover, import the following:\n",
    "1. The functions `get_accuracy` and `get_cross_entropy`, that you wrote in Notebook 0221.\n",
    "1. The function `normalize_features`, that you wrote in Notebook 0321.\n",
    "1. The classes `AdamW` and `Optimizer`, that you wrote in Notebook 0326.\n",
    "1. The functions `pbt_init` and `pbt_update`, that you wrote in Notebook 0328.\n",
    "3. The functions `get_dataloader_random_reshuffle` and `to_ensembled`, that you wrote in Notebook 0416.\n",
    "5. The classes `Conv2D`, `DictReLU`, `Linear`, and `Pool2D`, and the function `evaluate_model`,  that you wrote in Notebook 0423."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f184e86",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f79c4fc7",
   "metadata": {},
   "source": [
    "### Configuration\n",
    "\n",
    "Create a configuration dictionary with the following keys:\n",
    "- `\"dataset_path\"`: `str`  \n",
    "    Make this the ID of CIFAR-10 [10], to be found at https://huggingface.co/datasets/uoft-cs/cifar10.\n",
    "- `\"dataset_preprocessed_path\"` : `str`  \n",
    "    Below, I'm going to suggest saving the preprocessed dataset at this location.\n",
    "- `\"device\"`: `torch.device | int | str`  \n",
    "    The device identifier.\n",
    "- `\"ensemble_shape\"`: `tuple[int]`  \n",
    "    Make this `(16,)`.\n",
    "- `\"hyperparameter_raw_init_distributions\"`, `\"hyperparameter_raw_perturb\"`, `\"hyperparameter_transforms\"` : `dict`  \n",
    "    These three dictionaries are going to determine how the hyperparameters are tuned. We'll tune the following hyperparameters:\n",
    "    1. Epsilon $\\epsilon$.\n",
    "    2. Learning rate $\\eta$.\n",
    "    3. Weight decay $\\lambda$.\n",
    "    4. First moment moving average decay rate $\\beta_1$.\n",
    "    5. Second moment moving average decay rate $\\beta_2$.\n",
    "    1. Dropout probability $p$.\n",
    "\n",
    "    Of these, we don't know the required order of magnitude of the first three. Thus it may be good to make them distributed along $10^\\mathscr D$ where $\\mathscr D$ is a normal or uniform distribution. You can try to center the distributions at the recommended values.\n",
    "\n",
    "    We know that the recommended values of the fourth and fifth are $0.9$ and $0.999$. So it may be best to give them a distribution of the form $1-10^\\mathscr D$.\n",
    "\n",
    "    We know that the dropout probability should be in the unit interval $[0,1]$. Moreover, it may not help if we zero more than half of the neurons. Thus, let's make its raw initial distribution the uniform distribution on $[0, 0.5]$. For raw perturb, maybe we can use a normal distribution with center $0$ and std $0.1$. For transform function, I recommend clipping the values at $0$ and $1$ as they should be probabilities.\n",
    "- `\"improvement_threshold:`: `float`  \n",
    "    Make this `1e-4`.\n",
    "- `\"minibatch_size\"`: `int`  \n",
    "    Make this a `64`.\n",
    "- `\"minibatch_size_eval\"`: `int`  \n",
    "    On my home computer, I can make this `128`.\n",
    "- `\"pbt\"` : `bool`  \n",
    "    Make this `True`.\n",
    "- `\"seed\"`: `int`  \n",
    "    This is for reproducible experiments. Insert any integer.\n",
    "- `\"steps_num\"`: `int`  \n",
    "    Make this `10_001`.\n",
    "- `\"steps_without_improvement`: `int`  \n",
    "    Make this `10_000`.\n",
    "- `\"valid_interval\"`: `int`  \n",
    "    Make this `1000`.\n",
    "- `\"welch_confidence_level\"`: `float`  \n",
    "    We will exploit based on a one-sided Welch $t$-test with this confidence level. Based on my experiments in the setting of Homework 9, maybe you can try `.8`. Feel free to try out various values here!\n",
    "- `\"welch_sample_size\"`: `int`  \n",
    "    We will exploit based on a one-sided Welch $t$-test on the last this many validation metrics of the population members. To follow the PBT paper, make this `10`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92500ed8",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af797e3a",
   "metadata": {},
   "source": [
    "Set the `torch` pseudo-random number generation seed, as per the configuration dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "885d1a14",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf1a2ed1",
   "metadata": {},
   "source": [
    "## Load and Preprocess Dataset\n",
    "\n",
    "Just like in Notebook 0423, load and preprocess CIFAR-10.\n",
    "\n",
    "Actually, I suggest first checking if the path `\"dataset_preprocessed_path\"` exists. If not, then load and preprocess the dataset, then save it to this location.\n",
    "\n",
    "In either case, you can just load the preprocessed dataset afterwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "842bf118",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "626f0549",
   "metadata": {},
   "source": [
    "### Get Flattenet Datasets\n",
    "\n",
    "For quicker testing, first, we'll add dropout and layer normalization to an MLP. To be used with an MLP, first create train and validation split datasets with flattened features. Check if the feature tensors you got are 2-dimensional."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71f7031d",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_train = {\n",
    "    \"features\": train_features.flatten(1),\n",
    "    \"label\": train_labels\n",
    "}\n",
    "dataset_valid = {\n",
    "    \"features\": valid_features.flatten(1),\n",
    "    \"label\": valid_labels\n",
    "}\n",
    "\n",
    "for d in (dataset_train, dataset_valid):\n",
    "    for key in (\"features\", \"label\"):\n",
    "        print(key, d[key].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "240455e5",
   "metadata": {},
   "source": [
    "## Training Code Updates\n",
    "\n",
    "### `update_model`\n",
    "\n",
    "Note that we have a new hyperparameter `dropout_p`, that determines the probability that a given feature entry will be dropped by dropout. In our setup, hyperparameters are tracked in a dictionary. So far, we only needed to update the optimizer by the changes in this dictionary. Now, we need this for a model too. To this end, implement the function below. You can iterate over the submodules of a `torch.nn.Module` by its `modules` method. Among the iterates, if one has an attribute `config`, use its `update` method to send the hyperparameter updates to the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "595f2b91",
   "metadata": {},
   "outputs": [],
   "source": [
    "def update_model(\n",
    "    config: dict,\n",
    "    model: torch.nn.Module\n",
    "):\n",
    "    \"\"\"\n",
    "    Update the configuration dictionary of a model.\n",
    "    We iterate over its submodules and whichever has a `config` attribute,\n",
    "    we update it by the included `config` dictionary.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    config : `dict`\n",
    "        The updated configuration dictionary.\n",
    "    model : `torch.nn.Module`\n",
    "        The model to update.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23abc965",
   "metadata": {},
   "source": [
    "## `model.train()` and `model.eval()`\n",
    "\n",
    "We discussed in the lecture, that there are layers such as dropout and batch normalization, that behave differently during training and evaluation. You can switch between these modes by calling the `train` and `eval` methods of the model before training and evaluation steps, respectively.\n",
    "\n",
    "Make this change to the function `train_supervised` you wrote in Notebook 0423. Moreover, still in the function `train_supervised`, call the function `update_model` after calls to the functions `pbt_init` or `pbt_update`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ceb99cec",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffa2c5f1",
   "metadata": {},
   "source": [
    "### Train an MLP\n",
    "\n",
    "Create an MLP as `torch.Sequential` of `Linear` and `DictReLU` layers and an `AdamW` optimizer to optimize its parameters. Train it via `train_supervised`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "240b8607",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aee13798",
   "metadata": {},
   "source": [
    "Call `del` on the model and the optimizer to delete them and release memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b3e2db1",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "860090bf",
   "metadata": {},
   "source": [
    "## Implementing Dropout\n",
    "\n",
    "The built-in `Dropout` layer of `torch` records the dropout probability at the attribute `p`: [click here](https://github.com/pytorch/pytorch/blob/78953ee1223391df5c162ac6d7e3eb70294a722e/torch/nn/modules/dropout.py#L35) to access the source code.\n",
    "\n",
    "We would like to tune the dropout probability through our PBT machinery, which stores hyperparameters in a configuration dictionary. Thus, we'll write our own `Dropout` layer.\n",
    "\n",
    "Note that the `\"dropout_p\"` entry of the configuration dictionary will be a tensor of shape `ensemble_shape` of dropout probabilities. So, upon receiving a feature tensor (in our present setting, as the `\"features\"` entry of a data dictionary):\n",
    "1. Check if the model is in training mode, via its `training` attribute. If not, just return the input batch dictionary.\n",
    "2. Apply the `to_ensembled` function to the feature tensor, to make sure that it includes ensemble dimensions.\n",
    "3. Broadcast the dropout probabilities to the right, by first adding an appropriate number of dimension 1's to the right of its shape, to match the shape of the feature tensor.\n",
    "4. Multiply the feature tensor by $\\frac{1}{1-p+\\epsilon}$, where $p$ are the dropout probabilities and $\\epsilon$ is a small number added for numerical stability.\n",
    "5. Get a sample from the uniform distribution on the unit interval of the same shape as the feature tensor.\n",
    "6. Get a mask of values where the sample is larger than the dropout probabilities.\n",
    "7. Multiplying the features with the mask, you can affect dropout.\n",
    "\n",
    "Write the layer, then create an MLP where in front of each affine transformation, you include a dropout layer. Train this too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e98a6e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Dropout(torch.nn.Module):\n",
    "    \"\"\"\n",
    "    Ensemble-ready dropout layer.\n",
    "\n",
    "    Arguments\n",
    "    ---------\n",
    "    config : `dict`\n",
    "        Configuration dictionary. Required key-value pairs:\n",
    "        `\"dropout_p\"` : `torch.Tensor`\n",
    "            Dropout probability tensor, of shape `ensemble_shape`.\n",
    "        `\"ensemble_shape\"` : `tuple[int]`\n",
    "            The shape of the ensemble of affine transformations\n",
    "            the model represents.\n",
    "\n",
    "    Calling\n",
    "    -------\n",
    "    Instance calls require one positional argument:\n",
    "    batch : `dict`\n",
    "        The input data dictionary. Required key:\n",
    "        `\"features\"` : `torch.Tensor`\n",
    "            Tensor of features.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b5630c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "614bee26",
   "metadata": {},
   "source": [
    "Let's again delete the model and the optimizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fca5bfb",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cc2a348",
   "metadata": {},
   "source": [
    "## Implementing Layer Normalization\n",
    "\n",
    "Now for layer normalization! Once again, the built-in `LayerNorm` is not up to the task, as we can't train different scale and offset tensors per ensemble member.\n",
    "\n",
    "Moreover, we include a keyword argument that is not in the built-in version: `normalized_offset`. We need this when we want to normalize along dimensions that are not the last in the shape. For example, today, after adding layer normalization layers to the MLP, we'll do the same for the CNN, but we'll only normalize along the feature dimension, that is before the sequence dimensions in the `torch` image processing convention.\n",
    "\n",
    "Write the new layer as per the docstrings. Then add layer normalization layers in front of the dropout layers in the MLP and train it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20c0def9",
   "metadata": {},
   "outputs": [],
   "source": [
    "class LayerNorm(torch.nn.Module):\n",
    "    \"\"\"\n",
    "    Ensemble-ready layer normalization layer\n",
    "\n",
    "    Arguments\n",
    "    ---------\n",
    "    config : `dict`\n",
    "        Configuration dictionary. Required key-value pairs:\n",
    "        `\"device\"` : `str`\n",
    "            The device to store parameters on.\n",
    "        `\"ensemble_shape\"` : `tuple[int]`\n",
    "            The shape of the ensemble of affine transformations\n",
    "            the model represents.\n",
    "    normalized_shape : `int | tuple[int]`\n",
    "        The part of the shape of the incoming tensors\n",
    "        that are to be normalized together with batch dimensions.\n",
    "        We view the following as batch dimensions:\n",
    "        ```\n",
    "        range(\n",
    "            len(ensemble_shape),\n",
    "            -len(normalized_shape) - normalized_offset\n",
    "        )\n",
    "        ```\n",
    "        If an integer, we view it as a single-element tuple.\n",
    "    bias : `bool`, optional\n",
    "        If `elementwise_affine`, whether to include offset\n",
    "        in the learned transformation. Default: `True`.\n",
    "    elementwise_affine : `bool`, optional\n",
    "        Whether to include learnable scale. If this and `bias`,\n",
    "        then we also include learnable offset. These will be tensors\n",
    "        of shape `ensemble_shape + normalized_shape` that are\n",
    "        broadcast to the incoming feature tensors appropriately.\n",
    "        Default: `True`.\n",
    "    epsilon : `float`, optional\n",
    "        Small positive value, to be included in the divisor when we\n",
    "        divide by the variance, for numerical stability. Default: `1e-5`.\n",
    "    normalized_offset : `int`, optional\n",
    "        We get `normalized_shape` out of an incoming feature tensor\n",
    "        at dimensions\n",
    "        ```\n",
    "        range(\n",
    "            -len(normalized_shape) - normalized_offset,\n",
    "            -normalized_offset\n",
    "        )\n",
    "        ```\n",
    "        Default: `0`.\n",
    "\n",
    "    Calling\n",
    "    -------\n",
    "    Instance calls require one positional argument:\n",
    "    batch : `dict`\n",
    "        The input data dictionary. Required key:\n",
    "        `\"features\"` : `torch.Tensor`\n",
    "            Tensor of features.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6bd6dd4",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f082b403",
   "metadata": {},
   "source": [
    "Delete the model and the optimizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f91d7de",
   "metadata": {},
   "outputs": [],
   "source": [
    "del model, optimizer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e806991",
   "metadata": {},
   "source": [
    "## Try it on a CNN!\n",
    "\n",
    "Take the CNN you used last time and give it layer normalization (set `normalize_offset` so that you only normalize along the channel dimension) and dropout layers similary to how you did for the MLP. Train it on training and validation datasets with the unflattened features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d254075c",
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7e92286",
   "metadata": {},
   "source": [
    "## Dataset References\n",
    "\n",
    "[5] Alex Krizhevsky: *Learning Multiple Layers of Features from Tiny Images*. 2009. https://www.cs.toronto.edu/~kriz/cifar.html"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19ea7254",
   "metadata": {},
   "source": [
    "## License\n",
    "\n",
    "This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dml",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}