{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup\n",
    "\n",
    "## Imports\n",
    "\n",
    "1. Import the previously used `defaultdict`, `Generator`, `Sequence`, `plt`, `torch`, `F` and `tqdm`.\n",
    "2. Import `matplotlib` as `mpl` by convention. We'll use this to get a colormap when plotting training curves with different learning rates.\n",
    "3. Import `seaborn` as `sns` by convention. We'll use it to summarize the results of our grid search in a heatmap.\n",
    "4. Put the following functions:\n",
    "    1. `get_accuracy`, created in Notebook 0221\n",
    "    2. `get_cross_entropy`, created in Notebook 0221\n",
    "    3. `get_dataloader_random_reshuffle`, created in Notebook 0221\n",
    "    4. `line_plot_confidence_band`, created in Notebook 0221\n",
    "    5. `load_preprocessed_dataset`, created in Notebook 0219\n",
    "\n",
    "    into a file, or update your previous collection of functions and import them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constants\n",
    "\n",
    "Create a configuration dictionary with the following keys:\n",
    "- `\"dataset_preprocessed_path\"`: `str`  \n",
    "    * If you saved the preprocessed MNIST dataset in notebook 0219,\n",
    "        this should be the path to that file.\n",
    "    * Otherwise, please run notebook 0219 with solutions\n",
    "        so that the preprocessed dataset is saved into `data/mnist.pt`\n",
    "- `\"device\"`: `torch.device | int | str`  \n",
    "    The device identifier.\n",
    "- `\"ensemble_shape\"`: `tuple[int]`  \n",
    "    Make this `(7, 10)`, as we'll first try out the learning rates $\\{10^i:i=-2,-1.5,\\dotsc,1\\}$, each one 10 times.\n",
    "- `\"improvement_threshold\"`: `float`  \n",
    "    A new validation score must be better than the best score so far by more than this amount to count as an improvement. Make this `1e-4`, which is $10^{-4}$ in *scientific notation*. More generally, if `a` is a number and `b` is an integer, then the literal `aeb` denotes $\\mathtt a\\cdot10^\\mathtt b$.\n",
    "- `\"minibatch_size\"`: `int`  \n",
    "    Make this a `256`.\n",
    "- `\"seed\"`: `int`  \n",
    "    This is for reproducible experiments. Insert any integer.\n",
    "- `\"steps_num\"`: `int`  \n",
    "    Make this a `1000`.\n",
    "- `\"steps_without_improvement\"`: `int`  \n",
    "    If the number of training steps taken without improving any best score reaches this value, we stop training. Make this `100`.\n",
    "- `\"valid_interval\"` : `int`  \n",
    "    The frequency of model evaluation during training,\n",
    "    measured in train steps. Make this a `10`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating the Learning Rate Tensor\n",
    "\n",
    "One of the improvements we'll make to the training loop is to make it possible to vary the learning rate across ensemble members.\n",
    "\n",
    "We want to try out each learning rate in $\\{10^i:i=-2, -1.5,\\dotsc,1\\}$ 10 times. Therefore, we want a tensor of shape `(7, 1)` whose `(i, 0)` entry, with `i` counted from 0, is $10 ^ {\\frac{i-4}{2}}$ (the gradient descent step will broadcast along the second dimension).\n",
    "\n",
    "You can use the function `torch.logspace` for this. As we'll want a repeated experiment ensemble dimension to come after the one with the different learning rates, give this tensor's shape a 1 on the right.\n",
    "\n",
    "Print your final result. For debugging, you can start with printing the intermediate results."
   ]
  },
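  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hedged illustration of `torch.logspace` and of appending a dimension of size 1, here is a minimal sketch on a smaller, hypothetical grid of exponents ($-1, 0, 1$), not the grid the exercise asks for. The variable names are made up for this example.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "# Hypothetical grid: the exponents -1, 0, 1 instead of the seven\n",
    "# exponents of the exercise.\n",
    "rates = torch.logspace(-1, 1, steps=3)\n",
    "\n",
    "# Append a dimension of size 1 on the right, so that the rates can\n",
    "# broadcast along a second ensemble dimension.\n",
    "rates_column = rates.unsqueeze(-1)\n",
    "\n",
    "print(rates)\n",
    "print(rates_column.shape)\n",
    "```\n",
    "\n",
    "The same pattern should carry over to the seven learning rates above once you adjust the endpoints and the number of steps."
   ]
  },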
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the learning rate tensor looks OK, assign it to the `\"learning_rate\"` key of the configuration dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Seed the `torch` PRNG with the `\"seed\"` value of the configuration dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Dataset, Create Training Dataloader\n",
    "\n",
    "Load the dataset with `load_preprocessed_dataset`, then create a training dataloader with `get_dataloader_random_reshuffle`. Print the shape, device and datatype of tensors in a minibatch to check if all's well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Updating the Training Loop\n",
    "\n",
    "Take the `train_logistic_regression` function from Notebook 091924 and update it as follows:\n",
    "1. At initialization, create the following objects:\n",
    "    1. A floating point tensor of shape `ensemble_shape`, to keep track of the best validation accuracy per ensemble member. Initialize it with a value that is no better than the worst possible validation accuracy.\n",
    "    2. A floating point tensor of the same shape as the weights tensor, to keep track of the best weights (the one the best score was achieved with) per ensemble member.\n",
    "    \n",
    "        You can initialize it with `torch.empty_like`. This creates a tensor with the same shape, datatype and device as its first positional argument. Make the weights tensor that positional argument. The `requires_grad` keyword argument of `torch.empty_like` defaults to `False`, which is what we want here, as we do not need to track gradients for this tensor.\n",
    "        \n",
    "        *Empty* means that the memory is reserved but not initialized, so at first this tensor holds arbitrary values. We will overwrite them at the first validation.\n",
    "\n",
    "        Similarly, make a best bias vectors tensor, if you use bias vectors.\n",
    "    3. An integer, starting at 0, to keep track of the number of training steps taken without improvement to any best score, for early stopping.\n",
    "    4. A learning rate value, either a scalar or a tensor of shape broadcastable to that of the weights tensor.\n",
    "        1. First, you need to check if the learning rate value in the configuration dictionary is a tensor. You can do this with the `isinstance` built-in function, which returns whether the object in its first positional argument is an instance of the class in its second positional argument.\n",
    "        2. If we have a tensor, such as with our current configuration dictionary, then get it and append two dimensions of size 1 on the right of its shape.\n",
    "        3. Otherwise, just get the scalar.\n",
    "2. Make it possible for the learning rate to be a tensor of nonempty shape. This means to replace the use of `torch.optim.SGD` with the hands-on approach, just like we did in Notebook 0212. If done correctly, the multiplication of the gradient by the learning rate tensor will broadcast.\n",
    "3. At validation, after having computed the validation accuracy:\n",
    "    1. Create an improvement tensor by subtracting the best scores from the validation accuracy.\n",
    "    2. Create an improvement mask by asking which entries of the improvement tensor are greater than the improvement threshold, as specified in the configuration dictionary.\n",
    "    3. Check with the `torch.any` function or the `any` method of the mask whether any ensemble member improved.\n",
    "        1. If there was:\n",
    "            1. By indexing into the tensors with the mask, assign:\n",
    "                1. the masked portion of the validation accuracy to the best scores and\n",
    "                2. the masked portion of the current weights (and, optionally, bias vectors) to the best weights (and, optionally, bias vectors) and\n",
    "            2. set the steps without improvement counter to 0.\n",
    "        2. Otherwise, increment the steps without improvement counter by the validation interval, which is measured in training steps.\n",
    "4. At checking if the number of training steps reached the maximum number of training steps to decide if we should stop training:\n",
    "    1. insert the alternative condition that the number of training steps without improvement reached its maximum value and\n",
    "    2. if at least one of the stopping conditions holds and we are stopping training, update the output dictionary with the best scores and best weights."
   ]
  },
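  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The masked update of step 3 can be sketched in isolation as follows. This is a minimal sketch on hypothetical toy values (an ensemble of shape `(2, 2)` and weights with two trailing dimensions of size 1), not the training loop itself. All variable names are assumptions made for this example.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "# Hypothetical toy values, chosen so that two of four members improve.\n",
    "improvement_threshold = 1e-4\n",
    "valid_interval = 10\n",
    "steps_without_improvement = 30\n",
    "best_scores = torch.tensor([[0.50, 0.80], [0.70, 0.90]])\n",
    "valid_accuracy = torch.tensor([[0.60, 0.80], [0.65, 0.95]])\n",
    "weights = torch.ones(2, 2, 1, 1)\n",
    "best_weights = torch.zeros(2, 2, 1, 1)\n",
    "\n",
    "improvement = valid_accuracy - best_scores\n",
    "improvement_mask = improvement > improvement_threshold\n",
    "\n",
    "if improvement_mask.any():\n",
    "    # A Boolean mask of shape `ensemble_shape` indexes the leading\n",
    "    # dimensions, so only the improved members are overwritten.\n",
    "    best_scores[improvement_mask] = valid_accuracy[improvement_mask]\n",
    "    best_weights[improvement_mask] = weights[improvement_mask]\n",
    "    steps_without_improvement = 0\n",
    "else:\n",
    "    steps_without_improvement += valid_interval\n",
    "\n",
    "print(improvement_mask)\n",
    "print(best_scores)\n",
    "print(best_weights.squeeze())\n",
    "```\n",
    "\n",
    "In the actual training loop the weights track gradients, so this update presumably belongs inside a `torch.no_grad()` block, assuming that is where your evaluation code runs."
   ]
  },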
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_logistic_regression(\n",
    "    config: dict,\n",
    "    label_num: int,\n",
    "    train_dataloader: Generator[tuple[torch.Tensor, torch.Tensor]],\n",
    "    valid_features: torch.Tensor,\n",
    "    valid_labels: torch.Tensor,\n",
    "    use_bias=True\n",
    ") -> dict:\n",
    "    \"\"\"\n",
    "    Train a logistic regression model on a classification task.\n",
    "    Support model ensembles of arbitrary shape.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    config : dict\n",
    "        Configuration dictionary. Required keys:\n",
    "        ensemble_shape : tuple[int]\n",
    "            The shape of the model ensemble.\n",
    "        improvement_threshold : float\n",
    "            Making the best validation score this much better\n",
    "            counts as an improvement.\n",
    "        learning_rate : float | torch.Tensor\n",
    "            The learning rate of the SGD optimization.\n",
    "            If a tensor, then it should have shape\n",
    "            broadcastable to `ensemble_shape`.\n",
    "            In that case, the members of the ensemble are trained with\n",
    "            different learning rates.\n",
    "        steps_num : int\n",
    "            The maximum number of training steps to take.\n",
    "        steps_without_improvement : int\n",
    "            The maximum number of training steps without improvement to take.\n",
    "        valid_interval : int\n",
    "            The frequency of evaluations,\n",
    "            measured in the number of train steps.\n",
    "    label_num : int\n",
    "        The number of distinct labels in the classification task.\n",
    "    train_dataloader : Generator[tuple[torch.Tensor, torch.Tensor]]\n",
    "        A training minibatch dataloader, that yields pairs of\n",
    "        feature and label tensors indefinitely.\n",
    "        We assume that these have shape\n",
    "        `ensemble_shape + (minibatch_size, feature_dim)`\n",
    "        and `ensemble_shape + (minibatch_size,)`\n",
    "        respectively.\n",
    "    valid_features : torch.Tensor\n",
    "        Validation feature matrix.\n",
    "    valid_labels : torch.Tensor\n",
    "        Validation label vector.\n",
    "    use_bias : bool, optional\n",
    "        Whether to use a bias vector in the logistic regression model.\n",
    "        Default: `True`\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    An output dictionary with the following keys:\n",
    "        best scores : torch.Tensor\n",
    "            The best validation accuracy per each ensemble member\n",
    "        best weights : torch.Tensor\n",
    "            The logistic regression weights\n",
    "            that were the best per each ensemble member.\n",
    "        training accuracy : torch.Tensor\n",
    "            The tensor of training accuracies, of shape\n",
    "            `(evaluation_num,) + ensemble_shape`.\n",
    "        training cross-entropy : torch.Tensor\n",
    "            The tensor of training cross-entropies, of shape\n",
    "            `(evaluation_num,) + ensemble_shape`.\n",
    "        training steps : list[int]\n",
    "            The list of the number of training steps at each evaluation.\n",
    "        validation accuracy : torch.Tensor\n",
    "            The tensor of validation accuracies, of shape\n",
    "            `(evaluation_num,) + ensemble_shape`.\n",
    "        validation cross-entropy : torch.Tensor\n",
    "            The tensor of validation cross-entropies, of shape\n",
    "            `(evaluation_num,) + ensemble_shape`.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiments\n",
    "\n",
    "## Making Temporary Configuration Changes\n",
    "\n",
    "Now we'll want to run the training loop with temporary changes to the configuration dictionary. You can use the *bitwise or operator* `|` for this: given two dictionaries, it creates a new dictionary that contains the key-value pairs of the dictionary on the left, updated with those of the dictionary on the right. Neither original dictionary is modified.\n",
    "\n",
    "First, we want to train with 10 train steps only so that we can quickly test plotting the results. To record validation results, we can set the validation interval to 1.\n",
    "\n",
    "Make an appropriately changed copy of the configuration dictionary. Make sure to assign it to a variable named differently than the original configuration dictionary. Print the original and the new dictionary, to be safe."
   ]
  },
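  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how `|` behaves on dictionaries, here is a minimal sketch on a hypothetical toy configuration (the keys and values are placeholders, not the ones you should use):\n",
    "\n",
    "```python\n",
    "# Hypothetical toy configuration, not the one of this notebook.\n",
    "base_config = {\"steps_num\": 1000, \"valid_interval\": 10, \"seed\": 0}\n",
    "\n",
    "# On shared keys the right operand wins; `base_config` stays untouched.\n",
    "quick_config = base_config | {\"steps_num\": 5}\n",
    "\n",
    "print(base_config)\n",
    "print(quick_config)\n",
    "```\n",
    "\n",
    "Because the result is a new dictionary, you can keep the original configuration for the full run later."
   ]
  },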
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting Training Curves with Various Learning Rates\n",
    "\n",
    "1. Run training and assign the output to a variable.\n",
    "2. At the `validation accuracy` key of the output, you find a tensor of shape `(evaluation_num, 7, 10)` that records the validation accuracies at each evaluation of the ensemble members.\n",
    "3. We want to loop over pairs of learning rates and corresponding validation accuracy matrices to make a joint plot of them with `line_plot_confidence_band`.\n",
    "    1. To loop over the learning rate values, you can reshape the learning rate tensor in the configuration dictionary to a vector.\n",
    "    2. When you loop over a tensor, you are looping over its dimension of index 0. Therefore, to loop over the validation accuracy tensors corresponding to different learning rates, you need to swap dimensions in the validation accuracy tensor. You can use its method `transpose`.\n",
    "    3. To loop over two iterables at once and yield pairs from the two iterables, you can use the `zip` built-in function.\n",
    "4. Do this, and for each pair of learning rates and corresponding validation accuracy matrices, make a `line_plot_confidence_band` with:\n",
    "    1. training steps as x-values,\n",
    "    2. validation accuracies as y-values, and\n",
    "    3. the learning rate as label.\n",
    "        1. Use the `format` built-in function to transform the scalar tensor into a more readable string.\n",
    "            1. Make the first positional argument the learning rate.\n",
    "            2. Make the second positional argument `\".2f\"`. This will output the scalar in fixed-point notation, with 2 digits following the decimal point. You can read about format string syntax here:  \n",
    "            https://docs.python.org/3/library/string.html#formatspec\n",
    "5. Finish the diagram:\n",
    "    1. Plot the legend with `plt.legend`. Specify in the `title` keyword argument that the values are learning rates.\n",
    "    2. Set a title and axis labels with `plt.title`, `plt.xlabel` and `plt.ylabel`.\n",
    "    3. Show and clear the canvas.\n",
    "6. If the plot looks OK, replace the modified configuration dictionary with the original and rerun the cell."
   ]
  },
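  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The combination of `zip` and `format` can be tried out without any tensors. The following is a minimal sketch with hypothetical stand-ins: plain floats instead of scalar tensors, and short lists instead of validation accuracy matrices.\n",
    "\n",
    "```python\n",
    "# Hypothetical stand-ins for the learning rates and their curves.\n",
    "learning_rates = [0.01, 0.1, 1.0]\n",
    "curves = [[0.2, 0.5], [0.4, 0.8], [0.3, 0.6]]\n",
    "\n",
    "labels = []\n",
    "for learning_rate, curve in zip(learning_rates, curves):\n",
    "    # \".2f\": fixed-point notation, 2 digits after the decimal point.\n",
    "    label = format(learning_rate, \".2f\")\n",
    "    labels.append(label)\n",
    "    print(label, curve)\n",
    "\n",
    "print(labels)\n",
    "```\n",
    "\n",
    "In the exercise, the body of the loop would call `line_plot_confidence_band` instead of `print`."
   ]
  },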
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using a colormap\n",
    "\n",
    "The seemingly random colors of the training curves can make this diagram somewhat difficult to interpret. It would be better if the color changed in step with the learning rate. We can use an `mpl.colors.Colormap` for this. This is a color gradient, parametrized by the values on the unit interval $[0,1]$.\n",
    "\n",
    "1. Peruse the choices here:  \n",
    "https://matplotlib.org/stable/users/explain/colors/colormaps.html  \n",
    "and load a colormap by indexing into `mpl.colormaps` with the name of your chosen colormap.\n",
    "2. The `mpl.colors.Colormap` object you got is callable, that is, it behaves like a function. Calling it with an array or tensor of values in $[0,1]$ returns an array of the corresponding colors (as RGBA values). Call it with a `torch.linspace` of length suitable for giving one color per learning rate. Print what you get to see."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take the plotting loop of the previous cell.\n",
    "1. Insert the array of colors you got into the `zip`.\n",
    "2. Change the iteration so that it is over triples.\n",
    "3. Feed the color you get to the `color` keyword argument of `line_plot_confidence_band`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How would you interpret this diagram?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Box Plot of Best Scores\n",
    "\n",
    "Let's make a diagram that represents the distributions of best scores per learning rate. A common choice for this is a *box plot*. Just like confidence intervals, it also shows the spread of the distributions:  \n",
    "https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html\n",
    "\n",
    "1. The positional argument of `plt.boxplot` is expected to be a table, which is aggregated columnwise. Transform the best scores tensor accordingly. You may also need to change the device of the tensor.\n",
    "2. You can specify the labels of the boxes via the `tick_labels` keyword argument. Make this the `format`ted learning rates.\n",
    "3. Give your plot a title and axis labels. Show it and clear the canvas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Heatmap of Learning Rates and Minibatch Sizes\n",
    "\n",
    "Finally, let's perform a grid search on the set of learning rates as before and the set of minibatch sizes $\\{2^i:i=3,\\dotsc,10\\}$. As different minibatch sizes yield differently shaped minibatches, we can't make that into another ensemble dimension. So instead of the parallel approach, we stick to the serial one.\n",
    "\n",
    "1. As we will perform 8 times the previous training, if you don't have access to a GPU, change the ensemble shape to `(7, 1)` in the configuration dictionary. This will yield noisier results, but it will finish faster.\n",
    "2. Create an empty list to store best score tensors in.\n",
    "3. Iterate over the selected minibatch sizes.\n",
    "    1. For each minibatch size, create a copy of the configuration dictionary and update it with the minibatch size.\n",
    "    2. Create a new training dataloader with the modified configuration dictionary. We need to do this as the dataloader depends on the minibatch size.\n",
    "    3. Run training with the modified configuration dictionary and the new training dataloader.\n",
    "    4. Take the best scores tensor of the result.\n",
    "        1. Take its mean along the last dimension as we're recording sample means of different hyperparameters.\n",
    "        2. Change its device to `\"cpu\"`.\n",
    "        3. Append the result to the best scores list.\n",
    "4. Print the best scores list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's make a heatmap of these best scores! We'll do this via the function in `seaborn`:  \n",
    "https://seaborn.pydata.org/generated/seaborn.heatmap.html  \n",
    "1. The positional argument can be a table we'll make a heatmap out of. That is, we can make this the best scores list.\n",
    "2. Make the `annot` keyword argument `True` so that the values are printed on the heatmap.\n",
    "3. As the rows of the table are results of training with different minibatch sizes, make the `yticklabels` keyword argument the list of minibatch sizes.\n",
    "4. As a row of the table gives the results of training with different learning rates, make the `xticklabels` keyword argument a list of formatted learning rates.\n",
    "5. Give the colorbar a label by making the `cbar_kws` keyword argument a dictionary with single key `\"label\"` and single value the label.\n",
    "6. Set the plot title and give labels to the x and y axes.\n",
    "7. Display the canvas and clear it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What conclusions can you draw from the data you gathered:\n",
    "1. the training curves,\n",
    "2. the box plot,\n",
    "3. the times to train with different minibatch sizes and\n",
    "4. the heat map?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Datasets\n",
    "\n",
    "## MNIST\n",
    "\n",
    "https://huggingface.co/datasets/ylecun/mnist"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# License\n",
    "\n",
    "This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dml",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
