{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Welcome to the first lab notebook! I am giving you instructions on what you should code. You should replace the error calls\n",
    "```python\n",
    "raise NotImplementedError\n",
    "```\n",
    "by the required code. This way, if you run the entire notebook, it will stop at the first time you are yet to provide an answer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup\n",
    "\n",
    "## Imports\n",
    "\n",
    "By *import* we mean *importing a Python module or from a Python module*. We load the functionalities we want with this.\n",
    "More info on this:\n",
    "- https://docs.python.org/3/reference/import.html\n",
    "- https://stackoverflow.com/a/19198449\n",
    "\n",
    "I highly recommend keeping all imports on top of the notebook, preferably ordered in some way, for example using the lexicographic ordering.\n",
    "\n",
    "This once, I wrote the import statements for you. We are importing the following two modules:\n",
    "\n",
    "2. The `matplotlib` library: https://matplotlib.org/\n",
    "is the canonical visualization tool in python.\n",
    "Its `pyplot` module provides a visualization interface.\n",
    "It is by convention imported with the alias `plt`.\n",
    "3. We'll use `torch`: https://pytorch.org/ for most of our calculations.\n",
    "This is a machine learning library focused on Deep Learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constants\n",
    "\n",
    "I also recommend keeping constants, mostly configuration settings, on the top of the notebook. This way, it is easy to look them up later to change them or summarize them, for example when describing an experiment.\n",
    "\n",
    "Moreover, I suggest keeping the constants in a `dictionary`  \n",
    "https://docs.python.org/3/library/stdtypes.html#mapping-types-dict  \n",
    "that is a collection of key-value pairs. This way, it is easy to input the full collection of configuration settings to functions.\n",
    "\n",
    "This once, I created the configuration dictionary for you. We'll see as we go what the settings are for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = {\n",
    "    \"seed\": 1,\n",
    "    \"train_size\": .85\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set PRNG\n",
    "\n",
    "We can set the state of the pseudorandom number generator (PRNG) of `torch`, by calling the function `torch.manual_seed` and giving it as single argument the configuration setting `config[\"seed\"]`.\n",
    "\n",
    "1. Note that you can call the function as `torch.manual_seed`. This means that the function `manual_seed` is defined in the module `torch` that you imported.\n",
    "\n",
    "    Most generally used Python libraries have detailed documention that I recommend to get in the habit of perusing. For example, you can find the documentation for `torch.manual_seed` here:  \n",
    "    https://pytorch.org/docs/stable/generated/torch.manual_seed.html#torch-manual-seed  \n",
    "    You can see that there is a search block where you can look up other functions and classes in `torch` that we use.\n",
    "\n",
    "2. Note that you can get the seed we set via `config[\"seed\"]`. That is, we get the value of the dictionary `config` at key `\"seed\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.manual_seed(config[\"seed\"])"
   ]
  },
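  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a side note on why seeding is useful, here is a minimal sketch (not one of the lab tasks) that sets the same seed twice and checks that `torch.rand` then produces the same numbers. The names `first` and `second` are made up for this illustration.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "torch.manual_seed(1)\n",
    "first = torch.rand(3)   # three pseudorandom numbers\n",
    "\n",
    "torch.manual_seed(1)    # reset the generator to the same state\n",
    "second = torch.rand(3)\n",
    "\n",
    "# Same seed, same sequence of pseudorandom numbers.\n",
    "print(torch.equal(first, second))\n",
    "```\n",
    "\n",
    "This is what makes an experiment repeatable: running the notebook from the top with the same seed should give the same random choices, at least on the same machine and library version."
   ]
  },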
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic Tensor Indexing\n",
    "\n",
    "On extracting entries and subtensors, see Basic indexing here:  \n",
    "https://numpy.org/doc/stable/user/basics.indexing.html#basic-indexing\n",
    "\n",
    "`numpy` is a multidimensional array manipulation library.\n",
    "It does not offer GPU support or automatic differentiation.\n",
    "For our purposes, you can view it as a predecessor to `pytorch`.\n",
    "Array indexing works the same way.\n",
    "Only, you should use `None` instead of `np.newaxis`.\n",
    "\n",
    "I'm giving you a matrix and will ask you to print various parts of it.\n",
    "As the numbers in the matrix follow a clear pattern,\n",
    "you can easily doublecheck your work\n",
    "\n",
    "Use the `print` function to print the following parts:\n",
    "1. The second row\n",
    "2. The third column\n",
    "3. The entry that is the intersection of the above two\n",
    "4. The first two rows.\n",
    "5. The last three columns.\n",
    "6. Every second row, starting from the second.\n",
    "7. The last row as a tensor of shape `(6, 1)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t = torch.arange(30).reshape(5, 6)\n",
    "print(t)\n",
    "print(\n",
    "    t[1],\n",
    "    t[:, 2],\n",
    "    t[1, 2],\n",
    "    t[:2],\n",
    "    t[:, -3:],\n",
    "    t[1::2],\n",
    "    t[-1, :, None],\n",
    "    sep='\\n'\n",
    ")"
   ]
  },
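  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last task relies on indexing with `None`, which inserts a new axis of length 1. The sketch below uses a smaller, made-up tensor `small` to compare the shapes with and without the extra axis; it is only meant as an illustration.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "small = torch.arange(6).reshape(2, 3)\n",
    "\n",
    "row = small[-1]              # last row, shape (3,)\n",
    "column = small[-1, :, None]  # same entries, shape (3, 1)\n",
    "\n",
    "print(row.shape, column.shape)\n",
    "# Dropping the extra axis again recovers the original row.\n",
    "print(torch.equal(column[:, 0], row))\n",
    "```\n",
    "\n",
    "The entries are the same in both cases; only the shape differs, which matters later when tensors are combined or multiplied."
   ]
  },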
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preprocessing\n",
    "\n",
    "## Load the Dataset as Tensors\n",
    "\n",
    "In a real world machine learning scenario, when you receive a dataset, it may be in an arbitrarily convoluted format that you first have to convert to a collection of tensors. As you may be new to Python, we'll skip this and download the tensors directly.\n",
    "\n",
    "See the appendix for the conversion procedure. I highly recommend checking it out as soon as you have the energy as\n",
    "1. it may help get familiar with various Python concepts and\n",
    "2. data conversion is an essential element of the machine learning pipeline; in particular, during the course we'll transition toward performing data conversion ourselves.\n",
    "\n",
    "Please download `data/abalone.pt` from the course website or repository.\n",
    "Set the value of the\n",
    "`path` variable to the path you downloaded the file to.\n",
    "You need to give the path as a string,\n",
    "that is enclose it in single `'` or double `\"` quotes.\n",
    "\n",
    "We can use the `torch.load` function to load the converted dataset. Its first argument should be `path`. Then you can give it the `weights_only=True` keyword argument for the safety feature of it only letting to load numerical tensors.\n",
    "\n",
    "The function call output the converted dataset that was loaded. Use the `variable_name = value` pattern to assign the converted dataset to the variable `dataset`.\n",
    "\n",
    "`print` the result, to see if it looks all right."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"data/abalone.pt\"\n",
    "dataset = torch.load(path, weights_only=True)\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use the `for key, value in dataset.items():` pattern to iterate over the contents of the dictionary as `key, value` pairs. In each iteration, print:\n",
    "\n",
    "1. the key,\n",
    "2. the shape (for a tensor `t`, you can access this by `t.shape`), and\n",
    "3. the first 5 entries of the value tensor.\n",
    "\n",
    "One way to print multiple objects with one `print` statement is to give them as separate arguments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for key, value in dataset.items():\n",
    "    print(key, value.shape, value[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the values are 4177-dimensional vectors for the columns of the dataset. There are 2 special columns:\n",
    "\n",
    "1. `number_of_rings` tells the age of the abalone, that is this column gives the targets.\n",
    "2. `sex_id` is a categorical feature with values integer IDs of the categories. See the Appendix on how to get this from the original values of `\"F\"`, `\"I\"` and `\"M\"`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting the Feature Matrix\n",
    "\n",
    "### Stack the Numerical Columns\n",
    "\n",
    "The first step towards getting the feature matrix is to stack the 7 non-special columns to form a `(4177, 7)` matrix. We can do this with the function `torch.stack`. You can find its documentation here:  \n",
    "https://pytorch.org/docs/stable/generated/torch.stack.html\n",
    "\n",
    "1. Its first argument should be a sequence of tensors to stack. You can form such a list using the pattern\n",
    "    ```\n",
    "    [dataset[key] for key in [key1, key2, ...]]\n",
    "    ```\n",
    "    where `key1, key2, ...` are the names of the non-special columns.\n",
    "2. Its second argument is the index of the new dimension. As there are 7 non-special columns, this is the index of the 7 in the shape of our matrix-to-be.\n",
    "\n",
    "Assign the matrix you get to the variable `features` (we shall update it with the one-hot encoded categorial values and the column of 1's). Check that its shape is what it should be."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = torch.stack(\n",
    "    [\n",
    "        value\n",
    "        for key, value in dataset.items()\n",
    "        if key not in [\"number_of_rings\", \"sex_id\"]\n",
    "    ],\n",
    "    dim=1\n",
    ")\n",
    "print(features.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### One-Hot Encoding the Categorical Features\n",
    "\n",
    "The column `dataset[\"sex_id\"]` gives category indices from `0, 1, 2`. To one-hot encode this, we can index into a 3-dimensional identity matrix with this column as index vector.\n",
    "\n",
    "This is an instance of *advanced (fancy) indexing*. You can read about it here:  \n",
    "https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing\n",
    "\n",
    "First of all, you can get a `d`-dimensional identity matrix by `torch.eye(d)`. Check that if you index into it with a list of integers from `[0, ..., d - 1]`, it outputs a matrix that 1-hot encodes the list you gave it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.eye(3)[[0, 2, 0, 1, 1, 2]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now index into the identity matrix with the category ID column vector to get the one-hot encoding. Assign it to a variable. Check its shape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sex_id_one_hot = torch.eye(3)[dataset[\"sex_id\"]]\n",
    "print(sex_id_one_hot.shape)"
   ]
  },
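  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why indexing into the identity matrix gives a one-hot encoding, note that row `k` of the identity matrix has a single 1 at position `k`. The sketch below uses a made-up index vector `ids` and, purely as a cross-check, compares the result with `torch.nn.functional.one_hot`, which is not otherwise used in this lab.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "import torch.nn.functional as F\n",
    "\n",
    "ids = torch.tensor([0, 2, 1])  # hypothetical category IDs\n",
    "\n",
    "# Row k of the identity matrix is the one-hot vector of category k.\n",
    "one_hot_eye = torch.eye(3)[ids]\n",
    "\n",
    "# The built-in helper returns integers, so convert before comparing.\n",
    "one_hot_builtin = F.one_hot(ids, num_classes=3).to(torch.float32)\n",
    "\n",
    "print(one_hot_eye)\n",
    "print(torch.equal(one_hot_eye, one_hot_builtin))\n",
    "```\n",
    "\n",
    "The identity-matrix version directly yields a floating point tensor, which is convenient here because the other feature columns are floating point as well."
   ]
  },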
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To append the one-hot encoding to the feature matrix, you can use the function `torch.concatenate` (or `torch.cat`!).\n",
    "\n",
    "1. In the first argument, you need to give the sequence of tensors to concatenate. In our case, we want to concatenate our feature-matrix-to-be with the one-hot encoding.\n",
    "\n",
    "2. In the second, optional argument, you can give the dimension index at which concatenation should occur. In our case, we want to get, from a `(4177, 7)` matrix and a `(4177, 3)` matrix, a `(4177, 10)` matrix. The default value of this optional argument is `0`, meaning that concatenation occurs at the last index.\n",
    "\n",
    "Check the shape of your result. If correct, assign it to the variable `features`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features_extended = torch.cat(\n",
    "    [features, sex_id_one_hot],\n",
    "    dim=1\n",
    ")\n",
    "print(features_extended.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = features_extended"
   ]
  },
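  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of the dimension argument of `torch.cat` is easiest to see on shapes. The following sketch uses small made-up tensors `a`, `b` and `c`; it is only an illustration of the two directions of concatenation.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "a = torch.zeros(4, 2)\n",
    "b = torch.ones(4, 3)\n",
    "c = torch.ones(1, 2)\n",
    "\n",
    "# dim=1: the tensors are placed side by side, the row counts must match.\n",
    "wide = torch.cat([a, b], dim=1)\n",
    "\n",
    "# Default dim=0: the tensors are stacked on top of each other,\n",
    "# the column counts must match.\n",
    "tall = torch.cat([a, c])\n",
    "\n",
    "print(wide.shape, tall.shape)\n",
    "```\n",
    "\n",
    "Calling `torch.cat([a, b])` without `dim=1` would raise an error, because `a` and `b` have different numbers of columns."
   ]
  },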
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting the Target Vector\n",
    "\n",
    "Finally, we need to get the target column vector from the dataset. Presently, it is an integer vector. You can verify this by printing the `dtype` attribute of the vector (for a tensor `t`, you can get its `dtype` attribute as `t.dtype`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dataset[\"number_of_rings\"].dtype)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can change the datatype of a tensor by calling its `to` method. This method outputs a tensor with the same content, but converted to the datatype that is given in the method argument. Convert the target vector to the datatype `torch.float32` and assign it to the `targets` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "targets = dataset[\"number_of_rings\"].to(torch.float32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting the Dataset\n",
    "\n",
    "To split the dataset, we will use a method that will also come in handy when implementing Stochastic Gradient Descent (SGD), the most widely used optimization method in Deep Learning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need a random permutation of the row indices of our dataset. You can use `torch.randperm` for this. Print the resulting permuted index vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "indices = torch.randperm(len(features))\n",
    "print(indices)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we need to know how many train split entries we should get. This is where `config[\"train_size\"]` comes in. Multiplying it with the number of rows gives us a floating point number, which we can take the floor of using the `int` function. Then the number of test entries can be the number of remaining entries in the full dataset.\n",
    "\n",
    "Get the number of test and train entries, and print them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_size = int(len(features) * config[\"train_size\"])\n",
    "test_size = len(features) - train_size\n",
    "print(test_size, train_size)"
   ]
  },
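  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The arithmetic behind the split sizes can be checked by hand. The sketch below repeats it in plain Python with the dataset size 4177 and the fraction 0.85 from `config`; the variable names are chosen for this illustration only.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "n_rows = 4177\n",
    "train_fraction = 0.85\n",
    "\n",
    "n_train = int(n_rows * train_fraction)  # 3550.45 is truncated to 3550\n",
    "n_test = n_rows - n_train\n",
    "\n",
    "print(n_train, n_test)  # 3550 627\n",
    "\n",
    "# int() truncates toward zero: it agrees with math.floor()\n",
    "# for positive numbers but not for negative ones.\n",
    "print(int(2.7), math.floor(2.7))    # 2 2\n",
    "print(int(-2.7), math.floor(-2.7))  # -2 -3\n",
    "```\n",
    "\n",
    "Computing the test size as the remainder, rather than from a second fraction, guarantees that the two sizes add up to the full dataset."
   ]
  },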
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first `train_size` entries of the index vector can be the indices of the train split and the rest the test split. Thus, you can get the train and test index vectors by slicing the permuted index vector. Get these index vectors and print their shapes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_indices = indices[train_size:]\n",
    "train_indices = indices[:train_size]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, to get `features_train`, `targets_train`, `features_test`, and `targets_test`, we can index into the `features` and `targets` tensors with the respective index vectors. Get these tensors and print their shapes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(features_train, targets_train), (features_test, targets_test) = (\n",
    "    (features[split_indices], targets[split_indices])\n",
    "    for split_indices in (train_indices, test_indices)\n",
    ")\n",
    "\n",
    "for t in (features_train, targets_train, features_test, targets_test):\n",
    "    print(t.shape)"
   ]
  },
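  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The whole splitting procedure can be tried out on a tiny example where the result is easy to inspect. The sketch below uses a made-up vector `x` of 10 entries and a made-up fraction of 0.8, and checks that the two splits together contain every entry exactly once.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "torch.manual_seed(0)\n",
    "\n",
    "n = 10\n",
    "train_fraction = 0.8\n",
    "x = torch.arange(n) * 10  # toy data: 0, 10, ..., 90\n",
    "\n",
    "perm = torch.randperm(n)           # random order of the indices 0..9\n",
    "n_train = int(n * train_fraction)  # 8\n",
    "train_idx, test_idx = perm[:n_train], perm[n_train:]\n",
    "\n",
    "x_train, x_test = x[train_idx], x[test_idx]\n",
    "\n",
    "# Sorting the union of the two splits gives back the original data.\n",
    "recombined = torch.sort(torch.cat([x_train, x_test])).values\n",
    "print(len(x_train), len(x_test), torch.equal(recombined, x))\n",
    "```\n",
    "\n",
    "Because both splits are slices of one permutation, they cannot overlap, which is the property we need from a train/test split."
   ]
  },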
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I recommend saving a dictionary with values the split and preprocessed feature matrices and target vectors for easy retrieval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.save(\n",
    "    {\n",
    "        \"features_train\": features_train,\n",
    "        \"targets_train\": targets_train,\n",
    "        \"features_test\": features_test,\n",
    "        \"targets_test\": targets_test\n",
    "    },\n",
    "    \"data/abalone_preprocessed.pt\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear Regression\n",
    "\n",
    "We're ready to regress!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Checking the Rank of the Feature Matrix\n",
    "\n",
    "As we discussed, least squares works if the columns of the feature matrix\n",
    "are independent. As the feature matrix has more rows than columns,\n",
    "this is equivalent to the rank being equal to the number of columns.\n",
    "\n",
    "You can calculate the rank of a matrix with `torch.linalg.matrix_rank`.\n",
    "Do this for the train feature matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(torch.linalg.matrix_rank(features_train))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solving Least Squares\n",
    "\n",
    "Time to solve the least squares problem!\n",
    "You can use the built-in solver: `torch.linalg.lstsq`\n",
    "The call `torch.linalg.lstsq(A, b)` gets the least squares solution\n",
    "of the system `Ax=b`. It returns a quadruple\n",
    "the first entry of which is the solution `x`.\n",
    "\n",
    "Note that in our case, the solution is the weight vector `w`.\n",
    "\n",
    "Using this function, get the weight vector.\n",
    "Check that it has the correct number of dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights = torch.linalg.lstsq(\n",
    "    features_train,\n",
    "    targets_train\n",
    ")[0]\n",
    "print(weights.shape)"
   ]
  },
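  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build some trust in the solver, it can help to run it on a tiny system where the answer is known in advance. In the sketch below, the matrix `A` and the vector `w_true` are made up, and the right-hand side is built so that `w_true` fits it exactly. The right-hand side is passed as a column matrix of shape `(3, 1)`, which is the shape described in the `torch.linalg.lstsq` documentation.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "A = torch.tensor([[1., 0.],\n",
    "                  [0., 1.],\n",
    "                  [1., 1.]])\n",
    "w_true = torch.tensor([2., -1.])\n",
    "b = A @ w_true  # exact right-hand side, no noise\n",
    "\n",
    "# The first entry of the returned tuple is also available as .solution\n",
    "solution = torch.linalg.lstsq(A, b[:, None]).solution[:, 0]\n",
    "\n",
    "print(solution)\n",
    "print(torch.allclose(solution, w_true, atol=1e-5))\n",
    "```\n",
    "\n",
    "As the system has an exact solution and the columns of `A` are linearly independent, the least squares solution should coincide with `w_true` up to floating point error."
   ]
  },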
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the Predictions\n",
    "\n",
    "With the weight vector in hand, we can now calculate the predicted values.\n",
    "The matrix product operator between tensors is `@`.\n",
    "If one of the operands is a vector,\n",
    "then this performs matrix-vector product.\n",
    "\n",
    "Then to evaluate the predictions, you can proceed as follows:\n",
    "\n",
    "1. The subtraction operation between tensors of the same shape\n",
    "performs elementwise subtraction.\n",
    "2. The power operation `t ** a` where `t` is a tensor\n",
    "and a is a number performs elementwise power operations by `a`\n",
    "on the entries of `t`.\n",
    "3. The function `torch.mean` outputs the mean of the values of a tensor.\n",
    "\n",
    "Using this, print the train MSE and the test MSE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predict_train = features_train @ weights\n",
    "train_mse = torch.mean((targets_train - predict_train) ** 2)\n",
    "\n",
    "predict_test = features_test @ weights\n",
    "test_mse = torch.mean((targets_test - predict_test) ** 2)\n",
    "\n",
    "print(\n",
    "    \"train MSE\", train_mse,\n",
    "    \"test MSE\", test_mse\n",
    ")"
   ]
  },
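  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three operations combine into the mean squared error in a single line. As a sanity check, the sketch below applies them to made-up values where the result can be computed by hand: the errors are 0, 0 and -2, so the squared errors are 0, 0 and 4, and their mean is 4/3.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "targets = torch.tensor([1., 2., 3.])\n",
    "predictions = torch.tensor([1., 2., 5.])\n",
    "\n",
    "errors = targets - predictions  # elementwise subtraction\n",
    "squared_errors = errors ** 2    # elementwise power\n",
    "mse = torch.mean(squared_errors)\n",
    "\n",
    "print(errors, squared_errors, mse)\n",
    "```\n",
    "\n",
    "Note that a single large error dominates the result, since the errors are squared before averaging."
   ]
  },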
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A Picture is Better Than Thousand Numbers\n",
    "\n",
    "OK, so we see that the test loss is not really greater than the train loss,\n",
    "so we don't have a generalization problem.\n",
    "\n",
    "But how great is this result, really? To have a grasp on that,\n",
    "it would be great to know how much do the values usually vary.\n",
    "If all values were between 0.01 and 0.02, then missing a value by\n",
    "more than 2 on average would be super terrible.\n",
    "On the other hand, if the values varied between 1000 and 2000,\n",
    "then the result would be super terrific.\n",
    "Also, does the model miss a few predictions vastly\n",
    "or it misses more predictions somewhat?\n",
    "\n",
    "There are a lot of statistical aggregation functions that could\n",
    "answer these questions. But it makes the evaluation more tractable\n",
    "if we plot the true values and the predictions.\n",
    "\n",
    "We'll use `matplotlib` for this, via the imported module `plt`.\n",
    "1. Unless told otherwise, which we'll get to in a later session,\n",
    "each plotting command draws on a canvas.\n",
    "2. Then `plt.show()` outputs what is on the canvas.\n",
    "3. Finally, you can clear the canvas with `plt.close()`.\n",
    "\n",
    "The first plotting command we'll use is `plt.plot`. If given a single\n",
    "positional argument a 1-dimensional array `a`,\n",
    "it plots the graph of the piecewise linear function that on the interval\n",
    "(i-1, i) linearly changes value from `a[i-1]` to `a[i]`.\n",
    "This is called a *line plot*.\n",
    "\n",
    "This is the main page of the `matplotlib` documentation:\n",
    "https://matplotlib.org/stable/api/index\n",
    "\n",
    "To get a feel for `plt.plot`, run this function for a small list of numbers,\n",
    "show the canvas, then clear it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot([1, -3, 5])\n",
    "plt.show()\n",
    "plt.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now plot the train targets as a piecewise linear function,\n",
    "show the canvas, then clear it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(targets_train)\n",
    "plt.show()\n",
    "plt.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, that looks pretty much like noise\n",
    "and we didn't even add the predicted targets.\n",
    "\n",
    "To make the picture clearer, let's reorder the entries in the train dataset\n",
    "so that the targets are sorted in increasing order.\n",
    "\n",
    "As we'll want to use the same ordering for the predicted targets,\n",
    "we want to reorder the dataset indices. You can achieve this\n",
    "with the function `torch.argsort`. If applied to a tensor `t`\n",
    "this outputs a tensor of indices `[i_0,...,i_d]` so that\n",
    "`t[i_0],...,t[i_d]` is the sequence of targets of `t`\n",
    "in increasing order.\n",
    "\n",
    "To get a feel for this:\n",
    "1. Get a small tensor by calling `torch.tensor` on a small list of numbers.\n",
    "2. Print what you get if you feed `torch.argsort` this small tensor.\n",
    "3. Print what you get if you index your small tensor\n",
    "with the output of `argsort`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t = torch.tensor([3, -2, 0, 4, 8])\n",
    "i = torch.argsort(t)\n",
    "print(i)\n",
    "print(t[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Assign the output of `torch.argsort` when applied to the train values\n",
    "to a variable.\n",
    "2. Then index the train targets with the output of the argsort.\n",
    "3. Print out what you get."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_argsort = torch.argsort(targets_train)\n",
    "print(targets_train[train_argsort])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If all went well, then what you see are the train targets\n",
    "sorted in increasing order. You can reorder the predicted targets\n",
    "with the same index tensor to line up the true targets with the\n",
    "corresponding predicted targets.\n",
    "\n",
    "1. Apply `plt.plot` to the reordered train targets.\n",
    "1. Apply `plt.plot` to the predicted targets indexed by the same indices.\n",
    "Show the canvas and clear it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(targets_train[train_argsort])\n",
    "plt.plot(predict_train[train_argsort])\n",
    "plt.show()\n",
    "plt.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this, we can have a clearer picture of how good predictions we have.\n",
    "Note that as the predicted values are not sorted,\n",
    "their line plot is using up a lot of pixels, in particular it covers\n",
    "most of the true value plot.\n",
    "\n",
    "Thus, it is better to use a *scatter plot* `plt.scatter`\n",
    "for the predicted values.\n",
    "Given a sequence of x-values and a sequence of y-values,\n",
    "that only plots the vertices one by one\n",
    "and does not draw lines between them.\n",
    "\n",
    "So, our y-values will be the predicted train values indexed by the argsort.\n",
    "For x-values, we need the sequence 0, ..., (dataset size - 1)\n",
    "You can get the latter by `torch.arange` that with an argument `n`\n",
    "outputs the vector [0, ..., n-1]\n",
    "\n",
    "1. Again, make a line plot of the sorted true train values.\n",
    "2. Make a scatter plot of the predicted train values with the argsort.\n",
    "3. Show the canvas and clear it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(targets_train[train_argsort])\n",
    "plt.scatter(\n",
    "    torch.arange(len(train_argsort)),\n",
    "    predict_train[train_argsort]\n",
    ")\n",
    "plt.show()\n",
    "plt.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, those vertices in the scatter plot look pretty big.\n",
    "You can adjust their size with the `s` keyword argument in `plt.scatter`.\n",
    "Try 1.\n",
    "\n",
    "Also, note that although successive `plt.plot` calls varied the color\n",
    "automatically, this did not happen for `plt.scatter`.\n",
    "You can adjust the color of `plt.scatter` with the `c` keyword argument.\n",
    "You can give basic colors as strings. For example, you can try \"red\"\n",
    "\n",
    "Redo the plot with these changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(targets_train[train_argsort])\n",
    "plt.scatter(\n",
    "    torch.arange(len(train_argsort)),\n",
    "    predict_train[train_argsort],\n",
    "    c=\"red\",\n",
    "    s=1\n",
    ")\n",
    "plt.show()\n",
    "plt.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now is the time to make a similar plot for the true and predicted\n",
    "test values. But better to write a function, eh?\n",
    "You can make the two arguments the predicted values and true values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_regression_values(\n",
    "    values_predict: torch.Tensor,\n",
    "    values_true: torch.Tensor\n",
    "):\n",
    "    \"\"\"\n",
    "    Given predicted and true values for a regression task\n",
    "    with 1-dimensional value space,\n",
    "    makes a line plot of the true values sorted in increasing order\n",
    "    and a scatter plot of the corresponding predicted values.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    values_predict : torch.Tensor\n",
    "        A 1-dimensional tensor of predicted values.\n",
    "    values_true : torch.Tensor\n",
    "        A 1-dimensional tensor of true values.\n",
    "    \"\"\"\n",
    "    values_argsort = torch.argsort(values_true)\n",
    "    plt.plot(values_true[values_argsort])\n",
    "    plt.scatter(\n",
    "        torch.arange(len(values_argsort)),\n",
    "        values_predict[values_argsort],\n",
    "        c=\"red\",\n",
    "        s=1\n",
    "    )\n",
    "    plt.show()\n",
    "    plt.close()\n",
    "\n",
    "plot_regression_values(predict_test, targets_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Appendix: Data Conversion\n",
    "\n",
    "## Imports\n",
    "\n",
    "Let's import the `datasets` library:  \n",
    "https://huggingface.co/docs/datasets/index  \n",
    "This lets us access the dataset repository of Hugging Face:\n",
    "https://huggingface.co/datasets\n",
    "\n",
    "Hugging Face is one of the biggest open source collections\n",
    "of deep learning models, datasets and learning algorithms.\n",
    "Their primary focus is on Natural Language Processing (NLP)\n",
    "but they have plenty of resources pertaining to other directions\n",
    "of Machine Learning such as image generation or reinforcement learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading and Inspecting the Dataset\n",
    "\n",
    "Let's load the abalone dataset!\n",
    "We'll load the following incarnation from Hugging Face datasets:\n",
    "https://huggingface.co/datasets/mstz/abalone\n",
    "Inspecting this webpage, note the following three things:\n",
    "1. The ID of the dataset is \"mstz/abalone\"\n",
    "2. As per the Subset rubric, there are two subsets of this dataset:\n",
    "- \"abalone\", the original. This is the one we want\n",
    "- \"binary\", where the target is not the age as a number,\n",
    "  but the binary information whether the age is more than 9.\n",
    "3. As per the Split rubric, there is a single split, called \"train\"\n",
    "We'll want to load the \"train\" split of the \"abalone\" subset.\n",
    "\n",
    "To load the dataset, we'll use the function `datasets.load_dataset`:\n",
    "https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/loading_methods#datasets.load_dataset\n",
    "\n",
    "As we want to load the \"train\" split of the \"abalone\" subset,\n",
    "let's call the function with the following arguments:\n",
    "- `path=\"mstz/abalone\"`\n",
    "- `name=\"abalone\"`\n",
    "- `split=\"train\"`\n",
    "\n",
    "so that it returns the dataset\n",
    "and assign its return value to the variable `dataset`.\n",
    "\n",
    "`print` the `dataset`.\n",
    "This will give you some information on the object.\n",
    "What can you see?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = datasets.load_dataset(\n",
    "    name=\"abalone\",\n",
    "    path=\"mstz/abalone\",\n",
    "    split=\"train\"\n",
    ")\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Datasets are indexable by integers.\n",
    "If you write `dataset[i]` where `i` is an integer,\n",
    "then you get the `i`-th data entry.\n",
    "\n",
    "`print` this for some integer `i`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dataset[3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also slice a dataset.\n",
    "That is:\n",
    "1. If you write `dataset[:i]` where `i` is an integer,\n",
    "  you get the first `i` rows.\n",
    "2. If you write `dataset[i:]` where `i` is an integer,\n",
    "  you get all rows from the `i`-th onward (inclusive, counting from 0).\n",
    "3. If you write `dataset[i:j]` where `i` and `j` are integers,\n",
    "  you get the rows from the `i`-th (inclusive) to the `j`-th (exclusive).\n",
    "4. In each of these cases, a negative integer indexes from the right.\n",
    "  That is, `-1` indexes the last entry, `-2` indexes the second to last...\n",
    "  \n",
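    "As a rough illustration of these index rules, here is a sketch on a plain Python list.\n",
    "The list `rows` is made up for this example. Note that slicing a `datasets.Dataset`\n",
    "returns a dictionary of columns rather than a list of rows,\n",
    "but the start and stop indices are interpreted in the same way.\n",
    "```python\n",
    "rows = ['r0', 'r1', 'r2', 'r3', 'r4', 'r5']\n",
    "\n",
    "first_two = rows[:2]     # the first 2 entries\n",
    "from_fourth = rows[4:]   # index 4 (inclusive) to the end\n",
    "middle = rows[1:3]       # index 1 (inclusive) to index 3 (exclusive)\n",
    "last_two = rows[-2:]     # negative indices count from the right\n",
    "\n",
    "print(first_two, from_fourth, middle, last_two)\n",
    "```\n",
    "\n",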
    "Print out the rows from the third to the sixth (both inclusive, counting from 0), that is, indices 3 through 6."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dataset[3:7])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Datasets are also indexable by strings that are column names.\n",
    "You saw the column names listed under `features`\n",
    "when you called `print(dataset)`.\n",
    "\n",
    "If you write `dataset[s]` where `s` is a column name,\n",
    "then you get the `list` of the values of that column.\n",
    "\n",
    "`print` this for some column name `s`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dataset[\"sex\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making All Columns Numerical\n",
    "\n",
    "To be able to perform least squares,\n",
    "we need the dataset to contain only numerical data.\n",
    "Is this the case?\n",
    "If you're unsure, check out `dataset.info.features`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dataset.info.features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like `\"sex\"` is a string-valued feature.\n",
    "To see its possible values, you can use `set`:\n",
    "https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset\n",
    "This will construct a `set` of the values of the iterable it gets.\n",
    "\n",
    "*iterable* means that you can iterate over the object,\n",
    "for example with a `for` loop.\n",
    "In our case, `dataset[\"sex\"]` is the list of the values of the `\"sex\"` column.\n",
    "In particular, this is an iterable: if you do `for value in dataset[\"sex\"]`\n",
    "you get the values in the column row by row.\n",
    "\n",
    "So if you feed `set` this iterable,\n",
    "it will get you the set of unique values.\n",
    "Save this set to a variable, then print it out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sex_unique = set(dataset[\"sex\"])\n",
    "print(sex_unique)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we want to assign each of these unique values an integer ID.\n",
    "To that end, you can do the following:\n",
    "\n",
    "1. Given an iterable `c` such as the set of unique values,\n",
    "if you iterate over `enumerate(c)`, you'll get pairs `i, e`\n",
    "where `e` is an element of `c` and `i` is its index in the iteration order.\n",
    "Thus, you can get these values for example with `for i, e in enumerate(c)`.\n",
    "See here for more info:\n",
    "https://exercism.org/tracks/python/concepts/unpacking-and-multiple-assignment\n",
    "\n",
    "2. Now if you loop over the unique values like this,\n",
    "you can fill a dictionary with key `e` and value `i`.\n",
    "You can use this dictionary to convert a unique value to an index.\n",
    "\n",
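    "A minimal sketch of these two steps on a made-up list `colors` (not the abalone data).\n",
    "The call to `sorted` is an addition of this sketch: it only makes the order, and hence the IDs, reproducible.\n",
    "```python\n",
    "colors = ['red', 'green', 'blue', 'green']\n",
    "unique_colors = sorted(set(colors))\n",
    "\n",
    "color2id = {}\n",
    "for i, e in enumerate(unique_colors):\n",
    "    color2id[e] = i  # key: unique value, value: its index\n",
    "\n",
    "print(color2id)\n",
    "```\n",
    "\n",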
    "Create this dictionary and print it out to see if everything's all right."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
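    "# Sort the values so that the IDs are reproducible across runs;\n",
    "# the iteration order of a set of strings is not.\n",
    "sex2id = {s: i for i, s in enumerate(sorted(sex_unique))}\n",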
    "print(sex2id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write a function of signature\n",
    "```\n",
    "add_sex_id(row: dict) -> dict\n",
    "```\n",
    "that takes a dataset row, and adds the integer ID\n",
    "of the value at key \"sex\" as the value of the new key \"sex_id\".\n",
    "By signature, we mean a description of\n",
    "how many positional and keyword arguments the function should have.\n",
    "In this case, the function should have one positional argument.\n",
    "The so-called type hints `row: dict` and `-> dict` mean\n",
    "that the one argument of the function is expected to be a dictionary\n",
    "and the function returns a dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def add_sex_id(row: dict) -> dict:\n",
    "    \"\"\"\n",
    "    Given a dictionary representing a dataset entry,\n",
    "    adds a new value with key \"sex_id\" that is the integer ID\n",
    "    of the value at key \"sex\".\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    row : dict\n",
    "        A dictionary representing a dataset entry.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    dict\n",
    "        The updated dataset entry.\n",
    "    \"\"\"\n",
    "    row[\"sex_id\"] = sex2id[row[\"sex\"]]\n",
    "\n",
    "    return row"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can create a new dataset by applying `add_sex_id` to each row.\n",
    "This can be achieved with the method `datasets.Dataset.map`.\n",
    "\n",
    "*method*: the object `dataset` that the function `datasets.load_dataset`\n",
    "returned is of *type (class)* `datasets.Dataset`.\n",
    "\n",
    "Every object in Python has a type. You can think of a type as a blueprint\n",
    "that comes with a collection of data\n",
    "and special functions that you can apply to the object.\n",
    "\n",
    "1. The data entries are called *attributes* or *properties*.\n",
    "For example, `datasets.Dataset.info` is an attribute of type\n",
    "`datasets.DatasetInfo` which in turn has the attribute `features`\n",
    "that gives a dictionary of columns and their data types.\n",
    "This is what I asked you to print out by `print(dataset.info.features)`.\n",
    "\n",
    "2. The special functions are called *methods*.\n",
    "They are like functions with the object as the first positional argument.\n",
    "But they are adapted to the object they are a method of.\n",
    "That is, methods of the same name on objects of different types\n",
    "may do different things, depending on what makes sense for the given object.\n",
    "\n",
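    "To make the two notions concrete, here is a small sketch with a made-up class `Rectangle`.\n",
    "It has nothing to do with `datasets`; all names in it are illustrative.\n",
    "```python\n",
    "class Rectangle:\n",
    "    def __init__(self, width: float, height: float):\n",
    "        # attributes: data stored on the object\n",
    "        self.width = width\n",
    "        self.height = height\n",
    "\n",
    "    def area(self) -> float:\n",
    "        # method: a function whose first argument is the object itself\n",
    "        return self.width * self.height\n",
    "\n",
    "\n",
    "rectangle = Rectangle(2.0, 3.0)\n",
    "print(rectangle.width)            # attribute access\n",
    "print(rectangle.area())           # method call\n",
    "print(Rectangle.area(rectangle))  # the same call, written as a plain function\n",
    "```\n",
    "\n",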
    "See e.g. here for more info:  \n",
    "https://exercism.org/tracks/python/concepts/classes\n",
    "\n",
    "So, back to using the method `dataset.map`.\n",
    "You can set `add_sex_id` as its first positional argument.\n",
    "Please also set the `remove_columns` keyword argument\n",
    "so that you erase the column `\"sex\"` as we'll not need it anymore.\n",
    "You can assign the output of `dataset.map` to `dataset`\n",
    "as we'll not need the original dataset anymore.\n",
    "Perform this operation and check that\n",
    "you have the new column \"sex_id\" and you don't have the column \"sex\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = dataset.map(\n",
    "    add_sex_id,\n",
    "    remove_columns=\"sex\"\n",
    ")\n",
    "print(dataset)\n",
    "print(dataset[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conversion to Tensors\n",
    "\n",
    "We can use `Dataset.with_format` to return a dataset object\n",
    "where the contents of a column are returned as a `torch.Tensor`\n",
    "instead of a list. To this end, you need to set the first positional argument to `\"torch\"`.\n",
    "\n",
    "Perform this operation on the dataset\n",
    "and print a row of the dataset you get.\n",
    "Check that the entries are now tensors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = dataset.with_format(\"torch\")\n",
    "print(dataset[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's save this dataset of tensors as a plain dictionary that maps feature names to feature tensors (here, the target is also viewed as a feature). To this end, we use the `torch.save` function. \n",
    "\n",
    "1. The first argument is the object to save. Let's convert the dataset with `torch` format to a dictionary. To get such a dictionary, you can use the full slice `[:]` on the dataset.\n",
    "\n",
    "2. The second argument is the path to the file the object should be saved to. I recommend saving data files to a `data` directory; make sure this directory exists, as `torch.save` does not create it. Moreover, objects saved by the `torch.save` function have by convention a `pt` extension."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.save(\n",
    "    dataset[:],\n",
    "    \"data/abalone.pt\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Datasets\n",
    "\n",
    "## Abalone\n",
    "\n",
    "https://archive.ics.uci.edu/dataset/1/abalone  \n",
    "This dataset is licensed under a [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/legalcode) license."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# License\n",
    "\n",
    "This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dml",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
