{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Frozen Lake with Q-Learning\n",
    "\n",
    "In this lab, we will solve 8x8 Frozen Lake with Q-Learning. For starters, we will use a constant learning rate. In Homework 5, we will implement the learning rate schedule as described in [1].\n",
    "\n",
    "## Setup\n",
    "\n",
    "### Imports\n",
    "\n",
    "Import our friends `gym`, `Image`, `mpl`, `plt`, `os`, `torch` and `tqdm`.\n",
    "\n",
    "Moreover, import the functions `line_plot_confidence_band`, `get_seed` and `run_episode` from Notebooks 0221, 0228 and 0307, respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Constants\n",
    "\n",
    "Create a configuration dictionary with the following keys:\n",
    "- `\"discount\"`: `float`  \n",
    "    This is the constant we use to get discounted returns. Make it the default `0.99`.\n",
    "- `\"env_id\"`: `str`  \n",
    "    In the unified API of `gym`, every MDP, or *environment* has a unique identifier. We'll play Frozen Lake. You can see on its page:  \n",
    "    https://gymnasium.farama.org/environments/toy_text/frozen_lake/  \n",
    "    that its identifier is `\"FrozenLake-v1\"`.\n",
    "- `\"env_kwargs\"`: `dict`  \n",
    "    This dictionary stores extra settings in the environment. Set it to the following:\n",
    "    - `\"is_slippery\"`: `bool`  \n",
    "        This makes the transitions stochastic:  \n",
    "        https://gymnasium.farama.org/environments/toy_text/frozen_lake/#is_slippy  \n",
    "        Set this to `False` for now.\n",
    "    - `\"map_name\"`: `str`  \n",
    "        If given, a preloaded map will be used:  \n",
    "        https://gymnasium.farama.org/environments/toy_text/frozen_lake/#arguments  \n",
    "        Let's dare set this to `\"8x8\"`!\n",
    "- `\"env_num_eval\"`: `int`  \n",
    "    The number of environments to run in parallel for evaluation. Set this to 16.\n",
    "- `\"env_num_train\"`: `int`  \n",
    "    The number of environments to run in parallel for training. We'll use this to try out various learning rates at once. Set this to 13.\n",
    "- `\"eval_interval\"`: `int`  \n",
    "    The frequency of evaluations, measured in train steps. Set this to `1000`.\n",
    "- `\"gif_fps\"`: `int`  \n",
    "    The frames per second (FPS) to use when making gameplay gifs. I set this to `20`. Change it at will.\n",
    "- `\"improvement_threshold\"`: `float`  \n",
    "    In evaluation, we should get a result at least this much better than the previous best to count as an improvement, for numerical stability. Set this to `1e-4`.\n",
    "- `\"learning_rate\"`: `int | torch.Tensor`  \n",
    "    Either a constant learning rate to use in all training environments or a different one for each. Opt for the latter case, with values $10^i$ for $i=-5,-4.5,\\dotsc,0.5,1$.\n",
    "- `\"q_init\"`: `float`  \n",
    "    The initial values of the $Q$-matrix. Since in an entire episode of Frozen Lake, there is a single reward of 1 if you reach the chest, you will be very optimistic if you set this to 1.\n",
    "- `\"seed\"`: `int`  \n",
    "    This is for reproducible experiments. Insert any integer.\n",
    "- `\"steps_num\"`: `int`  \n",
    "    The number of train steps to take during training. As RL training is very unstable, we'll revert to this from early stopping. Set it to `10_001`.\n",
    "- `\"videos_directory\"`: `str`  \n",
    "    The path to the directory to store videos at. I set this to `videos`. Change it at will.\n",
    "\n",
    "Note that we don't include a `device` key: this is because the environments run on CPU anyway and $Q$-learning is such a simple algorithm that it is not worth copying values between RAM and GPU."
   ]
  },
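  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the `\"learning_rate\"` grid, here is a minimal sketch, assuming `torch.logspace` is an acceptable way to build it (the name `learning_rates` is only illustrative). The exponents $-5,-4.5,\\dotsc,1$ give exactly 13 values, one per training environment.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "# 13 exponents from -5 to 1 in steps of 0.5, i.e. the values 10**i.\n",
    "learning_rates = torch.logspace(-5, 1, steps=13)\n",
    "print(learning_rates.shape)  # torch.Size([13])\n",
    "print(learning_rates[0], learning_rates[-1])\n",
    "```"
   ]
  },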
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the seed of `torch` pseudo-random number generation to the value given in the configuration dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Meet a `gym.vector.VectorEnv`\n",
    "\n",
    "To speed up training and evaluation, we will run multiple environments in parallel. An interface for this is `gym.vector.VectorEnv`. Such an environment takes actions and returns observations, rewards, truncations and terminations in batches as arrays. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `make_vec` a `VectorEnv`\n",
    "\n",
    "Just like the function `gym.make` create a single environment, the function `gym.make_vec` creates a vectorized environment. The extra keyword argument `num_envs` determines the number of environments to run in parallel.\n",
    "\n",
    "Create a vectorized environment of `env_num_train` instances of Frozen Lake with settings as given in the configuration dictionary. Then print its observation and action spaces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the observation and action spaces hold `env_num_train` copies of the observation and action spaces of a single environment.\n",
    "\n",
    "This time, these are all discrete. You can get the array of sizes at the `nvec` attribute.\n",
    "\n",
    "Moreover, you can get the observation and action spaces of a single environment as the attributes `single_observation_space` and `single_action_space`.\n",
    "\n",
    "Print the arrays of sizes and the single environment spaces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reset the environment and print the output you get."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that, indeed, you get an array of observations for each instance. Moreover, the metadata dictionary is also vectorized."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `np.ndarray` and `torch.Tensor`\n",
    "\n",
    "`gym` natively uses `numpy` as an array interface. If you know what these are, feel free to do whatever I say about `torch.Tensor`s with them. If not: `numpy`: https://numpy.org/ is the standard library to use for array manipulations on CPU. It does not offer GPU support or automatic differentiation, so for our purposes, you can view it as a precursor to `torch`. It is customary to import `numpy` as `np`. Its arrays have type `np.ndarray`.\n",
    "\n",
    "You can convert a scalar, sequence of scalars, `np.ndarray` or `torch.Tensor` to a `torch.Tensor` using the function `torch.asarray`. This function has `device` and `dtype` keyword arguments to set storage and datatype. You can use it to convert the arrays a `VectorEnv` outputs to `torch.Tensor`s.\n",
    "\n",
    "To convert a `torch.Tensor` to a `np.ndarray`, you can use its `numpy` method. To make sure that the tensor is first moved to RAM, you can call its `cpu` method before this. You can use this to convert a tensor of actions to the format required by `VectorEnv.step`.\n",
    "\n",
    "Reset your `VectorEnv` again and assign the observation array to a variable. Convert it to a `torch.Tensor` and print that out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "observations = env.reset(seed=get_seed())[0]\n",
    "observations = torch.asarray(observations)\n",
    "print(observations)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make a batch of random steps! You could get an array of random actions using `VectorEnv.action_space.sample`, but for practice we will use `torch` for this:\n",
    "\n",
    "You can use `torch.randint` to get a tensor of integers:\n",
    "1. If you give a single positional argument, that will be the exclusive upper bound, with the inclusive lower bound being the default `0`.\n",
    "2. In the `size` keyword argument, you can give the shape of the tensor to generate.\n",
    "\n",
    "Generate such a random action tensor with upper bound the size of the action space and shape a single dimension of length the number of training environments. Print it out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transform the actions tensor to a `np.ndarray` and feed it to the `step` method of the `VectorEnv`. Print the output of that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing Q-Learning\n",
    "\n",
    "\n",
    "### Initialization\n",
    "\n",
    "Let's first initialize a $Q$-matrix. We will make a batch of them, so that for each learning rate we can train another one in parallel.\n",
    "\n",
    "You may note that we make this matrix have constant value `q_init`, although in the Theorem, it is required that all values pertaining to terminal states are 0. We will get around this issue by zeroing out the appropriate values via the termination boolean arrays. This way, we don't have to know beforehand the observations of which indices are terminal, so the method can be readily applied to any environment.\n",
    "\n",
    "Write the function below, get its output and print its shape and datatype."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_init_q_matrix(\n",
    "    config: dict,\n",
    "    env: gym.vector.VectorEnv\n",
    ") -> torch.Tensor:\n",
    "    \"\"\"\n",
    "    Initializes a batch of Q-matrices with a constant value.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    config : `dict`\n",
    "        Configuration dictionary. Required key:\n",
    "        `q_init` : `float`\n",
    "            The value to initialize the batch of Q-matrices with.\n",
    "    env : `gym.vector.VectorEnv`\n",
    "        The environment we intend to train the Q-matrix to solve.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    A `torch.Tensor` of constant value `config[\"q_init\"]` and shape\n",
    "    `(num_envs, observation_space_n, action_space_n)`\n",
    "    where:\n",
    "    1. `num_envs` is the number of environments in `env`,\n",
    "    2. `observation_space_n` is the size\n",
    "        of a single observation space in `env` and\n",
    "    3. `action_space_n` is the size\n",
    "        of a single action space in `env`.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Greedy and $\\epsilon$-Greedy Policies\n",
    "\n",
    "Let's write the policies! There are two cases to prepare for:\n",
    "1. In evaluation, we will evaluate every one of the `env_num_train` $Q$-matrices on `env_num_eval` parallel environments, to get better estimates of discounted returns and with confidence intervals. Thus, given:\n",
    "    1. `env_num_eval` observations and\n",
    "    2. a single $Q$-matrix,\n",
    "\n",
    "    we need to get `env_num_eval` actions as per the greedy policy.\n",
    "2. In training, given:\n",
    "    1. `env_num_train` observations and\n",
    "    2. `env_num_train` $Q$-matrices,\n",
    "\n",
    "    we need to get `env_num_train` actions as per the $\\epsilon$-greedy policy.\n",
    "\n",
    "To accommodate each situation, we will write a function that expects:\n",
    "1. a vector of observations,\n",
    "2. a 2d or 3d tensor of $Q$-values and\n",
    "3. optionally, an $\\epsilon$ value, with default 0.\n",
    "\n",
    "Then to output the action vector, you can proceed as follows:\n",
    "1. Apply the `torch.asarray` function to the observation vector to make sure it is converted to a `torch.Tensor`.\n",
    "2. Get the number of environments from the observation vector.\n",
    "3. Get\n",
    "    1. the observation space size and\n",
    "    2. the action space size\n",
    "\n",
    "    as the last two entries of the shape of the tensor of $Q$-values.\n",
    "4. Expand the tensor of $Q$-values to 3d using its `broadcast_to` method. After this point, you can expect it to have shape `(num_envs, observation_space_n, action_space_n)`.\n",
    "5. Create an empty tensor like the observation tensor, to hold the actions.\n",
    "6. Create a mask for the actions to be generated using the greedy policy:\n",
    "    1. Get `num_envs` samples from the uniform distribution on the unit interval with `torch.rand`.\n",
    "    2. The mask is the Boolean tensor of positions in the above where the value is larger than $\\epsilon$.\n",
    "7. To use the greedy policy where the mask values are `True`:\n",
    "    1. For the selected $Q$-matrices, we need to get the vectors of action values at the selected observations. We can use [advanced indexing](https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing) for this: index into the the tensor of $Q$-values with the greedy mask and the observations at the greedy mask.\n",
    "    2. Now for each greedy entry, we have a vector of action values. So we can take `argmax` to get, for each action value vector, the index of its maximum.\n",
    "8. To use the random policy where the mask values are `False`:\n",
    "    1. First of all, to negate a Boolean array, you can use the negation operator `~`.\n",
    "    2. Then fill the non-greedy entries of the action vector by the output of `torch.randint`:\n",
    "        1. Use as exclusive upper bound the action space size.\n",
    "        2. Use as shape the 1-tuple of the number of non-greedy entries. You can get the number of `True` values in a Boolean tensor via the method `sum`.\n",
    "9. Return the action vector.\n",
    "\n",
    "Write the function, then get action vectors for a full `0` observation vector and either\n",
    "1. the first $Q$-matrix in the initial batch, with the default $\\epsilon$ value of $0$ and\n",
    "2. the full batch of $Q$-matrices with $\\epsilon=0.5$.\n",
    "\n",
    "Are you getting action vectors you would expect?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def argmax_q_policy(\n",
    "    observations: torch.Tensor,\n",
    "    q_values: torch.Tensor,\n",
    "    epsilon=0.\n",
    ") -> torch.Tensor:\n",
    "    \"\"\"\n",
    "    Gets actions according to the epsilon-greedy policy.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    observations : `torch.Tensor`\n",
    "        An observation vector of size `num_envs`.\n",
    "    q_values : `torch.Tensor`\n",
    "        A tensor of Q-values of shape either\n",
    "        `(observation_space_n, action_space_n)` or\n",
    "        `(num_envs, observation_space_n, action_space_n)`.\n",
    "    epsilon : `float`, optional\n",
    "        The epsilon value for the epsilon-greedy policy. Default: 0.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    A vector of actions of size `num_envs`.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation\n",
    "\n",
    "Let's write a function that evaluates a $Q$-matrix!\n",
    "\n",
    "You can proceed as follows:\n",
    "1. Initialize a `VectorEnv` with `env_num_eval` instances.\n",
    "2. Reset the environment (don't forget to `get_seed`) and get the observation vector.\n",
    "3. Create a Boolean vector to keep track of which episode is *ongoing*, that is has not reached a terminal observation yet.\n",
    "    1. It should have `env_num_eval` entries.\n",
    "    2. At start, all its entries should be `True`.\n",
    "4. Create a floating-point vector to accumulate discounted returns in.\n",
    "    1. This should also have `env_num_eval` entries.\n",
    "    2. At start, all its entries should be `0.`.\n",
    "    3. Its datatype should be `float_dtype`.\n",
    "5. Create a step counter, initialize it at 0.\n",
    "6. Loop while any of the episodes is ongoing:\n",
    "    1. You can check if any of the entries of a Boolean tensor is `True` by the `any` method.\n",
    "    2. Get a greedy policy action vector using `argmax_q_policy`.\n",
    "    3. Take an environment step with the action vector (transformed to a `np.ndarray`).\n",
    "    4. Add to the entries of the return tensor where the ongoing vector is `True` the product of:\n",
    "        1. `discount` raised to the power of the step counter and\n",
    "        2. the reward vector (transformed to a floating-point `torch.Tensor`) masked by the ongoing vector .\n",
    "    5. Update the ongoing vector:\n",
    "        1. You can use the in-place bitwise and `&=` operator.\n",
    "        2. To get the mask of the episodes that have not just terminated, you can negate the terminated vector (transformed to a Boolean `torch.Tensor`)\n",
    "    6. Increment the step counter.\n",
    "7. Close the environment.\n",
    "8. Return the vector of discounted returns.\n",
    "\n",
    "Write the function, then apply it to an initial $Q$-matrix. Is the vector of discounted returns you get what you expect?"
   ]
  },
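  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In symbols, the quantity accumulated in step 6.4 for a single evaluation episode is its discounted return. The notation here is my own shorthand, not something the lab fixes: $\\gamma$ is `discount`, $r_t$ is the reward received when the step counter equals $t$, and $T$ is the number of steps the episode takes before it terminates.\n",
    "\n",
    "$$G = \\sum_{t=0}^{T-1} \\gamma^{t}\\, r_t$$\n",
    "\n",
    "Masking by the ongoing vector is what cuts the sum off at $T$ for each episode separately."
   ]
  },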
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_q_values(\n",
    "    config: dict,\n",
    "    q_values: torch.Tensor\n",
    ") -> torch.Tensor:\n",
    "    \"\"\"\n",
    "    Evaluates a Q-matrix on a vectorized environment.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    config : `dict`\n",
    "        Configuration dictionary. Required keys:\n",
    "        discount : `float`\n",
    "            Discount to use when calculating the discounted return.\n",
    "        env_id : `str`\n",
    "            The identifier of the environment.\n",
    "        env_kwargs : `dict`\n",
    "            Extra keyword arguments of the environment.\n",
    "        env_num_eval : `int`\n",
    "            Number of evaluation environments.\n",
    "    q_values : torch.Tensor\n",
    "        The Q-matrix to evaluate.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    The vector of discounted returns.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training\n",
    "\n",
    "Time to write the training loop!\n",
    "\n",
    "You can proceed as follows:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Initialization\n",
    "\n",
    "1. Initialize a `VectorEnv` with `env_num_train` instances.\n",
    "2. Create a vector `env_arange` of values `0,...,env_num_train-1`. We'll need this when getting the V-values in the Q-update with advanced indexing.\n",
    "3. Reset the environment (don't forget to `get_seed`) and get the vector of observations. Let's call this the vector of *current* observations.\n",
    "5. Create a progress bar with `tqdm.tqdm`:\n",
    "    1. As positional argument, give it a tensor with values `1, 1-1/steps_num, ..., 1/steps_num` to serve as $\\epsilon$ schedule.\n",
    "    2. As `total` keyword argument, give it `steps_num`.\n",
    "6. Initialize a batch of $Q$-matrices with `get_init_q_matrix`.\n",
    "7. Initialize an evaluation counter at `0`."
   ]
  },
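  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible way to build the $\\epsilon$ schedule, shown as a minimal sketch with a toy step count (the names `steps_num` and `epsilons` are illustrative, and the lab reads the real step count from the configuration dictionary). It assumes `torch.linspace` is used; any construction that yields the same values works equally well.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "steps_num = 5  # toy value, only for illustration\n",
    "\n",
    "# steps_num evenly spaced values from 1 down to 1/steps_num,\n",
    "# so consecutive entries differ by 1/steps_num.\n",
    "epsilons = torch.linspace(1, 1 / steps_num, steps_num)\n",
    "print(epsilons)  # approximately 1.0, 0.8, 0.6, 0.4, 0.2\n",
    "```"
   ]
  },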
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Initialization of output dictionary\n",
    "\n",
    "Let's fill the output dictionary with tensor of initial values:\n",
    "1. First, get the number `eval_num` of evaluations. As we will want to evaluate the initial $Q$-values also, this should be $\\lceil\\frac{\\text{steps\\_num}}{\\text{eval\\_interval}}\\rceil$.\n",
    "2. At key `best_avg_return`, we should record the best average discounted returns per $Q$-matrix. Thus, this should be a vector of constant value $-\\infty$ and size `env_num_train`.\n",
    "3. At key `best_q_values`, we should store for each training environment the best $Q$-matrix. Thus, this should be a `clone` of the initial $Q$-values.\n",
    "4. At key `eval_returns`, we should record the discounted return vectors for each $Q$-matrix at each evaluation. Thus, this should be an empty floating-point tensor of shape `(eval_num, env_num_train, env_num_eval)`.\n",
    "4. At key `eval_steps`, we should record the number of training steps when the evaluations took place. Thus, this should be an empty vector of size `eval_num`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training Loop\n",
    "\n",
    "Iterate over `enumerate` of the progress bar: \n",
    "1. This gets you pairs of a step ID and an $\\epsilon$ value.\n",
    "2. If the step ID is divisible by `eval_interval`:\n",
    "    1. Loop over pairs of an environment ID and a $Q$-matrix. You can do this via applying `enumerate` to the batch of $Q$-matrices.\n",
    "        1. Get a vector of discounted returns using `evaluate_q_values`.\n",
    "        2. Update the appropriate slice of the `eval_returns` tensor in the output dictionary.\n",
    "        3. If the mean of the vector of discounted returns is larger than the appropriate previous best average discounted return plus the improvement threshold:\n",
    "            1. Update the appropriate entry of the `best_avg_return` vector in the output dictionary.\n",
    "            2. Update the appropriate slice of the `best_q_values` tensor in the output dictionary.\n",
    "    2. Update the appropriate entry in the `eval_steps` vector of the output dictionary.\n",
    "3. Get an action vector with the $\\epsilon$ greedy policies via `argmax_q_policy`.\n",
    "4. Take a step in the environment with the action vector. Let's call the observation vector you get the vector of *next* observations.\n",
    "5. Take a $Q$-update:\n",
    "    1. The *target* is the sum of:\n",
    "        1. the reward vector and\n",
    "        2. the product of:    \n",
    "            1. the `discount`,   \n",
    "            2. the Boolean vector of episodes that haven't just ended and    \n",
    "            3. the $V$-values according to the next observations:    \n",
    "                1. First, you want to get, for each next observation, the vector of action values. You can get these by advanced indexing into the $Q$-value tensor by `env_arange` and the vector of next observations.    \n",
    "                2. Then, you can take the `max` of this matrix along the last dimension. Note that in `torch`, the `max` method with the `dim` keyword argument specified actually returns a pair of maximum values and argmax values, so you need to take the first entry.\n",
    "    2. The *error* is the difference of:\n",
    "        1. the target and\n",
    "        2. the vector of $Q$-values at each current observation and the corresponding actions taken. You can get this by advanced indexing into the tensor of $Q$-values.\n",
    "    3. Now you can update the vector of $Q$-values at each current observation and the corresponding actions taken by adding to it learning rate times the error.\n",
    "    4. Finally let the current observations be the next observations.\n",
    "6. Close the environment and the progress bar.\n",
    "7. Return the output dictionary."
   ]
  },
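  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Written out for a single training environment, the $Q$-update in step 5 reads as follows. The symbols are my own shorthand and are not fixed by the lab: $s$ and $a$ are the current observation and the action taken, $r$ is the reward, $s'$ is the next observation, $d\\in\\{0,1\\}$ is $1$ exactly when the episode has just ended, $\\gamma$ is `discount` and $\\alpha$ is the learning rate of that environment.\n",
    "\n",
    "$$y = r + \\gamma\\,(1-d)\\,\\max_{a'} Q(s',a'), \\qquad \\delta = y - Q(s,a), \\qquad Q(s,a) \\leftarrow Q(s,a) + \\alpha\\,\\delta$$\n",
    "\n",
    "Here $y$ is the target and $\\delta$ is the error. The factor $(1-d)$ is what zeroes out the values of terminal observations, as promised in the initialization section."
   ]
  },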
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write the training loop and run it. Print the vector of best average discounted returns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def q_learning(\n",
    "    config: dict\n",
    ") -> dict:\n",
    "    \"\"\"\n",
    "    Q-learning training loop on a vectorized environment\n",
    "    with optionally different learning rates.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    config : `dict`\n",
    "        Configuration dictionary. Required values:\n",
    "        discount : `float`\n",
    "            Discount to use when calculating the discounted return.\n",
    "        env_id : `str`\n",
    "            The identifier of the environment.\n",
    "        eval_interval: `int`\n",
    "            The frequency of evaluations,\n",
    "            measured in train steps. Set this to 1000.\n",
    "        env_kwargs : `dict`\n",
    "            Extra keyword arguments of the environment.\n",
    "        env_num_eval : `int`\n",
    "            Number of evaluation environments.\n",
    "        env_num_train : `int`\n",
    "            Number of training environments.\n",
    "        improvement_threshold: `float`\n",
    "            In evaluation, we should get a result\n",
    "            at least this much better than the previous best\n",
    "            to count as an improvement, for numerical stability.\n",
    "        learning_rate: `int | torch.Tensor`\n",
    "            Either a constant learning rate\n",
    "            to use in all training environments\n",
    "            or a different one for each.\n",
    "        steps_num : `int`\n",
    "            Number of training steps.\n",
    "\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Results\n",
    "\n",
    "### Training Curves\n",
    "\n",
    "Looks like for a number of learning rates, the process did find a good $Q$-matrix! Let's plot some learning curves:\n",
    "1. Get `env_num_train` colors from a colormap, just like in Notebook 0221.\n",
    "2. Looping over triples of:\n",
    "    1. a color from the colormap\n",
    "    2. a learning rate and\n",
    "    3. a training environment index\n",
    "\n",
    "    make a `line_plot_confidence_band` with:\n",
    "    1. x-values given by `eval_steps` and\n",
    "    2. y-values gives by the appropriate slice of `eval_returns`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vid or It Didn't Happen!\n",
    "\n",
    "Let's make a video of a good policy!\n",
    "1. Select the index of one of the best average discounted returns and get the correponding best $Q$-matrix.\n",
    "2. Make a single environment with `render_mode=\"rgb_array\"`, just like in Notebook 0307.\n",
    "3. Write a single environment policy function that takes an observation integer and outputs an action integer as per the greedy policy following the $Q$-matrix.\n",
    "4. Use `run_episode` to generate a video.\n",
    "5. Print the episode return and show the video."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### For the Adventurous\n",
    "\n",
    "1. Select the learning rate the best $Q$-matrix was gotten with,\n",
    "2. crank up the number of steps and the evaluation interval and\n",
    "3. see if you can get through the 8x8 slippery lake!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References\n",
    "\n",
    "[1] Christopher J.C.H. Watkins and Peter Dayan: *$Q$-Learning*, 1992. Machine Learning, Volume 8, pp. 279-292. https://link.springer.com/article/10.1007/BF00992698"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dataset References\n",
    "\n",
    "Frozen Lake https://gymnasium.farama.org/environments/toy_text/frozen_lake/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# License\n",
    "\n",
    "This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dml",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}