Homework 6

Enhance the function q_learning you implemented in Notebook 0312 with the following learning rate schedule. As in [1], for a state $s\in\mathscr S$, an action $a\in\mathscr A(s)$ and a count $i\ge1$, let $n^i(s, a)$ denote the index of the step at which action $a$ was chosen in state $s$ for the $i$-th time. As the learning rate at that step, we choose $$ \alpha_{n^i(s,a)}:=\frac{\alpha}{i^\tau}, $$ where the initial learning rate $\alpha$ and the temperature $\tau$ are hyperparameters. Note that the summability conditions of [1, Theorem] are satisfied if and only if $0.5<\tau\le1$.
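
To see where the range for $\tau$ comes from, assume $\alpha>0$ and that the conditions in question are the usual ones: along each state-action pair, the learning rates must sum to infinity while their squares must have a finite sum. By the $p$-series test,
$$ \sum_{i=1}^\infty\frac{\alpha}{i^\tau}=\infty\iff\tau\le1, \qquad \sum_{i=1}^\infty\frac{\alpha^2}{i^{2\tau}}<\infty\iff2\tau>1, $$
so both hold exactly when $0.5<\tau\le1$. For example, $\tau=0$ gives a constant learning rate, whose squares are not summable, while $\tau=2$ gives learning rates whose sum is finite.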

Implementation hint: when action $a$ is taken in state $s$, the $Q$-update should multiply the initial learning rate $\alpha$ by $i^{-\tau}$, where $i$ is the number of times $a$ has been chosen in state $s$ so far, including the current step. To keep track of this, you can create a 3D tensor of state-action visitation counts, with one count matrix per training environment.
1. This should be a 3D tensor of the same shape as the tensor of $Q$-values. You can initialize it with zeros.
2. At each training step, before you apply the $Q$-update, you can get the vector of counts for the current observations and the actions taken via the same advanced indexing that you used to get the vector of $Q$-values for the current observations and the actions taken.
3. First, increment the entries of the count vector by 1.
4. Then you can use it to adjust the learning rate in the $Q$-update.
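
The four steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the reference solution: the helper `q_update`, its argument names, the tensor sizes and the handling of terminal states are all hypothetical and may differ from your Notebook 0312 implementation.

```python
import torch


def q_update(q_values, counts, obs, actions, rewards, next_obs, terminated,
             alpha, tau, discount):
    """Apply one in-place Q-update with the count-based learning rate schedule."""
    env_idx = torch.arange(q_values.shape[0])
    # Increment first: a zero count raised to a negative power would be infinite.
    counts[env_idx, obs, actions] += 1
    lr = alpha * counts[env_idx, obs, actions] ** (-tau)
    # Standard Q-learning target, without bootstrapping from terminal states.
    next_max = q_values[env_idx, next_obs].max(dim=-1).values
    target = rewards + discount * next_max * (~terminated)
    q_values[env_idx, obs, actions] += lr * (target - q_values[env_idx, obs, actions])


# Hypothetical sizes and values, for illustration only.
q_values = torch.zeros(3, 16, 4)        # (training env, observation, action)
counts = torch.zeros_like(q_values)     # same shape, initialized with zeros
alpha = torch.tensor([0.1, 0.5, 1.0])   # one initial learning rate per training env
obs = torch.tensor([0, 0, 0])
actions = torch.tensor([1, 1, 1])
q_update(q_values, counts, obs, actions, rewards=torch.ones(3),
         next_obs=torch.tensor([2, 2, 2]),
         terminated=torch.tensor([False, False, True]),
         alpha=alpha, tau=0.75, discount=0.99)
```

Because each training environment appears exactly once in `env_idx`, the indexed in-place additions touch distinct entries, so no count or $Q$-value is updated twice in one step.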
To help you implement the enhanced function, I'm including the updated signature and docstring at the end of the assignment.
Take the configuration dictionary of Notebook 0312. Change the map size to "4x4" to shorten training times, but make the map slippery so that transitions are stochastic.
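
As a sketch of this change, assuming the environment is Gymnasium's FrozenLake-v1 (whose constructor accepts the keyword arguments `map_name` and `is_slippery`), the update could look like the following. The starting dictionary is a placeholder, not the actual configuration of Notebook 0312.

```python
# Placeholder standing in for the configuration dictionary of Notebook 0312;
# only the keys touched here are shown, and their old values are made up.
config = {
    "env_id": "FrozenLake-v1",
    "env_kwargs": {"map_name": "8x8", "is_slippery": False},
}

# Copy the dictionary, overriding only the map size and the slipperiness.
config = {
    **config,
    "env_kwargs": {**config["env_kwargs"], "map_name": "4x4", "is_slippery": True},
}
```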
Run a grid search with the following settings, trying the different initial learning rates in parallel.
1. Learning rate schedule temperature $0, 0.5, \dotsc, 2$.
2. Initial $Q$-value $0, 0.25,\dotsc,1$.
3. Initial learning rate $10^{-2},10^{-1.5},\dotsc,10^1$.
During the grid search, collect the best average return values for each triple of hyperparameters.
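
One possible way to organize the search is sketched below. It assumes that `q_learning` accepts one learning rate per training environment, as described in the docstring at the end of the assignment. The helper `run_grid_search` and the configuration key `"initial_q_value"` are hypothetical names; use whatever key Notebook 0312 uses for the initial $Q$-value.

```python
import torch

temperatures = torch.linspace(0, 2, steps=5)      # 0, 0.5, ..., 2
initial_q_values = torch.linspace(0, 1, steps=5)  # 0, 0.25, ..., 1
learning_rates = torch.logspace(-2, 1, steps=7)   # 10^-2, 10^-1.5, ..., 10^1


def run_grid_search(q_learning, base_config):
    """Collect best average returns, indexed by (temperature, initial Q-value, learning rate)."""
    results = torch.empty(len(temperatures), len(initial_q_values), len(learning_rates))
    for i, tau in enumerate(temperatures):
        for j, q0 in enumerate(initial_q_values):
            config = {
                **base_config,
                # One training environment per initial learning rate, trained in parallel.
                "env_num_train": len(learning_rates),
                "learning_rate": learning_rates,
                "learning_rate_schedule_temperature": tau.item(),
                # Hypothetical key name for the initial Q-value.
                "initial_q_value": q0.item(),
            }
            results[i, j] = q_learning(config)["best_avg_return"]
    return results
```

Calling `run_grid_search(q_learning, config)` then yields a tensor of shape `(5, 5, 7)` holding one best average return per triple of hyperparameters.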
Summarize your results in a heatmap, as in Notebook 0226. For each pair of learning rate schedule temperature and initial $Q$-value, the heatmap should show the best average return value achieved across all initial learning rates.
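
A minimal sketch of such a heatmap with matplotlib is shown below. It assumes the grid search results are stored in a 3D tensor indexed by (temperature, initial $Q$-value, initial learning rate). The helper `plot_best_returns` is a hypothetical name, and the plotting style of Notebook 0226 may differ.

```python
import matplotlib.pyplot as plt
import torch


def plot_best_returns(results, temperatures, initial_q_values):
    """Plot the best return over the learning rate axis (the last axis of `results`)."""
    best = results.max(dim=-1).values  # shape: (temperature, initial Q-value)
    fig, ax = plt.subplots()
    image = ax.imshow(best.numpy(), origin="lower", aspect="auto")
    ax.set_xticks(range(len(initial_q_values)))
    ax.set_xticklabels([f"{q:g}" for q in initial_q_values])
    ax.set_yticks(range(len(temperatures)))
    ax.set_yticklabels([f"{t:g}" for t in temperatures])
    ax.set_xlabel("initial Q-value")
    ax.set_ylabel("learning rate schedule temperature")
    fig.colorbar(image, ax=ax, label="best average return")
    return best, fig
```

Here `temperatures` and `initial_q_values` are plain lists of floats used as tick labels, for example `[0, 0.5, 1, 1.5, 2]` and `[0, 0.25, 0.5, 0.75, 1]`.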

def q_learning(
    config: dict
) -> dict:
    """
    Q-learning training loop on a vectorized environment
    with optionally different learning rates.

    Parameters
    ----------
    config : `dict`
        Configuration dictionary. Required values:
        discount : `float`
            Discount to use when calculating the discounted return.
        env_id : `str`
            The identifier of the environment.
        eval_interval: `int`
            The frequency of evaluations,
            measured in train steps. Set this to 1000.
        env_kwargs : `dict`
            Extra keyword arguments of the environment.
        env_num_eval : `int`
            Number of evaluation environments.
        env_num_train : `int`
            Number of training environments.
        improvement_threshold: `float`
            In evaluation, we should get a result
            at least this much better than the previous best
            to count as an improvement, for numerical stability.
        learning_rate: `float | torch.Tensor`
            Either a constant learning rate
            to use in all training environments
            or a different one for each.
        learning_rate_schedule_temperature: `float`
            During a Q-update, the learning rate is multiplied by
            the reciprocal of the number of times
            the given action was taken upon the given observation,
            raised to the power given by this value.
        steps_num : `int`
            Number of training steps.

    Returns
    -------
    A dictionary with the following key-value pairs:
        best_avg_return : `torch.Tensor`
            The best average discounted returns
            per training environment.
        best_q_values : `torch.Tensor`
            The best Q-matrices
            per training environment as a 3D tensor.
        eval_returns : `torch.Tensor`
            The tensor collecting the discounted returns for each
            1. evaluation,
            2. training environment and
            3. evaluation environment.
        eval_steps : `torch.Tensor`
            The vector collecting the number of training steps taken
            at each evaluation.
    """

References

[1] Christopher J.C.H. Watkins and Peter Dayan: $Q$-Learning, 1992. Machine Learning, Volume 8, pp. 279-292.