3/12 Bellman Operators and Q-Learning

Today, we'll learn some of the most basic RL algorithms: policy iteration, value iteration and $Q$-learning.

Throughout today's lecture, we'll work with finite MDPs, that is, MDPs where the state space $\mathscr S$ and the action spaces $\mathscr A(s)$ are finite. Solving such a task is often referred to as Tabular RL.

Action-Value Functions

Recall that given a policy $\pi$, last time we defined the value function under $\pi$: $$ v_\pi(s)=\mathbf E_\pi(G_t|S_t=s), $$ the expected discounted return of trajectories that start at $s\in\mathscr S$, when following policy $\pi$.

Similarly, we define the action-value function under $\pi$: $$ q_\pi(s, a)=\mathbf E_\pi(G_t|S_t=s,A_t=a), $$ the expected discounted return of trajectories that start at $s\in\mathscr S$ by taking action $a\in\mathscr A(s)$, when following $\pi$ afterwards.

We denote by $v_*$ and $q_*$ the respective functions under an optimal policy.

Reward and Transition Distributions Known

First, we shall consider solution methods that require knowledge of the entire MDP: transition probabilities and rewards. We make the following further restrictions:

The reward functions are deterministic, that is, given a state-action pair $(s, a)\in\mathscr S\times\mathscr A$, we get a fixed reward $r(s, a)\in\mathbf R$.
We use a discount factor $\gamma<1$.

Given these restrictions, let $|\mathscr S|=N$ and $|\mathscr A|=M$. Moreover, for $(s,a,s')\in\mathscr S\times\mathscr A\times\mathscr S$, let $P(s,a,s')$ denote the probability that taking action $a$ in state $s$ leads to state $s'$.

Expected Bellman Operator

Suppose we are given a policy $\mathscr S\xrightarrow\pi\mathscr P(\mathscr A)$ and an estimated value function $\mathscr S\xrightarrow{\tilde v_\pi}\mathbf R$. As we are dealing with a finite MDP, these can be respectively represented by

a matrix $\pi\in\mathbf R^{N\times M}:\pi_{s, a}=\mathbf P(A=a|S=s)$ and
a vector $\tilde v_\pi\in\mathbf R^N$.

Moreover, we can use

an expected reward vector $r(\pi)\in\mathbf R^N: r(\pi)_s=\sum_{a\in\mathscr A}r(s,a)\pi_{s, a}$ and
a transition matrix $P(\pi)\in\mathbf R^{N\times N}: P(\pi)_{s, s'}=\sum_{a\in\mathscr A}\pi_{s,a}P(s,a,s')$¹.

The expected Bellman operator improves the estimates $\tilde v_\pi$ by replacing them with the expected reward at the next step plus the discounted estimated value afterwards:

\[ T^\pi\tilde v_\pi=r(\pi) + \gamma P(\pi)\cdot\tilde v_\pi. \]
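As a minimal NumPy sketch of one application of $T^\pi$, assume the MDP is stored as arrays `P` of shape $(N, M, N)$ and `r` of shape $(N, M)$, and the policy as `pi` of shape $(N, M)$. The function name and the toy numbers below are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def expected_bellman_operator(P, r, pi, v, gamma):
    """Apply T^pi once: r(pi) + gamma * P(pi) @ v."""
    # r(pi)_s = sum_a pi[s, a] * r[s, a]
    r_pi = np.einsum("sa,sa->s", pi, r)
    # P(pi)_{s, s'} = sum_a pi[s, a] * P[s, a, s']
    P_pi = np.einsum("sa,sat->st", pi, P)
    return r_pi + gamma * P_pi @ v

# Hypothetical toy MDP with N = 2 states and M = 2 actions.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.full((2, 2), 0.5)          # uniformly random policy
v = np.zeros(2)                    # initial estimate of v_pi

# With v = 0, the result is just r(pi) = [0.5, 1.0].
v_next = expected_bellman_operator(P, r, pi, v, gamma=0.9)
```

The two `einsum` calls are the sums over actions that define $r(\pi)$ and $P(\pi)$ above; the rest is a matrix-vector product.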

Bellman Optimality Equation

One may wonder if this is indeed an improved value function estimate. If $\gamma<1$, then using the Banach Fixed-Point Theorem, we can show that [1, Theorem 3.5 and Corollary 3.2]:

1. This iterative process indeed converges to the actual value function: we have $\lim_n(T^\pi)^n\tilde v_\pi=v_\pi$.
2. For any vector $v\in\mathbf R^N$, we have $T^\pi v=v$ if and only if we have $v=v_\pi$.

Note that in a finite MDP, point 2 means that instead of iteratively applying $T^\pi$, it is enough to solve the system of linear equations

\[ T^\pi v=v. \]

This system is called the Bellman Optimality Equation (for a fixed policy $\pi$, it is also commonly known as the Bellman expectation equation).
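Written out, $T^\pi v=v$ is the linear system $(I-\gamma P(\pi))v=r(\pi)$. The following sketch solves it directly and, for comparison, also applies $T^\pi$ repeatedly. The array layout (`P` of shape $(N, M, N)$, `r` and `pi` of shape $(N, M)$), the function name and the toy MDP are assumptions made for illustration.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Solve (I - gamma * P(pi)) v = r(pi) for v = v_pi."""
    r_pi = np.einsum("sa,sa->s", pi, r)      # expected reward vector r(pi)
    P_pi = np.einsum("sa,sat->st", pi, P)    # transition matrix P(pi)
    N = r_pi.shape[0]
    return np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)

# Hypothetical toy MDP with N = 2 states and M = 2 actions.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.full((2, 2), 0.5)                    # uniformly random policy
gamma = 0.9

v_pi = evaluate_policy(P, r, pi, gamma)

# Point 1: iterating T^pi from an arbitrary start approaches the same vector.
r_pi = np.einsum("sa,sa->s", pi, r)
P_pi = np.einsum("sa,sat->st", pi, P)
v_iter = np.zeros(2)
for _ in range(300):
    v_iter = r_pi + gamma * P_pi @ v_iter
```

Since $\gamma<1$ and $P(\pi)$ is a stochastic matrix, $I-\gamma P(\pi)$ is invertible, so the direct solve is well defined.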

Policy Iteration

Given a policy $\pi\in\mathbf R^{N\times M}$, solving the Bellman Optimality Equation, we can get its value function $v_\pi$. Now we seek to improve the policy, guided by the value function. Note that given the value function, we can get the action-value function as

\[ q_\pi(s, a)=r(s, a) + \gamma \sum_{s'\in \mathscr S}P(s,a,s')v_\pi(s'). \]

It turns out that this can guide us to find an optimal policy [1, Theorem 3.11]:

Suppose that we have $q_\pi(s, a)\le v_\pi(s)$ for all $(s,a)\in\mathscr S\times\mathscr A$. Then the policy $\pi$ is optimal.
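Otherwise, let $\pi'(s)=\mathrm{argmax}_{a\in\mathscr A}q_\pi(s, a)$. Then we get $\pi'>\pi$, that is, $v_{\pi'}(s)\ge v_\pi(s)$ for all $s\in\mathscr S$, with strict inequality for at least one state.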

Alternating between evaluating the current policy and improving it in this way is the Policy Iteration algorithm. It finds an optimal deterministic policy in a finite number of steps.
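The loop of evaluation and improvement can be sketched as follows. This is one possible implementation under assumptions not made in the lecture: the MDP is stored as arrays `P` of shape $(N, M, N)$ and `r` of shape $(N, M)$, deterministic policies are stored as an integer array of action indices, and a small tolerance guards the optimality check against rounding.

```python
import numpy as np

def policy_iteration(P, r, gamma, tol=1e-12):
    """Return an optimal deterministic policy and its value function."""
    N, M = r.shape
    states = np.arange(N)
    policy = np.zeros(N, dtype=int)            # arbitrary starting policy
    while True:
        # Policy evaluation: solve (I - gamma * P(pi)) v = r(pi).
        P_pi = P[states, policy]               # shape (N, N)
        r_pi = r[states, policy]               # shape (N,)
        v = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # Action values q_pi(s, a) = r(s, a) + gamma * sum_s' P(s, a, s') v(s').
        q = r + gamma * P @ v                  # shape (N, M)
        # If no action beats v_pi anywhere, the current policy is optimal.
        if np.all(q.max(axis=1) <= v + tol):
            return policy, v
        # Policy improvement: act greedily with respect to q_pi.
        policy = q.argmax(axis=1)

# Hypothetical toy MDP with N = 2 states and M = 2 actions.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi_star, v_star = policy_iteration(P, r, gamma=0.9)
```

On this toy MDP the loop stops after two evaluations, with the policy that takes action 1 in state 0 and action 0 in state 1.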

Value Iteration

Note that in Policy Iteration, before each policy improvement step $\pi\leftarrow\pi'$, we solve the Bellman Optimality Equation $T^\pi v=v$. This may be costly for large finite MDPs. Thus, in Value Iteration, we make improvements to the policy and the value function estimate in tandem.

Given an optimal value function estimate $\tilde v_*\in\mathbf R^N$, we get an optimal action-value estimate

\[ \tilde q_*(s, a)=r(s, a)+\gamma \sum_{s'\in\mathscr S}P(s,a,s')\tilde v_*(s'). \]

This in turn gives us a new optimal value function estimate

\[ (T\tilde v_*)(s) = \max_{a\in\mathscr A}\tilde q_*(s, a). \]

The operator $T$ is called the Bellman Optimality Operator. The Banach Fixed-Point Theorem can be applied to this operator too to yield the following result [1, Theorem 3.6]:

For any vector $\tilde v_*\in\mathbf R^N$, we have $\lim_nT^n\tilde v_*=v_*$.
For any vector $\tilde v_*\in\mathbf R^N$, we have $T\tilde v_*=\tilde v_*$ if and only if we have $\tilde v_* = v_*$.

Thus, iteratively applying $T$ converges to the optimal value function. Based on this, we get the Value Iteration algorithm, which gives an approximate optimal value function and an approximately optimal policy [1, Theorem 3.24]:

1. Select a threshold $\epsilon>0$ and a starting approximate optimal value function $\tilde v_*\in\mathbf R^N$.
2. Repeat the update $\tilde v_*\leftarrow T\tilde v_*$ until we get $\|T\tilde v_*-\tilde v_*\|_\infty\le\frac{(1-\gamma)\epsilon}{2\gamma}$.
3. Letting $\pi(s)=\mathrm{argmax}_{a\in\mathscr A}\tilde q_*(s, a)$, we get ²:
   1. an $\epsilon$-optimal policy, that is $\|v_\pi-v_*\|_\infty\le\epsilon$ and
   2. an $\frac{\epsilon}{2}$-approximation of $v_*$, that is $\|\tilde v_*-v_*\|_\infty\le\frac{\epsilon}{2}$.
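The steps above can be sketched in NumPy as follows. The array layout (`P` of shape $(N, M, N)$, `r` of shape $(N, M)$), the zero starting vector and the toy MDP are assumptions. The sketch returns the last iterate $T\tilde v_*$ and the policy that is greedy with respect to it, which is one reading of the final step; it also assumes $0<\gamma<1$ so that the threshold is defined.

```python
import numpy as np

def value_iteration(P, r, gamma, eps):
    """Return an approximation of v_* and a greedy policy for it."""
    N, M = r.shape
    v = np.zeros(N)                                 # starting estimate
    threshold = (1 - gamma) * eps / (2 * gamma)
    while True:
        q = r + gamma * P @ v                       # estimated action values
        Tv = q.max(axis=1)                          # Bellman optimality operator
        if np.max(np.abs(Tv - v)) <= threshold:     # supremum-norm stopping rule
            q_final = r + gamma * P @ Tv
            return Tv, q_final.argmax(axis=1)
        v = Tv

# Hypothetical toy MDP with N = 2 states and M = 2 actions.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
eps = 1e-6
v_approx, greedy_policy = value_iteration(P, r, gamma=0.9, eps=eps)
```

Each pass costs one matrix-vector product per action and involves no linear solve, which is the trade-off against Policy Iteration mentioned above.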

Reward and Transition Distributions Unknown

In what follows, we'll study an algorithm that does not require knowledge of the reward and transition distributions. Note that this is a more realistic scenario: we can only base our model on actual experience.

Q-Learning

This is an iterative algorithm. Let $Q_n(s, a)$ denote the approximate action-value function at step $n$. Then we get the approximate value function

$$ V_n(s):=\max_{a\in\mathscr A(s)}Q_n(s, a). $$

We take infinitely many steps by restarting the MDP whenever we reach a terminal state. We also have learning rates $\alpha_n\in[0,1)$.

Suppose that at step $n$, in state $s_n\in\mathscr S$, we choose $a_n\in\mathscr A(s_n)$, get reward $r_n$ and arrive in state $s_{n+1}\in\mathscr S$. Then we let

\[ Q_{n+1}(s,a)=\begin{cases} (1-\alpha_n)Q_n(s_n,a_n) + \alpha_n(r_n + \gamma V_n(s_{n+1})) & \text{if }s=s_n\text{ and }a=a_n, \\ Q_n(s,a) & \text{otherwise.} \end{cases} \]
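A single update step can be sketched as below, reading the target as $r_n+\gamma V_n(s_{n+1})$ with $V_n(s_{n+1})=\max_a Q_n(s_{n+1},a)$, as in Watkins and Dayan [2]. The table layout (a NumPy array `Q` indexed by state and action, with the same number of actions in every state), the function name and the numbers are illustrative assumptions.

```python
import numpy as np

def q_learning_step(Q, s, a, reward, s_next, alpha, gamma):
    """Update the single entry Q[s, a] in place; all other entries are unchanged."""
    v_next = Q[s_next].max()                        # V_n(s_{n+1})
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (reward + gamma * v_next)
    return Q

# Hypothetical table: 3 states, 2 actions, state 2 terminal (its values stay 0).
Q = np.zeros((3, 2))
Q[1] = [0.0, 4.0]

# One observed transition: in state 0, action 1 gave reward 1 and led to state 1.
q_learning_step(Q, s=0, a=1, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
# Q[0, 1] is now 0.5 * 0 + 0.5 * (1 + 0.9 * 4) = 2.3
```

Because terminal states keep the value 0, the same formula also covers transitions into a terminal state: the target then reduces to $r_n$.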

Theorem [2] Suppose that the following conditions hold:

The initial values $Q_0(s,a)$ are arbitrary, except that $Q_0(s,a)=0$ for terminal states $s$.
The rewards are bounded: $|r_n|\le R$ for some $R\ge0$.
For $1\le i$, let $n^i(s,a)$ denote the index $n$ of the $i$-th time such that $s_n=s$ and $a_n=a$. Then for all $s\in\mathscr S$ and $a\in\mathscr A(s)$, we have $$ \sum_{i=1}^\infty\alpha_{n^i(s,a)}=\infty\text{ and } \sum_{i=1}^\infty\alpha_{n^i(s,a)}^2<\infty. $$

Then for all $s\in\mathscr S$ and $a\in\mathscr A(s)$, we have $$ Q_n(s,a)\to q_*(s,a)\text{ as }n\to\infty $$ with probability 1.

Q-Learning in Practice

So, we have a theorem that says if we follow Q-learning, then eventually, we get a nice policy. But how can we make this quicker?

Greedy Policy for Evaluation

When we want to evaluate our progress, that is, obtain discounted returns that are as large as possible, we choose as policy the deterministic policy that maximizes the approximate action values: $$ \pi_\text{greedy}(s)=\mathop{\mathrm{argmax}}_{a\in\mathscr A(s)}Q_n(s,a). $$

$\epsilon$-Greedy Policy for Training

Note that for training, the theorem requires us to use policies that don't rule out any action entirely. Still, we would like to visit more promising states more often. A crude way to balance the two is to use an $\epsilon$-greedy policy, the stochastic policy such that:

with probability $\epsilon$, we choose among the actions $\mathscr A(s)$ uniformly, and
otherwise, we use the greedy policy.

In other words, we have:

\[ \pi_\text{$\epsilon$-greedy}(a|s)=\begin{cases} 1-\epsilon + \frac{\epsilon}{|\mathscr A(s)|} & a=\pi_\text{greedy}(s) \\ \frac{\epsilon}{|\mathscr A(s)|} & \text{otherwise}. \end{cases} \]

It is common practice to use a hyperparameter schedule for $\epsilon$: we start with $\epsilon=1$, as at first we have no knowledge of the environment, and we decrease it as training progresses. In today's lab, we use a static schedule: with a fixed number of training steps, we linearly decrease $\epsilon$ from 1 to 0.
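The following sketch shows $\epsilon$-greedy action selection together with a linear schedule. It is not the lab's code: the function names, the use of `numpy.random.Generator`, and the assumption that every state has the same number of actions are choices made here for illustration.

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))    # explore: uniform over all actions
    return int(Q[s].argmax())                   # exploit: greedy action

def linear_epsilon(step, total_steps):
    """Decrease epsilon linearly from 1 (step 0) to 0 (step total_steps)."""
    return max(0.0, 1.0 - step / total_steps)

# Hypothetical usage with a small action-value table.
rng = np.random.default_rng(0)
Q = np.array([[0.0, 1.0],
              [2.0, 0.0]])
total_steps = 100
actions = [epsilon_greedy_action(Q, 0, linear_epsilon(n, total_steps), rng)
           for n in range(total_steps)]
```

The random branch may also return the greedy action, which is why the greedy action has probability $1-\epsilon+\frac{\epsilon}{|\mathscr A(s)|}$ in the formula above.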

Optimistic Initialization

Note that we are free to pick the values $Q_0(s, a)$ for non-terminal states $s\in\mathscr S$. The greedy and $\epsilon$-greedy policies favor actions with large values of $Q_n$, and the values of rarely tried actions receive fewer updates, so they stay close to their initial values. Hence, the larger the initial values, the more the agent will explore. Setting the initial values high is called optimistic initialization: the agent will think it is a good idea to try unexplored actions.
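A small numerical illustration of this effect, with made-up sizes and values (4 states, 2 actions, initial value 10, state 3 terminal): after action 0 has been tried once in state 0 and turned out worse than the optimistic guess, the greedy choice switches to the untried action 1.

```python
import numpy as np

n_states, n_actions = 4, 2
terminal_state = 3                  # hypothetical terminal state index
q_init = 10.0                       # assumed to exceed any achievable return

# Optimistic table: every non-terminal entry starts high, terminal entries at 0.
Q = np.full((n_states, n_actions), q_init)
Q[terminal_state] = 0.0

# One Q-learning update for (s=0, a=0): reward 0, next state is terminal.
alpha, gamma = 0.5, 0.9
target = 0.0 + gamma * Q[terminal_state].max()
Q[0, 0] = (1 - alpha) * Q[0, 0] + alpha * target    # 10 -> 5

# The untried action 1 still holds its optimistic value, so greedy now picks it.
greedy_action = int(Q[0].argmax())
```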

Validation in Reinforcement Learning

Note that in RL, there is no distinct validation set. Instead, a new dataset is generated in each episode, so we validate a policy by running a number of episodes. We'll see in the lab that this can be done in parallel to save time.
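As a sequential sketch of such a validation run (the lab's parallel version is not reproduced here), the function below simulates episodes of a tabular MDP and averages the discounted returns. The simulator, the array layout (`P` of shape $(N, M, N)$, `r` of shape $(N, M)$), the single terminal state, the step cap and the toy chain are all assumptions made for illustration.

```python
import numpy as np

def average_return(P, r, policy, start, terminal, gamma, n_episodes, max_steps, rng):
    """Mean discounted return of a deterministic policy over simulated episodes."""
    N = P.shape[0]
    returns = []
    for _ in range(n_episodes):
        s, G, discount = start, 0.0, 1.0
        for _ in range(max_steps):
            if s == terminal:
                break
            a = policy[s]
            G += discount * r[s, a]             # accumulate discounted reward
            discount *= gamma
            s = rng.choice(N, p=P[s, a])        # sample the next state
        returns.append(G)
    return float(np.mean(returns))

# Hypothetical chain 0 -> 1 -> 2 (terminal): action 0 moves forward with reward 1,
# action 1 stays in place with reward 0.
P = np.zeros((3, 2, 3))
P[0, 0, 1] = P[1, 0, 2] = 1.0
P[0, 1, 0] = P[1, 1, 1] = 1.0
P[2, :, 2] = 1.0
r = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 0.0]])
policy = np.array([0, 0, 0])                    # always move forward
rng = np.random.default_rng(0)
score = average_return(P, r, policy, start=0, terminal=2, gamma=0.9,
                       n_episodes=5, max_steps=50, rng=rng)
```

In this deterministic chain every episode yields the return $1+\gamma=1.9$; in a stochastic MDP the average over more episodes gives a less noisy estimate.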

References

[1] Lodewijk Kallenberg: Markov Decision Processes. https://pub.math.leidenuniv.nl/~kallenberglcm//Lecture-notes-MDP.pdf

[2] Christopher J.C.H. Watkins and Peter Dayan: $Q$-Learning, 1992. Machine Learning, Volume 8, pp. 279-292. https://link.springer.com/article/10.1007/BF00992698

¹ Viewing the transition map $\mathscr S\times\mathscr A\xrightarrow P\mathscr P(\mathscr S)$ as a 3-tensor $P\in\mathbf R^{N\times M\times N}$, the transition matrix $P(\pi)$ is the tensor dot product of $P$ and $\pi$ along the second dimension.
² For a vector $\mathbf x\in\mathbf R^N$, the notation $\|\mathbf x\|_\infty$ is for the supremum norm: $\|\mathbf x\|_\infty=\max_{1\le i\le N}|x_i|$.