3/12 Bellman Operators and Q-Learning
Today, we'll learn some of the most basic RL algorithms: policy iteration, value iteration and \(Q\)-learning.
Throughout today's lecture, we'll work with finite MDPs, that is, MDPs where the state space \(\mathscr S\) and the action spaces \(\mathscr A(s)\) are finite. Solving such a task is often referred to as Tabular RL.
Action-Value Functions
Recall that given a policy \(\pi\), last time we defined the value function under \(\pi\): $$ v_\pi(s)=\mathbf E_\pi(G_t|S_t=s), $$ the expected discounted return of trajectories that start at \(s\in\mathscr S\), when following policy \(\pi\).
Similarly, we define the action-value function under \(\pi\): $$ q_\pi(s, a)=\mathbf E_\pi(G_t|S_t=s,A_t=a), $$ the expected discounted return of trajectories that start at \(s\in\mathscr S\) by taking action \(a\in\mathscr A(s)\), when following \(\pi\) afterwards.
We denote by \(v_*\) and \(q_*\) the respective functions under an optimal policy.
Reward and Transition Distributions Known
First, we shall consider solution methods that require knowledge of the entire MDP: transition probabilities and rewards. We make the following further restrictions:
- The reward functions are deterministic, that is, given a state-action pair \((s, a)\in\mathscr S\times\mathscr A\), we get a fixed reward \(r(s, a)\in\mathbf R\).
- We use discount: \(\gamma<1\).
Given these restrictions, let \(|\mathscr S|=N\) and \(|\mathscr A|=M\). Moreover, for \((s,a,s')\in\mathscr S\times\mathscr A\times\mathscr S\), let \(P(s,a,s')\) denote the probability that, if we take action \(a\) in state \(s\), we move to state \(s'\).
Expected Bellman Operator
Suppose we are given a policy \(\mathscr S\xrightarrow\pi\mathscr P(\mathscr A)\) and an estimated value function \(\mathscr S\xrightarrow{\tilde v_\pi}\mathbf R\). As we are dealing with a finite MDP, these can be respectively represented by
- a matrix \(\pi\in\mathbf R^{N\times M}:\pi_{s, a}=\mathbf P(A=a|S=s)\) and
- a vector \(\tilde v_\pi\in\mathbf R^N\).
Moreover, we can use
- an expected reward vector \(r(\pi)\in\mathbf R^N: r(\pi)_s=\sum_{a\in\mathscr A}r(s,a)\pi_{s, a}\) and
- a transition matrix \(P(\pi)\in\mathbf R^{N\times N}: P(\pi)_{s, s'}=\sum_{a\in\mathscr A}\pi_{s,a}P(s,a,s')\) (see footnote 1).
The expected Bellman operator \(T^\pi\) improves the estimate \(\tilde v_\pi\) by replacing it with the expected reward at the next step plus the discounted estimated value afterwards: $$ T^\pi\tilde v_\pi=r(\pi)+\gamma P(\pi)\tilde v_\pi. $$
Bellman Optimality Equation
One may wonder if this is indeed an improved value function estimate. If \(\gamma<1\), then using the Banach Fixed-Point Theorem, we can show that [1, Theorem 3.5 and Corollary 3.2]:
- This iterative process indeed converges to the actual value function: we have \(\lim_n(T^\pi)^n\tilde v_\pi=v_\pi\).
- For any vector \(v\in\mathbf R^N\), we have \(T^\pi v=v\) if and only if we have \(v=v_\pi\).
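Note that in a finite MDP, the second point means that instead of iteratively applying \(T^\pi\), it is enough to solve the system of linear equations $$ v=r(\pi)+\gamma P(\pi)v, $$ or equivalently \((I-\gamma P(\pi))v=r(\pi)\).
This equation, \(T^\pi v=v\), is called the Bellman Optimality Equation (many texts call it the Bellman equation for \(\pi\) instead).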
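The fixed-point property can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP stored as nested lists (`r[s][a]` and `P[s][a][s']` are illustrative names, not part of the lecture). It applies \(T^\pi\) repeatedly to a zero vector for the uniform policy and stops once the iterates no longer change.

```python
def bellman_expectation_step(v, policy, r, P, gamma):
    """Apply T^pi once: (T^pi v)(s) = sum_a pi[s][a] * (r[s][a] + gamma * sum_t P[s][a][t] * v[t])."""
    states = range(len(v))
    return [
        sum(
            policy[s][a] * (r[s][a] + gamma * sum(P[s][a][t] * v[t] for t in states))
            for a in range(len(policy[s]))
        )
        for s in states
    ]


def evaluate_policy(policy, r, P, gamma, tol=1e-12, max_iter=10_000):
    """Iterate T^pi from the zero vector until successive iterates differ by less than tol."""
    v = [0.0] * len(r)
    for _ in range(max_iter):
        new_v = bellman_expectation_step(v, policy, r, P, gamma)
        done = max(abs(x - y) for x, y in zip(new_v, v)) < tol
        v = new_v
        if done:
            break
    return v


# Hypothetical toy MDP (an assumption for illustration only).
r = [[0.0, 1.0], [2.0, 0.0]]      # r[s][a]
P = [
    [[1.0, 0.0], [0.0, 1.0]],     # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],     # state 1: action 0 stays, action 1 moves to state 0
]
gamma = 0.5
uniform = [[0.5, 0.5], [0.5, 0.5]]

v_pi = evaluate_policy(uniform, r, P, gamma)
print(v_pi)
```

For this toy MDP the linear system can also be solved by hand, giving \(v_\pi=(1.25, 1.75)\), which the iteration should reproduce up to the tolerance.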
Policy Iteration
Given a policy \(\pi\in\mathbf R^{N\times M}\), solving the Bellman Optimality Equation, we can get its value function \(v_\pi\). Now we seek to improve the policy, guided by the value function. Note that given the value function, we can get the action-value function as $$ q_\pi(s,a)=r(s,a)+\gamma\sum_{s'\in\mathscr S}P(s,a,s')v_\pi(s'). $$
It turns out that this can guide us to find an optimal policy [1, Theorem 3.11]:
- Suppose that we have \(q_\pi(s, a)\le v_\pi(s)\) for all \((s,a)\in\mathscr S\times\mathscr A\). Then the policy \(\pi\) is optimal.
- Otherwise, let \(\pi'(s)=\mathrm{argmax}_{a\in\mathscr A}q_\pi(s, a)\). Then we get \(\pi'>\pi\).
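Repeating this improvement step until the first case holds is the Policy Iteration algorithm. It finds an optimal deterministic policy in a finite number of steps, since there are only finitely many deterministic policies and each step strictly improves the current one.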
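The evaluate-then-improve loop can be sketched as follows. This is an illustration under stated assumptions: the toy MDP, the nested-list layout `r[s][a]`, `P[s][a][s']`, and all function names are hypothetical, and the policy is evaluated by repeated sweeps rather than by solving the linear system exactly.

```python
def q_from_v(v, r, P, gamma):
    """q(s, a) = r(s, a) + gamma * sum_t P(s, a, t) * v(t)."""
    states = range(len(v))
    return [
        [r[s][a] + gamma * sum(P[s][a][t] * v[t] for t in states) for a in range(len(r[s]))]
        for s in states
    ]


def evaluate(policy, r, P, gamma, sweeps=500):
    """Approximate v_pi for a deterministic policy (policy[s] is an action index)."""
    v = [0.0] * len(r)
    for _ in range(sweeps):
        q = q_from_v(v, r, P, gamma)
        v = [q[s][policy[s]] for s in range(len(v))]
    return v


def policy_iteration(r, P, gamma, tol=1e-9):
    policy = [0] * len(r)                      # arbitrary starting policy
    while True:
        v = evaluate(policy, r, P, gamma)
        q = q_from_v(v, r, P, gamma)
        # Optimality test: q_pi(s, a) <= v_pi(s) for every state-action pair.
        if all(q[s][a] <= v[s] + tol for s in range(len(v)) for a in range(len(q[s]))):
            return policy, v
        # Improvement step: act greedily with respect to q_pi.
        policy = [row.index(max(row)) for row in q]


# Hypothetical toy MDP (an assumption for illustration only).
r = [[0.0, 1.0], [2.0, 0.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
best_policy, best_v = policy_iteration(r, P, gamma=0.5)
print(best_policy, best_v)
```

In this toy MDP the agent should move from state 0 to state 1 and then stay there, so the loop is expected to stop after one improvement step.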
Value Iteration
Note that in Policy Iteration, before each policy improvement step \(\pi\leftarrow\pi'\), we solve the Bellman Optimality Equation \(T^\pi v=v\). This may be expensive for a large finite MDP. Thus, in Value Iteration, we make improvements to the policy and the value function estimate in tandem.
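Given an optimal value function estimate \(\tilde v_*\in\mathbf R^N\), we get an optimal action-value estimate $$ \tilde q_*(s,a)=r(s,a)+\gamma\sum_{s'\in\mathscr S}P(s,a,s')\tilde v_*(s'). $$
This in turn gives us a new optimal value function estimate $$ (T\tilde v_*)(s)=\max_{a\in\mathscr A}\tilde q_*(s,a). $$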
The operator \(T\) is called the Bellman Optimality Operator. The Banach Fixed-Point Theorem can be applied to this operator too to yield the following result [1, Theorem 3.6]:
- For any vector \(\tilde v_*\in\mathbf R^N\), we have \(\lim_nT^n\tilde v_*=v_*\).
- For any vector \(\tilde v_*\in\mathbf R^N\), we have \(T\tilde v_*=\tilde v_*\) if and only if we have \(\tilde v_* = v_*\).
Thus, iteratively applying \(T\) converges to the optimal value function. Based on this, we get the Value Iteration algorithm, which gives an approximate optimal value function and an approximately optimal policy [1, Theorem 3.24]:
- Select a threshold \(\epsilon>0\) and a starting approximate optimal value function \(\tilde v_*\in\mathbf R^N\).
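- Repeat the update \(\tilde v_*\leftarrow T\tilde v_*\), stopping after the first update that changes the estimate by at most the threshold, that is, the first update before which we had \(\|T\tilde v_*-\tilde v_*\|_\infty\le\frac{(1-\gamma)\epsilon}{2\gamma}\).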
- Letting \(\pi(s)=\mathrm{argmax}_{a\in\mathscr A}\tilde q_*(s, a)\), we get (see footnote 2):
  - an \(\epsilon\)-optimal policy, that is \(\|v_\pi-v_*\|_\infty\le\epsilon\) and
  - an \(\frac{\epsilon}{2}\)-approximation of \(v_*\), that is \(\|\tilde v_*-v_*\|_\infty\le\frac{\epsilon}{2}\).
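The steps above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical toy MDP in nested-list form (`r[s][a]`, `P[s][a][s']`); the function names are made up for this example. The sketch returns the estimate obtained after the final update and the policy that is greedy with respect to it.

```python
def q_from_v(v, r, P, gamma):
    """q(s, a) = r(s, a) + gamma * sum_t P(s, a, t) * v(t)."""
    states = range(len(v))
    return [
        [r[s][a] + gamma * sum(P[s][a][t] * v[t] for t in states) for a in range(len(r[s]))]
        for s in states
    ]


def value_iteration(r, P, gamma, eps):
    threshold = (1.0 - gamma) * eps / (2.0 * gamma)
    v = [0.0] * len(r)                                           # starting estimate
    while True:
        new_v = [max(row) for row in q_from_v(v, r, P, gamma)]   # T v
        change = max(abs(x - y) for x, y in zip(new_v, v))       # sup-norm of T v - v
        v = new_v
        if change <= threshold:
            break
    q = q_from_v(v, r, P, gamma)
    policy = [row.index(max(row)) for row in q]                  # greedy policy
    return v, policy


# Hypothetical toy MDP (an assumption for illustration only).
r = [[0.0, 1.0], [2.0, 0.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
eps = 1e-3
v_approx, greedy_policy = value_iteration(r, P, gamma=0.5, eps=eps)
print(v_approx, greedy_policy)
```

For this toy MDP the optimal values are \(v_*=(3, 4)\), so the returned estimate should lie within \(\epsilon/2\) of them.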
Reward and Transition Distributions Unknown
In what follows, we'll study an algorithm that does not require knowledge of the reward and transition distributions. Note that this is a more realistic scenario: we can only base our model on actual experience.
Q-Learning
This is an iterative algorithm. Let \(Q_n(s, a)\) denote the approximate action-value function at step \(n\). Then we get the approximate value function $$ V_n(s):=\max_{a\in\mathscr A(s)}Q_n(s, a). $$
We take infinitely many steps by restarting the MDP whenever we reach a terminal state. We also have learning rates \(\alpha_n\in[0,1)\).
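Suppose that at step \(n\), in state \(s_n\in\mathscr S\), we choose \(a_n\in\mathscr A(s_n)\), receive reward \(r_n\) and arrive in state \(s_{n+1}\in\mathscr S\). Then we let $$ Q_{n+1}(s,a)=\begin{cases}(1-\alpha_n)Q_n(s,a)+\alpha_n\bigl(r_n+\gamma V_n(s_{n+1})\bigr)&\text{if }(s,a)=(s_n,a_n),\\Q_n(s,a)&\text{otherwise.}\end{cases} $$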
Theorem [2] Suppose that the following conditions hold:
- The initial values \(Q_0(s,a)\) are arbitrary, besides having \(Q_0(s,a)=0\) for terminal states \(s\).
- The rewards are bounded: \(|r_n|\le R\) for some \(R\ge0\).
- For \(1\le i\), let \(n^i(s,a)\) denote the index \(n\) of the \(i\)-th time such that \(s_n=s\) and \(a_n=a\). Then for all \(s\in\mathscr S\) and \(a\in\mathscr A(s)\), we have $$ \sum_{i=1}^\infty\alpha_{n^i(s,a)}=\infty\text{ and } \sum_{i=1}^\infty\alpha_{n^i(s,a)}^2<\infty. $$
Then for all \(s\in\mathscr S\) and \(a\in\mathscr A(s)\), we have $$ Q_n(s,a)\to q_*(s,a)\text{ as }n\to\infty $$ with probability 1.
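A tabular implementation of this update is short. The sketch below assumes a hypothetical deterministic two-state MDP without terminal states (`next_state` and `reward` are illustrative names), a uniformly random behaviour policy, and learning rates \(1/2, 1/3, \dots\) per state-action pair, which is one choice that satisfies the two summation conditions of the theorem.

```python
import random


def q_learning_update(Q, s, a, reward, s_next, alpha, gamma):
    """One in-place Q-learning update for the visited pair (s, a)."""
    target = reward + gamma * max(Q[s_next])          # r_n + gamma * V_n(s_{n+1})
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target


# Hypothetical toy MDP (an assumption for illustration only).
next_state = [[0, 1], [1, 0]]                         # next_state[s][a]
reward = [[0.0, 1.0], [2.0, 0.0]]                     # reward[s][a]
gamma = 0.5

rng = random.Random(0)                                # seeded for reproducibility
Q = [[0.0, 0.0], [0.0, 0.0]]
visits = [[0, 0], [0, 0]]
s = 0
for _ in range(20_000):
    a = rng.randrange(2)                              # every action keeps being tried
    visits[s][a] += 1
    alpha = 1.0 / (1 + visits[s][a])                  # 1/2, 1/3, ... for each pair
    s_next = next_state[s][a]
    q_learning_update(Q, s, a, reward[s][a], s_next, alpha, gamma)
    s = s_next

print(Q)
```

For this toy MDP, a hand calculation gives \(q_*(0,\cdot)=(1.5, 3)\) and \(q_*(1,\cdot)=(4, 1.5)\); the table should drift towards these values, though convergence with such decaying learning rates can be slow.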
Q-Learning in Practice
So, we have a theorem that says if we follow Q-learning, then eventually, we get a nice policy. But how can we make this quicker?
Greedy Policy for Evaluation
When we want to evaluate our progress, that is, obtain as large a discounted return as possible, we use the deterministic policy that maximizes the approximate action values: $$ \pi_\text{greedy}(s)=\mathop{\mathrm{argmax}}_{a\in\mathscr A(s)}Q_n(s,a). $$
\(\epsilon\)-Greedy Policy for Training
Note that for training, the theorem requires us to use policies that don't rule out any action entirely. Still, we would like to visit more promising states more often. A crude way to balance this is to use an \(\epsilon\)-greedy policy: This is the stochastic policy such that:
- with probability \(\epsilon\), we choose among the actions \(\mathscr A(s)\) uniformly, and
- otherwise, we use the greedy policy.
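In other words, we have: $$ \mathbf P(A=a|S=s)=\begin{cases}1-\epsilon+\frac{\epsilon}{|\mathscr A(s)|}&\text{if }a=\pi_\text{greedy}(s),\\\frac{\epsilon}{|\mathscr A(s)|}&\text{otherwise.}\end{cases} $$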
It is common practice to use a hyperparameter schedule for \(\epsilon\): we start with \(\epsilon=1\), as at first we have no knowledge of the environment, and we decrease it as training progresses. In today's lab, we use a static schedule: with a fixed number of training steps, we linearly decrease \(\epsilon\) from 1 to 0.
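A minimal sketch of \(\epsilon\)-greedy action selection with a linear schedule could look as follows. The function names and the example action values are hypothetical and are not taken from the lab code.

```python
import random


def linear_epsilon(step, total_steps):
    """Linearly decrease epsilon from 1 at step 0 to 0 at step total_steps."""
    return max(0.0, 1.0 - step / total_steps)


def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of the epsilon-greedy policy in one state."""
    n = len(q_row)
    greedy = q_row.index(max(q_row))
    probs = [epsilon / n] * n          # uniform exploration share for every action
    probs[greedy] += 1.0 - epsilon     # remaining mass goes to the greedy action
    return probs


def epsilon_greedy_action(q_row, epsilon, rng):
    """Sample an action: uniform with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))


rng = random.Random(0)
q_row = [1.0, 3.0, 2.0]                # hypothetical action values for one state
for step in (0, 50, 100):
    eps = linear_epsilon(step, total_steps=100)
    print(step, eps, epsilon_greedy_probs(q_row, eps), epsilon_greedy_action(q_row, eps, rng))
```

Sampling first whether to explore and then picking uniformly among all actions matches the two-case description above: the greedy action can also be drawn during exploration, which is why its probability is \(1-\epsilon+\epsilon/|\mathscr A(s)|\).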
Optimistic Initialization
Note that we are free to pick the values \(Q_0(s, a)\) for non-terminal states \(s\in\mathscr S\). As the greedy and \(\epsilon\)-greedy policies favor actions with large values of \(Q_n\), the larger the initial values, the more the agent will explore: rarely tried state-action pairs receive fewer updates and so keep their high initial values for longer. Setting the initial values high is called optimistic initialization: the agent will think it is a good idea to try unexplored actions.
Validation in Reinforcement Learning
Note that in RL, there is no distinct validation set. Instead, a new dataset is generated in each episode. So we validate a policy by running a number of episodes. We'll see in the lab that this can be done in parallel to save time.
References
[1] Lodewijk Kallenberg: Markov Decision Processes. https://pub.math.leidenuniv.nl/~kallenberglcm//Lecture-notes-MDP.pdf
[2] Christopher J.C.H. Watkins and Peter Dayan: \(Q\)-Learning, 1992. Machine Learning, Volume 8, pp. 279-292. https://link.springer.com/article/10.1007/BF00992698
1. Viewing the transition map \(\mathscr S\times\mathscr A\xrightarrow T\mathscr S\) as a 3-tensor \(P\in\mathbf R^{N\times M\times N}\), the transition matrix \(P(\pi)\) is the tensor dot product of \(P\) and \(\pi\) along the second dimension.
2. For a vector \(\mathbf x\in\mathbf R^N\), the notation \(\|\mathbf x\|_\infty\) is for the supremum norm: \(\|\mathbf x\|_\infty=\max_{1\le i\le N}|x_i|\).