3/7 Introduction to Reinforcement Learning

Today, we'll get acquainted with a new task type: Reinforcement Learning (RL). Here, we're teaching a model to make decisions based on a reward signal. The entity that makes decisions, or acts, is usually called an agent.

Note that this is closer to how living organisms learn: usually, we are not shown exactly what we need to do, but we get feedback on our actions and adapt accordingly.

This is a huge area of research. A nice, detailed reference textbook for non-deep reinforcement learning is Sutton--Barto: Reinforcement Learning: An Introduction [1]. It has numerous exercises to help you learn. In particular, implementing the toy environments they describe is pretty good coding practice.

Markov Decision Processes

We formalize many such problems as a Markov Decision Process (MDP). This is given by the following:

State Space

The state space $\mathscr S$ collects the possible states of the system.

It can be discrete, for example the set of possible positions in a board game.
It can be continuous, for example the position, velocity and acceleration of all players and the ball in a simulated ball game.

Markov Property

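The Markov Property is the assumption that the system is entirely described by its current state: what happens next depends only on the current state (and the action taken), not on how the system got there. This is why I mentioned a simulated ball game: the state I gave does not describe, e.g., the weather during a real game.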

Initial State

We let $S_0$ denote the random variable whose value is the initial state of the system. That is, for each state $s\in\mathscr S$, we are given a probability $p_0(s):=p(S_0=s)$ that the system starts in state $s$.
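As a minimal sketch of what "being given $p_0$" can look like in code, the snippet below stores the initial distribution as a dictionary and samples $S_0$ from it. The four states and their probabilities are made up for illustration; they are not taken from any particular environment.

```python
import random

# Hypothetical 4-state system: p0[s] is the probability of starting in state s.
p0 = {0: 0.5, 1: 0.5, 2: 0.0, 3: 0.0}

def sample_initial_state(p0, rng):
    """Draw a value of S_0 according to the initial state distribution p0."""
    states = list(p0)
    weights = [p0[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [sample_initial_state(p0, rng) for _ in range(1000)]
```

States with $p_0(s)=0$ never show up in `samples`, so here every episode would start in state 0 or 1.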

Terminal State

Oftentimes there are special states: terminal states. If the system enters such a state, then the process ends.

Action Space

When in state $s\in\mathscr S$, the agent can take one of the actions in the action space $\mathscr A(s)$. This can also be discrete or continuous. For convenience, usually we take a total action space $\mathscr A=\bigcup_{s\in\mathscr S}\mathscr A(s)$ and make actions $a\in\mathscr A\setminus\mathscr A(s)$ invalid (illegal) when in state $s$ by either

stopping the process at once or
making the agent inert, if that makes sense in the environment.

For terminal states $s_T$, we let $\mathscr A(s_T)=\{*\}$.
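The two ways of handling invalid actions can be sketched as follows. The corridor environment (states 0 to 3, state 3 terminal), the action names and the helper functions are all hypothetical; they only serve to show the difference between the two conventions.

```python
# Hypothetical corridor with states 0..3; state 3 is terminal.
LEFT, RIGHT, NOOP = "left", "right", "*"
ALL_ACTIONS = {LEFT, RIGHT, NOOP}  # the total action space A

def valid_actions(s):
    """The state-dependent action space A(s)."""
    if s == 3:   # terminal state: only the placeholder action *
        return {NOOP}
    if s == 0:   # cannot move left at the left wall
        return {RIGHT}
    return {LEFT, RIGHT}

def move(s, a):
    return s - 1 if a == LEFT else s + 1

def step_stop(s, a):
    """Convention 1: an invalid action stops the process. Returns (next_state, done)."""
    if a not in valid_actions(s):
        return s, True
    s_next = s if a == NOOP else move(s, a)
    return s_next, s_next == 3

def step_inert(s, a):
    """Convention 2: an invalid action makes the agent inert (the state does not change)."""
    if a not in valid_actions(s) or a == NOOP:
        return s
    return move(s, a)
```

For example, `step_stop(0, LEFT)` ends the process in state 0, while `step_inert(0, LEFT)` simply leaves the agent in state 0 and lets it continue.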

Policy

The agent can be formalized as a mapping from states to actions, called a policy: it specifies how the agent reacts when the system is in a given state. It is usually denoted by $\pi$.

The policy can be deterministic. That is, we have a mapping $\mathscr S\xrightarrow\pi\mathscr A$: given a state $s\in\mathscr S$, the policy will always take the same action $\pi(s)\in\mathscr A(s)$.
The policy can be stochastic. That is, we have a mapping $\mathscr S\xrightarrow\pi\mathscr P(\mathscr A)$: given a state $s\in\mathscr S$, the probability that the agent will take action $a\in\mathscr A(s)$ is $\pi(a|s)$.

Given a state random variable $S_t$, for example the initial state random variable at time step $t=0$, we get an agent action random variable $A_t$.
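Here is a small sketch of both kinds of policy for a discrete state and action space. The states, action names and probabilities are invented for illustration. It also shows that a deterministic policy can be viewed as a stochastic one that puts probability 1 on a single action.

```python
import random

ACTIONS = ["left", "right"]  # hypothetical action space

# Deterministic policy: exactly one action pi(s) per state.
pi_det = {0: "right", 1: "right", 2: "right"}

# Stochastic policy: pi_sto[s][a] plays the role of pi(a|s).
pi_sto = {s: {"left": 0.2, "right": 0.8} for s in range(3)}

def sample_action(pi, s, rng):
    """Draw a value of A_t from the distribution pi(.|s)."""
    dist = pi[s]
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions], k=1)[0]

def as_stochastic(pi_det):
    """View a deterministic policy as a stochastic one concentrated on pi(s)."""
    return {
        s: {a: 1.0 if a == chosen else 0.0 for a in ACTIONS}
        for s, chosen in pi_det.items()
    }

rng = random.Random(0)
a_t = sample_action(pi_sto, 0, rng)  # "left" or "right"
```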

Partially Observable Markov Decision Processes (POMDP)

Often, it is impossible to obtain the actual, full state of a system. We can only gather partial information, summarized as an observation. Therefore, in this case, the policy is a function of observations and not states.
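To make the difference concrete, here is a toy sketch in which the state is a (position, velocity) pair but the agent only observes the position. The functions `observe` and `policy` and the decision rule are hypothetical.

```python
def observe(state):
    """Hypothetical observation: the agent sees the position but not the velocity."""
    position, _velocity = state
    return position

def policy(observation):
    """In a POMDP the policy takes the observation as input, not the full state."""
    return "left" if observation > 0 else "right"

# Two different states that look identical to the agent.
s1 = (1.0, -2.0)
s2 = (1.0, 3.0)
a1 = policy(observe(s1))
a2 = policy(observe(s2))
```

Since `s1` and `s2` yield the same observation, the policy necessarily picks the same action in both, even if different actions would be better.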

Transitions

If the system is in state $s\in\mathscr S$ and the agent takes action $a\in\mathscr A(s)$, then the agent receives a reward $r\in\mathbf R$ while the system transitions to another state $s'\in\mathscr S$. Just like a policy, this can also be either:

Deterministic: there is a fixed mapping $\mathscr S\times\mathscr A\xrightarrow t\mathbf R\times\mathscr S$ from state-action pairs $(s,a)$ to reward-next state pairs $(r,s')$.
Stochastic: the system is governed by a mapping $\mathscr S\times\mathscr A\xrightarrow{p(r,s'|s,a)}\mathscr P(\mathbf R\times\mathscr S)$ from state-action pairs to distributions of reward-next state pairs.

For a terminal state $s_T$, we let $t(s_T,*)=(0, s_T)$.

Given a state-action random variable pair $(S_t, A_t)$, we get a reward-next state random variable pair $(R_{t+1}, S_{t+1})$.
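A stochastic transition $p(r,s'|s,a)$ over a finite state space can be sketched as a list of (probability, reward, next state) triples per state-action pair. The corridor below, its slip probability of 0.1 and its reward of 1 for reaching state 3 are made-up choices for illustration.

```python
import random

TERMINAL = 3  # hypothetical corridor with states 0..3

def reward(s_next):
    return 1.0 if s_next == TERMINAL else 0.0

def outcomes(s, a):
    """List of (probability, reward, next_state) triples, i.e. p(r, s' | s, a)."""
    if s == TERMINAL:
        return [(1.0, 0.0, TERMINAL)]  # t(s_T, *) = (0, s_T)
    left, right = max(s - 1, 0), s + 1
    intended, slipped = (left, right) if a == "left" else (right, left)
    return [(0.9, reward(intended), intended), (0.1, reward(slipped), slipped)]

def step(s, a, rng):
    """Sample (R_{t+1}, S_{t+1}) given S_t = s and A_t = a."""
    probs, rewards, next_states = zip(*outcomes(s, a))
    i = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return rewards[i], next_states[i]

rng = random.Random(0)
r, s_next = step(2, "right", rng)  # (1.0, 3) with probability 0.9, else (0.0, 1)
```

A deterministic transition is the special case where each list contains a single triple with probability 1, as for the terminal state here.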

Trajectories

Thus, the MDP and the policy give rise to a sequence or trajectory $$ S_0,A_0,R_1,S_1,A_1,\dotsc $$

If the chain continues until a terminal state $S_T$ is reached, we call it an episode.
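Generating an episode amounts to alternating between the policy and the transition until a terminal state is reached. The sketch below does this for a hypothetical corridor environment with a fixed "always go right" policy. The `max_steps` cap is an added safety measure and not part of the formal definition.

```python
import random

TERMINAL = 3  # hypothetical corridor with states 0..3

def step(s, a, rng):
    """Made-up dynamics: the intended move happens with probability 0.9."""
    if rng.random() >= 0.9:
        a = "left" if a == "right" else "right"  # slip: the opposite move happens
    s_next = max(s - 1, 0) if a == "left" else s + 1
    return (1.0 if s_next == TERMINAL else 0.0), s_next

def policy(s):
    return "right"

def run_episode(s0, rng, max_steps=1000):
    """Return the list of (S_t, A_t, R_{t+1}) triples and the last state reached."""
    trajectory, s = [], s0
    for _ in range(max_steps):
        if s == TERMINAL:
            break
        a = policy(s)
        r, s_next = step(s, a, rng)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory, s

trajectory, final_state = run_episode(0, random.Random(0))
```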

Returns

At each time step $t$, we consider the return $$ G_t=\sum_{k=0}^\infty\gamma^kR_{t+k+1}. $$ Here, the value $\gamma\in(0, 1]$ is the discount factor, a hyperparameter. Values smaller than 1 give the agent an incentive to obtain rewards as soon as it can. In an episode, all rewards after the terminal state are 0, so the sum is finite even for $\gamma=1$. A common default is $\gamma=0.99$.
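For a finite episode, the return can be computed by going through the rewards backwards, using the identity $G_t=R_{t+1}+\gamma G_{t+1}$ that follows directly from the definition. The function names below are our own; this is a sketch, assuming all rewards after the end of the list are 0.

```python
def discounted_return(rewards, gamma):
    """G_t for the finite reward list [R_{t+1}, R_{t+2}, ...]; later rewards count as 0."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

def all_returns(rewards, gamma):
    """The list [G_0, G_1, ..., G_{T-1}] for an episode with rewards [R_1, ..., R_T]."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

rewards = [0.0, 0.0, 1.0]  # a single reward of 1 at the end of the episode
print(discounted_return(rewards, gamma=0.5))  # 0.25
print(all_returns(rewards, gamma=0.5))        # [0.25, 0.5, 1.0]
```

With $\gamma=0.5$, the same final reward is worth less the earlier we look: the agent would do better by reaching it sooner.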

Value- and Action-Value Functions

For a random variable $X$, we denote by $\mathbf E_\pi X$ the expected value of $X$ provided policy $\pi$ is followed.

Two very important functions using this are:

The Value Function under $\pi$: Given a state $s\in\mathscr S$, this is $$ v_\pi(s):=\mathbf E_\pi(G_t|S_t=s), $$ the expected return of trajectories starting with $s$, following $\pi$.
The Action-Value Function under $\pi$: If furthermore we fix an action $a\in\mathscr A(s)$, this is $$ q_\pi(s,a):=\mathbf E_\pi(G_t|S_t=s,A_t=a), $$ the expected return of trajectories starting with $s$ and $a$, following $\pi$.
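Since both functions are expected values of returns, one simple way to approximate them is to average sampled returns. The sketch below does this on a hypothetical deterministic corridor (states 0 to 3, reward 1 for reaching state 3) with $\gamma=0.9$. The environment, the policies and the function names are assumptions made for illustration, and plain averaging is just one of several ways to estimate these quantities.

```python
import random

GAMMA = 0.9
TERMINAL = 3

def step(s, a):
    """Hypothetical deterministic corridor; entering state 3 pays reward 1."""
    s_next = max(s - 1, 0) if a == "left" else s + 1
    return (1.0 if s_next == TERMINAL else 0.0), s_next

def sample_action(pi, s, rng):
    actions = list(pi[s])
    return rng.choices(actions, weights=[pi[s][a] for a in actions], k=1)[0]

def sample_return(pi, s, rng, first_action=None, max_steps=10_000):
    """One sample of G_t given S_t = s (and optionally A_t = first_action), then following pi."""
    g, discount, a = 0.0, 1.0, first_action
    for _ in range(max_steps):
        if s == TERMINAL:
            break
        if a is None:
            a = sample_action(pi, s, rng)
        r, s = step(s, a)
        g += discount * r
        discount *= GAMMA
        a = None  # only the first action is forced
    return g

def estimate_v(pi, s, rng, n=2000):
    """Average of n sampled returns: an estimate of v_pi(s)."""
    return sum(sample_return(pi, s, rng) for _ in range(n)) / n

def estimate_q(pi, s, a, rng, n=2000):
    """Average of n sampled returns starting with action a: an estimate of q_pi(s, a)."""
    return sum(sample_return(pi, s, rng, first_action=a) for _ in range(n)) / n

always_right = {s: {"left": 0.0, "right": 1.0} for s in range(3)}
uniform = {s: {"left": 0.5, "right": 0.5} for s in range(3)}
rng = random.Random(0)
v0 = estimate_v(always_right, 0, rng, n=10)          # 0.9**2 = 0.81
q1 = estimate_q(always_right, 1, "left", rng, n=10)  # 0.9**3 = 0.729
```

Under `always_right` everything is deterministic, so the estimates are exact: starting in state 0 takes three steps to reach the reward, giving $\gamma^2=0.81$, while forcing a first step to the left from state 1 costs one extra step in each direction, giving $\gamma^3=0.729$.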

Optimal Policy

Take two policies $\pi$ and $\pi'$. Then we write $\pi\ge\pi'$ if we have $$ v_\pi(s)\ge v_{\pi'}(s)\text{ for all }s\in\mathscr S. $$ Note that this relation equips the set of policies with a partially ordered set (poset) structure, once we identify policies with the same value function. In particular, two policies can be incomparable. A greatest element of this set is called an optimal policy $\pi^*$. That is, we need $\pi^*\ge\pi$ for all policies $\pi$.
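The comparison of policies can be sketched directly from the definition. The three value functions below are made-up numbers on three states, not computed from an actual MDP; they only illustrate that two policies need not be comparable.

```python
def dominates(v_a, v_b):
    """True if v_a(s) >= v_b(s) for every state s, i.e. policy a >= policy b."""
    return all(v_a[s] >= v_b[s] for s in v_a)

# Hypothetical value functions of three policies on the states 0, 1, 2.
v1 = {0: 0.81, 1: 0.90, 2: 1.00}
v2 = {0: 0.50, 1: 0.60, 2: 1.00}
v3 = {0: 0.90, 1: 0.40, 2: 0.70}

print(dominates(v1, v2))                     # True
print(dominates(v1, v3), dominates(v3, v1))  # False False
```

The first policy is at least as good as the second in every state, but the first and the third are incomparable: each is better in some state. An optimal policy has to dominate every other policy in this sense.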

At this point, one can ask which MDPs have an optimal policy. See [2] for a nice introduction with proofs. For example, in the case of Frozen Lake [3], the MDP we solve in today's lab, an optimal policy exists by [2, Corollary 3.3].

References

[1] Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction, second edition, 2018. The MIT Press, Cambridge, Massachusetts, London, England. http://incompleteideas.net/book/the-book-2nd.html

[2] Lodewijk Kallenberg: Markov Decision Processes. https://pub.math.leidenuniv.nl/~kallenberglcm//Lecture-notes-MDP.pdf

Dataset References

[3] Frozen Lake https://gymnasium.farama.org/environments/toy_text/frozen_lake/