3/7 Introduction to Reinforcement Learning

Today, we'll get acquainted with a new task type: Reinforcement Learning (RL). Here, we're teaching a model to make decisions based on a reward signal. The entity that makes decisions, or acts, is usually called an agent.

Note that this is closer to how living organisms learn: usually, we are not shown exactly what to do. Instead, we get feedback on our actions and adapt accordingly.

This is a huge area of research. A nice, detailed reference textbook for non-deep reinforcement learning is Sutton and Barto's Reinforcement Learning: An Introduction [1]. It has numerous exercises to help you learn. In particular, implementing the toy environments they describe is good coding practice.

Markov Decision Processes

We formalize many such problems as a Markov Decision Process (MDP). This is given by the following:

State Space

The state space \(\mathscr S\) collects the possible states of the system.

  1. It can be discrete, for example the set of possible positions in a board game.
  2. It can be continuous, for example the position, velocity and acceleration of all players and the ball in a simulated ball game.

Markov Property

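The Markov Property is the assumption that the system is entirely described by its state: the distribution of the next state and reward depends only on the current state and action, not on the earlier history. This is why I mentioned a simulated ball game: the state I gave does not describe, e.g., the weather during a real game.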

Initial State

We let \(S_0\) denote the random variable whose value is the initial state of the system. That is, for each state \(s\in\mathscr S\), we are given the probability \(p_0(s):=p(S_0=s)\) that the system starts in state \(s\).
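As a minimal sketch of the initial state distribution, the snippet below samples \(S_0\) from a made-up \(p_0\) on three states. The state names, the probabilities and the helper `sample_initial_state` are illustrative assumptions, not part of any particular environment.

```python
import random

# Hypothetical initial-state distribution p_0 over a discrete state space.
p0 = {"s0": 0.5, "s1": 0.3, "s2": 0.2}

def sample_initial_state(p0, rng):
    """Draw one sample of S_0 according to p_0."""
    states = list(p0)
    weights = [p0[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [sample_initial_state(p0, rng) for _ in range(10_000)]

# Empirical frequencies should be close to p_0.
freq = {s: samples.count(s) / len(samples) for s in p0}
```

With enough samples, `freq[s]` approaches `p0[s]` for each state.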

Terminal State

Oftentimes there are special states: terminal states. If the system enters such a state, then the process ends.

Action Space

When in state \(s\in\mathscr S\), the agent can take one of the actions in the action space \(\mathscr A(s)\). This can also be discrete or continuous. For convenience, we usually take a total action space \(\mathscr A=\bigcup_{s\in\mathscr S}\mathscr A(s)\) and make actions \(a\in\mathscr A\setminus\mathscr A(s)\) invalid (illegal) when in state \(s\) by either

  1. stopping the process at once or
  2. making the agent inert, if that makes sense in the environment.

For terminal states \(s_T\), we let \(\mathscr A(s_T)=\{*\}\), a single dummy action.

Policy

The agent can be formalized as a mapping from states to actions, called a policy: it specifies how the agent reacts when the system is in a given state. It is usually denoted by \(\pi\).

  1. The policy can be deterministic. That is, we have a mapping \(\mathscr S\xrightarrow\pi\mathscr A\): given a state \(s\in\mathscr S\), the policy will always take the same action \(\pi(s)\in\mathscr A(s)\).
  2. The policy can be stochastic. That is, we have a mapping \(\mathscr S\xrightarrow\pi\mathscr P(\mathscr A)\), where \(\mathscr P(\mathscr A)\) denotes the set of probability distributions on \(\mathscr A\): given a state \(s\in\mathscr S\), the probability that the agent will take action \(a\in\mathscr A(s)\) is \(\pi(a|s)\).

Given a state random variable \(S_t\), for example the initial state random variable at time step \(t=0\), we get an agent action random variable \(A_t\).
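The two kinds of policy can be sketched in a few lines of Python. The states `0, 1, 2`, the actions `"left"` and `"right"`, and the helper names below are assumptions made up for illustration.

```python
import random

ACTIONS = ["left", "right"]

# Deterministic policy: a plain mapping from states to actions, s -> pi(s).
deterministic_pi = {0: "right", 1: "right", 2: "left"}

# Stochastic policy: stochastic_pi[s][a] is the probability pi(a|s).
stochastic_pi = {
    0: {"left": 0.1, "right": 0.9},
    1: {"left": 0.5, "right": 0.5},
    2: {"left": 0.8, "right": 0.2},
}

def sample_action(pi, s, rng):
    """Draw A_t from pi(.|S_t = s)."""
    actions = list(pi[s])
    probs = [pi[s][a] for a in actions]
    return rng.choices(actions, weights=probs, k=1)[0]

def as_stochastic(det_pi, actions):
    """View a deterministic policy as a stochastic one with all mass on pi(s)."""
    return {s: {a: 1.0 if a == det_pi[s] else 0.0 for a in actions} for s in det_pi}

rng = random.Random(0)
a = sample_action(stochastic_pi, 0, rng)
```

The helper `as_stochastic` shows that a deterministic policy is a special case of a stochastic one, so code written for stochastic policies covers both.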

Partially Observable Markov Decision Processes (POMDP)

Often, it is impossible to obtain the actual, full state of a system. We can only gather partial information, summarized as an observation. In this case, the policy is therefore a function of observations and not of states.
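A small sketch of the difference, assuming a hypothetical system whose full state is a position-velocity pair while the sensor only reports the position. The names `State`, `observe` and `policy` are illustrative.

```python
from typing import NamedTuple

class State(NamedTuple):
    position: float
    velocity: float

def observe(state: State) -> float:
    """Hypothetical sensor: only the position is visible, the velocity is hidden."""
    return state.position

def policy(observation: float) -> str:
    """A policy defined on observations: push towards the origin."""
    return "left" if observation > 0 else "right"

# Two different states that the agent cannot tell apart.
s1 = State(position=1.0, velocity=2.0)
s2 = State(position=1.0, velocity=-2.0)
```

Because `observe(s1) == observe(s2)`, any policy on observations must act identically in both states, even though the states differ.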

Transitions

If the system is in state \(s\in\mathscr S\) and the agent takes action \(a\in\mathscr A(s)\), then the agent receives a reward \(r\in\mathbf R\) while the system transitions to another state \(s'\in\mathscr S\). Just like a policy, this can also be either:

  1. Deterministic: there is a fixed mapping \(\mathscr S\times\mathscr A\xrightarrow t\mathbf R\times\mathscr S\) from state-action pairs \((s,a)\) to reward-next state pairs \((r,s')\).
  2. Stochastic: the system is governed by a mapping \(\mathscr S\times\mathscr A\xrightarrow{p(r,s'|s,a)}\mathscr P(\mathbf R\times\mathscr S)\) from state-action pairs to distributions of reward-next state pairs.

For a terminal state \(s_T\), we let \(t(s_T,*)=(0, s_T)\).

Given a state-action random variable pair \((S_t, A_t)\), we get a reward-next state random variable pair \((R_{t+1}, S_{t+1})\).
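For a finite MDP, the stochastic transition mapping \(p(r,s'|s,a)\) can be stored as a table. The sketch below uses a made-up three-state chain; the states, rewards, probabilities and the helper `step` are assumptions for illustration only.

```python
import random

TERMINAL = 2

# p[(s, a)] lists (probability, reward, next_state) triples, i.e. p(r, s' | s, a).
p = {
    (0, "right"): [(0.8, 0.0, 1), (0.2, 0.0, 0)],
    (0, "left"): [(1.0, 0.0, 0)],
    (1, "right"): [(0.8, 1.0, TERMINAL), (0.2, 0.0, 1)],
    (1, "left"): [(1.0, 0.0, 0)],
    # Terminal state: the dummy action "*" yields reward 0 and stays put.
    (TERMINAL, "*"): [(1.0, 0.0, TERMINAL)],
}

def step(s, a, rng):
    """Sample (R_{t+1}, S_{t+1}) given (S_t, A_t) = (s, a)."""
    outcomes = p[(s, a)]
    weights = [prob for prob, _, _ in outcomes]
    _, r, s_next = rng.choices(outcomes, weights=weights, k=1)[0]
    return r, s_next

rng = random.Random(0)
r, s_next = step(0, "right", rng)
```

A deterministic transition is the special case where every list has a single triple with probability 1, as for the `"left"` actions here.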

Trajectories

Thus, the MDP and the policy give rise to a sequence or trajectory $$ S_0,A_0,R_1,S_1,A_1,\dotsc $$

If the chain continues until a terminal state \(S_T\) is reached, we call it an episode.
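Generating an episode amounts to alternating between the policy and the transition mapping until a terminal state is reached. The following is a minimal sketch on a hypothetical four-state corridor; the environment, the 0.8 success probability, the always-right policy and the function names are all assumptions.

```python
import random

GOAL = 3  # terminal state of a hypothetical corridor 0 - 1 - 2 - 3

def step(s, a, rng):
    """Move in the intended direction with probability 0.8, otherwise stay put."""
    if rng.random() < 0.8:
        s_next = min(GOAL, s + 1) if a == "right" else max(0, s - 1)
    else:
        s_next = s
    r = 1.0 if s_next == GOAL else 0.0
    return r, s_next

def policy(s):
    """A fixed deterministic policy: always go right."""
    return "right"

def rollout(s0, rng, max_steps=1000):
    """Return the list of (S_t, A_t, R_{t+1}) triples and the final state."""
    trajectory, s = [], s0
    for _ in range(max_steps):
        if s == GOAL:
            break
        a = policy(s)
        r, s_next = step(s, a, rng)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory, s

rng = random.Random(0)
episode, final_state = rollout(0, rng)
```

The `max_steps` cap is a practical safeguard: under a poor policy, the chain might never reach a terminal state.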

Returns

At each time step \(t\), we consider the return $$ G_t=\sum_{k=0}^\infty\gamma^kR_{t+k+1}. $$ Here, the value \(\gamma\in(0, 1]\) is the discount factor, a hyperparameter. Values smaller than 1 give the agent an incentive to obtain rewards as soon as it can. A common baseline is \(\gamma=0.99\). Since a terminal state only produces reward 0, the sum has finitely many nonzero terms for an episode.
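The definition implies the recursion \(G_t=R_{t+1}+\gamma G_{t+1}\), which gives a simple way to compute returns for a finished episode by sweeping backwards over the rewards. A minimal sketch follows; the function names and the reward sequence are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 for the reward sequence R_1, R_2, ..., R_T."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

def all_returns(rewards, gamma):
    """Compute G_t for every time step t = 0, ..., T-1 in one backward pass."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

rewards = [0.0, 0.0, 1.0]  # a single reward of 1 at the end of the episode

g0 = discounted_return(rewards, gamma=0.5)  # 0.5**2 * 1.0 = 0.25
gs = all_returns(rewards, gamma=0.5)        # [0.25, 0.5, 1.0]
```

With \(\gamma=1\), the same episode has return 1 at every time step, so the agent would have no incentive to reach the reward sooner.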

Value- and Action-Value Functions

For a random variable \(X\), we denote by \(\mathbf E_\pi X\) the expected value of \(X\) provided policy \(\pi\) is followed.

Two very important functions using this are:

  1. The Value Function under \(\pi\): Given a state \(s\in\mathscr S\), this is $$ v_\pi(s):=\mathbf E_\pi(G_t|S_t=s), $$ the expected return of trajectories starting with \(s\), following \(\pi\).
  2. The Action-Value Function under \(\pi\): If furthermore we fix an action \(a\in\mathscr A(s)\), this is $$ q_\pi(s,a):=\mathbf E_\pi(G_t|S_t=s,A_t=a), $$ the expected return of trajectories starting with \(s\) and \(a\), following \(\pi\).
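Since both functions are expected returns, one simple way to approximate them is to average sampled returns (a Monte Carlo estimate). This is only one possible estimation technique, chosen here for illustration. The three-state chain, the 0.8 success probability, \(\gamma=0.9\), the always-right policy and all function names are assumptions.

```python
import random

GOAL, GAMMA = 2, 0.9  # hypothetical chain 0 - 1 - 2 with terminal state 2

def step(s, a, rng):
    """Move in the intended direction with probability 0.8, otherwise stay put."""
    if rng.random() < 0.8:
        s_next = min(GOAL, s + 1) if a == "right" else max(0, s - 1)
    else:
        s_next = s
    return (1.0 if s_next == GOAL else 0.0), s_next

def policy(s, rng):
    """The policy pi under evaluation: always go right."""
    return "right"

def sample_return(s, rng, first_action=None, max_steps=10_000):
    """Sample G_t given S_t = s, optionally forcing A_t = first_action."""
    g, discount, a = 0.0, 1.0, first_action
    for _ in range(max_steps):
        if s == GOAL:
            break
        if a is None:
            a = policy(s, rng)
        r, s = step(s, a, rng)
        g += discount * r
        discount *= GAMMA
        a = None  # after the first step, always follow pi
    return g

def mc_v(s, rng, n=20_000):
    """Monte Carlo estimate of v_pi(s)."""
    return sum(sample_return(s, rng) for _ in range(n)) / n

def mc_q(s, a, rng, n=20_000):
    """Monte Carlo estimate of q_pi(s, a)."""
    return sum(sample_return(s, rng, first_action=a) for _ in range(n)) / n

rng = random.Random(0)
v0 = mc_v(0, rng)
q0_right = mc_q(0, "right", rng)
q0_left = mc_q(0, "left", rng)
```

For this toy chain, solving the expectation by hand gives \(v_\pi(0)\approx0.857\), so `v0` should land near that value. Since \(\pi\) always picks `"right"`, `q0_right` estimates the same quantity, while `q0_left` is smaller by roughly a factor of \(\gamma\): going left from state 0 wastes one step.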

Optimal Policy

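Take two policies \(\pi\) and \(\pi'\). Then we write \(\pi\ge\pi'\) if we have $$ v_\pi(s)\ge v_{\pi'}(s)\text{ for all }s\in\mathscr S. $$ Note that this relation is reflexive and transitive, so it equips the set of policies with a preorder; once we identify policies that have the same value function, we get a partially ordered set (poset) structure. A greatest element of this set is called an optimal policy \(\pi^*\). That is, we need \(\pi^*\ge\pi\) for all policies \(\pi\). Since two policies need not be comparable, the existence of such a \(\pi^*\) is not obvious.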
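To see why this order is not total, here is a small sketch comparing value functions state by state. The three value tables are made-up numbers, not computed from an actual MDP, and `dominates` is an illustrative helper.

```python
def dominates(v_a, v_b):
    """True if v_a(s) >= v_b(s) for every state s, i.e. policy a >= policy b."""
    return all(v_a[s] >= v_b[s] for s in v_a)

# Hypothetical value functions of three policies on a two-state MDP.
v_pi1 = {"s0": 1.0, "s1": 0.5}
v_pi2 = {"s0": 0.5, "s1": 1.0}
v_pi3 = {"s0": 1.0, "s1": 1.0}

# pi1 and pi2 are incomparable: each is better in one state.
incomparable = not dominates(v_pi1, v_pi2) and not dominates(v_pi2, v_pi1)

# pi3 is at least as good as both of them in every state.
pi3_is_best = dominates(v_pi3, v_pi1) and dominates(v_pi3, v_pi2)
```

An optimal policy has to play the role of `v_pi3` against every other policy at once, which is a stronger requirement than being better than any single competitor.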

At this point, one can ask which MDPs have an optimal policy. See [2] for a nice introduction with proofs. For example, in the case of Frozen Lake [3], the MDP we solve in today's lab, an optimal policy exists by [2, Corollary 3.3].

References

[1] Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction, second edition, 2018. The MIT Press, Cambridge, Massachusetts, London, England. http://incompleteideas.net/book/the-book-2nd.html

[2] Lodewijk Kallenberg: Markov Decision Processes. https://pub.math.leidenuniv.nl/~kallenberglcm//Lecture-notes-MDP.pdf

Dataset References

[3] Frozen Lake https://gymnasium.farama.org/environments/toy_text/frozen_lake/