It has been a pleasure reading through the second edition of the reinforcement learning (RL) textbook by Sutton and Barto, freely available online. From my day-to-day work, I am familiar with the vast majority of the textbook’s material, but there are still a few concepts that I have not fully internalized, or “grokked” if you prefer that terminology. Those concepts sometimes appear in the research literature that I read, and while I have intuition, a stronger understanding would be preferable.

Another motivating factor for me to read the textbook is that I work with function approximation and deep learning nearly every day, so I rarely get the chance to practice, or even review, the exact, tabular versions of the algorithms I’m using. I also don’t get to review the theory on those algorithms, because I work in neural network space. I always fear I will forget the fundamentals. Thus, during some of my evenings, weekends, and travels, I have been reviewing Sutton and Barto, along with other foundational textooks in similar fields. (I should probably update my old blog post about “friendly” textbooks!)

Sutton and Barto’s book is the standard textbook in reinforcement learning, and for good reason. It is relatively easy to read, and provides sufficient justification and background for the algorithms and concepts presented. The organization is solid. Finally, it has thankfully been updated in 2018 to reflect more recent developments. To be clear: it is not a deep reinforcement learning textbook, but knowing basic reinforcement learning is a prerequisite before applying deep neural networks, so it is better to have one textbook devoted to foundations.

Thus far, I’ve read most of the first half of the book, which covers bandit problems, the Markov Decision Process (MDP) formulation, and methods for solving (tabular) MDPs via dynamic programming, Monte Carlo, and temporal difference learning.

I appreciated a review of bandit problems. I knew about the $k$-armed bandit problem from reading papers such as RL-squared, which is the one that Professor Abbeel usually presents at the start of his meta-RL talks, but it was nice to see it in a textbook. Bandit problems are probably as far from my research as an RL concept can get, despite how I think they are more widely used in industry than “true” RL problems, but nonetheless I think I’ll briefly discuss them here because why not?

Suppose we have an agent which is taking actions in an environment. There are two cases:

  • The agent’s action will not affect the distribution of the subsequent situation it sees. This is a bandit problem. (I use “situation” to refer to both states and the reward distribution in $k$-armed bandit problems.) These can further be split up as nonassociative or associative. In the former, there is only one situation in the environment. In the latter, there are multiple situations, and this is often referred to as contextual bandits. A simple example would be if an environment has several $k$-armed bandits, and at each time, one of them is drawn at random. Despite the seemingly simplicity of the bandit problem, there is already a rich exploration-exploitation problem because the agent has to figure out which of $k$ actions (“arms”) to pull. Exploitation is optimal if we have one time step left, but what if we have 1000 left? Fortunately, this simple setting allows for theory and extensive numerical simulations.

  • The agent’s action will affect the distribution of subsequent situations. This is a reinforcement learning problem.

If the second case above is not true for a given task, then do not use RL. A lot of problems can be formulated as RL — I’ve seen cases ranging from protein folding to circuit design to compiler optimization — but that is different from saying that all problems make sense in a reinforcement learning context.

We now turn to reinforcement learning. I list some of the relevant notation and equations. As usual, when reading the book, it’s good practice to try and work out the definitions of equations before they are actually presented.

  • They use $R_{t+1}$ to indicate the reward due to the action at time $t$, i.e., $A_t$. Unfortunately, as they say, both conventions are used in the literature. I prefer $R_t$ as the reward at time $t$, partially because I think it’s the convention at Berkeley. Maybe people don’t want to write another “+1” in LaTeX.

  • The lowercase “$r$” is used to represent functions, and it can be a function of state-action pairs $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, or state-action-state triples $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$, where the second state is the successor state. I glossed over this in my August 2015 post on MDPs, where I said: “In general, I will utilize the second formulation [the $r(s,a,s’)$ case], but the formulations are not fundamentally different.” Actually, what I probably should have said is that either formulation is valid and the difference likely comes down to whether it “makes sense” for a reward to directly depend on the successor state.

    In OpenAI gym-style implementations, it can go either way, because we usually call something like: new_obs, rew, done, info = env.step(action), so the new observation new_obs and reward rew are returned simultaneously. The environment code therefore decides whether it wants to make use of the successor state or not in the reward computation.

    In Sutton and Barto’s notation, the reward function can be this for the first case:

    \[r(s,a) = \mathbb{E}\Big[ R_t \mid S_{t-1}=s, A_{t-1}=a \Big] = \sum_{r \in \mathcal{R}} r \cdot p(r | s,a) = \sum_{r \in \mathcal{R}} r \cdot \sum_{s' \in \mathcal{S}} p(s', r | s,a)\]

    where we first directly apply the definition of a conditional expectation, and then do an extra marginalization over the $s’$ term because Sutton and Barto define the dynamics in terms of the function $p(s’,r | s,a)$ rather than the $p(s’|s,a)$ that I’m accustomed to using. Thus, I used the function $p$ to represent (in a slight abuse of notation) the probability mass function of a reward, or reward and successor state combination.

    Similarly, the second case can be written as:

    \[r(s,a,s') = \mathbb{E}\Big[ R_t \mid S_{t-1}=s, A_{t-1}=a, S_{t}=s' \Big] = \sum_{r \in \mathcal{R}} r \cdot \frac{p(s', r | s,a)}{p(s' | s,a)}\]

    where now we have the $s’$ given to us, so there’s no need to sum over it. If we are summing over possible reward values, we will also use $r$.

  • The expected return, which the agent wants to maximize, is $G_t$. I haven’t seen this notation used very often, and I think I only remember it because it appeared in the Rainbow DQN paper. Most papers just write something similar to $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^tR_t]$, where the sum starts at 0 because that’s where the agent starts.

    Formally, we have:

    \[G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k\]

    where we might have $T=\infty$, or $\gamma=1$, but both cannot be true. This is their notation for combining episodic and infinite-horizon tasks.

    Suppose that $T=\infty$. Then we can write the expected return $G_t$ in a recursive fashion:

    \[G_t = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots) = R_{t+1} + \gamma G_{t+1}.\]

    From skimming various proofs in RL papers, recursion frequently appears, so it’s probably a useful skill to master. The geometric series is also worth remembering, particularly when the reward is a fixed number at each time step, since then there is a sum and a “common ratio” of $\gamma$ between successive terms.

  • Finally, we have the all important value function in reinforcement learning. These are usually state values or state-action values, but others are possible, such as advantage functions. The book’s notation is to use lowercase letters, i.e.: $v_\pi(s)$ and $q_\pi(s,a)$ for state and state-value functions. Sadly, the literature often uses $V_\pi$ and $Q_\pi(s,a)$ instead, but as long as we know what we’re talking about, the notation gets abstracted away. These functions are:

    \[v_\pi(s) = \mathbb{E}_\pi\Big[G_t \mid S_t=s\Big]\]

    for all states $s$, and

    \[q_\pi(s,a) = \mathbb{E}_\pi\Big[G_t \mid S_t=s, A_t=a\Big]\]

    for all states and action pairs $(s,a)$. Note the need to have $\pi$ under the expectation!

That’s all I will bring up for now. I encourage you to check the book for a more complete treatment of notation.

A critical concept to understand in reinforcement learning is the Bellman equation. This is a recursive equation that defines a policy with respect to itself, effectively providing a “self consistency” condition (if that makes sense). We can write the Bellman equation for the most interesting policy, the optimal one $\pi_*(s)$, as

\[\begin{align} v_*(s) &= \mathbb{E}_{\pi_*}\Big[G_t \mid S_t=s\Big] \\ &{\overset{(i)}=}\;\; \mathbb{E}_{\pi_*}\Big[R_{t+1} + \gamma G_{t+1} \mid S_t=s\Big] \\ &{\overset{(ii)}=}\; \sum_{a,r,s'} p(a,r,s'|s) \Big[r + \gamma \mathbb{E}_{\pi_*}[G_{t+1} \mid S_{t+1}=s']\Big] \\ &{\overset{(iii)}=}\; \sum_{a} \pi_*(a|s) \sum_{r,s'} p(r,s'|a,s) \Big[r + \gamma \mathbb{E}_{\pi_*}[G_{t+1} \mid S_{t+1}=s']\Big] \\ &{\overset{(iv)}=}\; \max_{a} \sum_{r,s'} p(r,s'|a,s) \Big[r + \gamma \mathbb{E}_{\pi_*}[G_{t+1} \mid S_{t+1}=s']\Big] \\ &{\overset{(v)}=}\; \max_{a} \sum_{s',r} p(r,s'|s,a) \Big[r + \gamma v_*(s')\Big] \end{align}\]


  • in (i), we apply the recurrence on $G_t$ as described earlier.
  • in (ii), we convert the expectation into its definition in the form of a sum over all possible values of the probability mass function $p(a,r,s’|s)$ and the subsequent value being taken under the expectation. The $r$ is now isolated and we condition on $s’$ instead of $s$ since we’re dealing with the next return $G_{t+1}$.
  • in (iii) we use the chain rule of probability to split the density $p$ into the policy $\pi_*$ and the “rest of” $p$ in an abuse of notation (sorry), and then push the sums as far to the right as possible.
  • in (iv) we use the fact that the optimal policy will take only the action that maximizes the value of the subsequent expression, i.e., the expected value of the reward plus the discounted value after that.
  • finally, in (v) we convert the $G_{t+1}$ into the equivalent $v_*(s’)$ expression.

In the above, I use $\sum_{x,y}$ as shorthand for $\sum_x\sum_y$.

When trying to derive these equations, I think the tricky part comes when figuring out when it’s valid to turn a random variable (in capital letters) into one of its possible instantiations (a lowercase letter). Here, we’re dealing with policies that determine an action given a state. The environment subsequently generates a return and a successor state, so these are the values we can sum over (since we assume a discrete MDP). The expected return $G_t$ cannot be summed over and must remain inside an expectation, or converted to an equivalent definition.

In the following chapter on dynamic programming techniques, the book presents the policy improvement theorem. It’s one of the few theorems with a proof in the book, and relies on similar “recursive” techniques as shown in the Bellman equation above.

Suppose that $\pi$ and $\pi’$ are any pair of deterministic policies such that, for all states $s \in \mathcal{S}$, we have $q_\pi(s,\pi’(s)) \ge v_\pi(s)$. Then the policy $\pi’$ is as good as (or better than) $\pi$, which equivalently means $v_{\pi’}(s) \ge v_\pi(s)$ for all states. Be careful about noticing which policy is under the value function.

The proof starts from the given and ends with the claim. For any $s$, we get:

\[\begin{align} v_\pi(s) &\le q_\pi(s,\pi'(s)) \\ &{\overset{(i)}=}\;\; \mathbb{E}\Big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=\pi'(s) \Big] \\ &{\overset{(ii)}=}\; \mathbb{E}_{\pi'}\Big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s \Big] \\ &{\overset{(iii)}\le}\; \mathbb{E}_{\pi'}\Big[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t=s \Big] \\ &{\overset{(iv)}=}\; \mathbb{E}_{\pi'}\Big[ R_{t+1} + \gamma \cdot \mathbb{E}\Big[R_{t+2} + \gamma v_\pi(S_{t+2}) \mid S_{t+1}, A_{t+1}=\pi'(S_{t+1})\Big] \mid S_t=s \Big] \\ &{\overset{(v)}=}\;\; \mathbb{E}_{\pi'}\Big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t=s \Big] \\ &{\overset{(vi)}\le}\; \mathbb{E}_{\pi'}\Big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_\pi(S_{t+3}) \mid S_t=s \Big] \\ &\vdots \\ &\le\; \mathbb{E}_{\pi'} \Big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3R_{t+4} + \cdots \mid S_t = s\Big] \\ &=\; v_{\pi'}(s)... \\ \end{align}\]


  • in (i) we expand the left hand side by definition, and in particular, the action we condition on for the Q-values are from $\pi’(s)$. I’m not doing an expectation w.r.t. a given policy because we have the action already given to us, hence the “density” here is from the environment dynamics.
  • in (ii) we remove the conditioning on the action in the expectation, and make the expectation w.r.t. the policy $\pi’$ now. Intuitively, this is valid because by taking an expectation w.r.t. the (deterministic) $\pi’$, given that the state is already conditioned upon, the policy will deterministically provide the same action $A_t=\pi’(s)$ as in the previous line. If this is confusing, think of the expectation under $\pi’$ as creating an outer sum $\sum_{a}\pi’(a|s)$ before the rest of the expectation. However, since $\pi’$ is deterministic, it will be equal to one only under one of the actions, the “$\pi’(s)$” we’ve been writing.
  • in (iii) we apply the theorem’s assumption.
  • in (iv) we do a similar thing as (i) by expanding $q_\pi$, and conditioning on random variables rather than a fixed instantiation $s$ since we are not given one.
  • in (v) we apply a similar trick as earlier, by moving the conditioning on the action under the expectation, so that the inner expectation turns into “$\mathbb{E}_{\pi’}$”. To simplify, we move the nner expectation out to merge with the outermost expectation.
  • in (vi) we recursively expand based on the inequality of (ii) vs (v).
  • then finally, after repeated application, we get to the claim.

One obvious implication of the proof above is that, if we have two policies that are exactly the same, except for one state where $\pi’(s) \ne \pi(s)$, then if the condition holds in the theorem above, $\pi’$ is a strictly better policy.

The generalized policy iteration subsection in the same chapter is worth reading. It describes, in one page, the general idea of learning policies via interaction between policy evaluation and policy improvement.

I often wished the book had more proofs of its claims, but then I realized it wouldn’t be suitable as an introduction to reinforcement learning. For the theory, I’m going through Chapter 6 of Dynamic Programming and Optimal Control by Dimitri P. Bertsekas.

It’s a pleasure to review Sutton and Barto’s book and compare how much more I know now than I did when first studying reinforcement learning in a clumsy on-and-off way from 2013 to 2016. Coming up next will be, I promise, discussion of the more technical and challenging concepts in the textbook.