In my last post, I briefly mentioned that there were two relevant follow-up papers to the DQfD one: Distributed Prioritized Experience Replay (PER) and the Ape-X DQfD algorithm. In this post, I will briefly review them, along with another relevant follow-up, Kickstarting Deep Reinforcement Learning.

## Distributed PER

This is the epitome of a strong empirical Deep Reinforcement Learning paper. The main idea is to add many actors running in parallel to collect samples. The actors periodically pool their samples into a centralized data repository, which is used for experience replay for a centralized learner to update neural network parameters. Those parameters then get copied back to the actors. That’s it!

Here’s what they have to say:

In this paper we describe an approach to scaling up deep reinforcement learning by generating more data and selecting from it in a prioritized fashion (Schaul et al., 2016). Standard approaches to distributed training of neural networks focus on parallelizing the computation of gradients, to more rapidly optimize the parameters (Dean et al., 2012). In contrast, we distribute the generation and selection of experience data, and find that this alone suffices to improve results. This is complementary to distributing gradient computation, and the two approaches can be combined, but in this work we focus purely on data-generation.

My first question after reading the introduction was about how Distributed PER differs from their two most relevant prior works: distributed Deep RL, also known as the “Gorila” paper (Nair et al., 2015) and A3C (Mnih et al., 2016). As far as I can tell, here are the differences:

• The Gorila paper used a distributed experience replay buffer (and no prioritization). This paper uses a centralized buffer, with prioritization. There is also a slight algorithmic change as to when priorities are actually computed — they are initialized from the actors, not the learners, so that there isn’t too much priority focused on the most recent samples.

• The A3C paper used one CPU and had multithreading for parallelism, with one copy of the environment per thread. They did not use experience replay, arguing that the diversity of samples from the different environments alleviates the need for replay buffers.

That … seems to be it! The algorithm is not particularly novel. I am also not sure how using one CPU and multiple threads compares to using multiple CPUs, with one copy of the environment in each. Do any hardware experts want to chime in? The reason why I am wondering about this is because the Distributed PER algorithm gets massively better performance on the Atari benchmarks compared to prior work. So something must be working better than before. Well, they probably used far more CPUs; they use 360 in the paper (and only one GPU, interestingly).

They evaluate their framework, Ape-X, on DQN and DPG, so the algorithms are called Ape-X DQN and Ape-X DPG. The experiments are extensive, and they even benchmark with Rainbow! At the time of the paper submission to ICLR, Rainbow was just an arXiv preprint, under review at AAAI 2018, where (unsurprisingly) it got accepted. Thus, with Distributed PER being submitted to ICLR 2018, there was no ambiguity on who the authors were. :-)

The results suggest that Ape-X DQN is better than Rainbow … with DOUBLE (!!) the performance, and using half of the runtime! That is really impressive, because Rainbow was considered to be the most recent state of the art on Atari. As a caveat, it can be difficult to compare these algorithms, because with wall clock time, an algorithm with more parallelism (i.e., Ape-X versions) can get more samples, so it’s not clear if this is a fair comparison. Furthermore, Ape-X can certainly be combined with Rainbow. DeepMind only included a handful of Rainbow’s features in Ape-X DQN, so presumably with more of those features, they would get even better results.

Incidentally, when it comes to state of the art on Atari games, we have to be careful about what we mean. There are at least three ways you can define it:

• One algorithm, applied to each game independently, with no other extra information beyond the reward signal.
• One algorithm, applied to each game independently, but with extra information. This information might be demonstrations, as in DQfD.
• One algorithm, with ONE neural network, applied to all games in a dependent manner. Some examples are IMPALA (ICML 2018), PopArt (AAAI 2019) and Recurrent Distributed PER (ICLR 2019).

In the first two cases, we assume no hyperparameter tuning per game.

In this paper, we are using the first item above. I believe that state of the art in this domain proceeded as follows: DQN -> Double DQN (i.e., DDQN) -> Prioritized DDQN -> Dueling+Prioritized DDQN -> A3C -> Rainbow -> Distributed PER (i.e., Ape-X DQN). Is that it? Have there been any updates after this paper? I suppose I could leave out the dueling architectures one because that came at the same time the A3C paper did (ICML 2016), but I wanted to mention it anyway.

## Ape-X DQfD

The motivation for this paper was as follows:

Despite significant advances in the field of deep Reinforcement Learning (RL), today’s algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently. In this paper, we propose an algorithm that addresses each of these challenges and is able to learn human-level policies on nearly all Atari games.

The algorithm they propose is Ape-X DQfD. But, as suggested in the abstract, there are actually three main challenges they attempt to address, and the “DQfD” portion is mainly for the exploration one; the demonstrations reduce the need for the agent to randomly explore. The other algorithmic aspects do not appear to be as relevant to DQfD so I will not review them in my usual excess detail. Here they are, briefly:

• They propose a new Bellman operator with the goal of reducing the magnitude of the action-value function. This is meant to be a “better solution” than clipping rewards into $[-1,1]$. See the paper for the mathematical details; the main idea is to insert a function $h(z)$ and its inverse $h^{-1}(z)$ into the Bellman operator. I kind of see how this works, but it would be nice to understand the justification better.

• They want to reason over longer horizons, which they can do by increasing the discount factor from the usual $\gamma=0.99$ to $\gamma=0.999$. Unfortunately, this can cause instability since, according to them, it “[…] decreases the temporal difference in value between non-rewarding states.” I don’t think I understand this justification. I believe they argue that states $(x_i,a_i)$ and $(x_{i+1},a_{i+1})$ should not actually have the same value function, so in some odd sense, generalization of the value function is bad.

$L_{TC}\Big(\theta; \theta^{(k-1)} \Big) = \sum_{i=1}^N p_i \mathcal{L} \Big( f_{\theta}(x_i',a_i') - f_{\theta^{(k-1)}}(x_i',a_i') \Big)$

with $k \in \mathbb{N}$ the current iteration number.

Now let’s get to what I really wanted to see: the DQfD extensions. DeepMind applies the following algorithmic differences:

• They do not have a “pre-training” phase, so the agent learns from its self-generated data and the expert data right away.

• They only apply the imitation loss to the best expert data. Huh, that seems to partially contradict the DQfD result which argued that the imitation loss was a critical aspect of the algorithm. Another relevant change is to increase the margin value from 0.8 to $\sqrt{0.999}$. I am not sure why.

• They use many actors to collect experience in parallel. To be precise on the word “many,” they use 128 parallel actors.

• They only use a fixed 75-25 split of agent-to-expert data.

At first, I was surprised at the last change I wrote above, because I had assumed that adding priorities to both the agent and expert data and putting them in the same replay buffer would be ideal. After all, the Ape-X paper (i.e., the Distributed Prioritized Experience Replay paper that I just discussed) said that prioritization was the most important ingredient in Rainbow:

In an ablation study conducted to investigate the relative importance of several algorithmic ingredients (Hessel et al., 2017), prioritization was found to be the most important ingredient contributing to the agent’s performance.

But after a second read, I realized what the authors meant. They still use priorities for the expert and actor samples, but these are in separate replay buffers. They are then sampled and then combined together at the 75-25 ratio. They explain why in the rebuttal:

Simply adding more actors to the original DQfD algorithm is a poor choice of baseline. First, DQfD uses a single replay buffer and relies on prioritization to select expert transitions. This works fine when there is a single actor process feeding data into the replay buffer. However, in a distributed setup (with 128 actors), the ratio of expert to actor experience drops significantly and prioritization no longer exposes the learner to expert transitions. Second, DQfD uses a pre-training phase during which all actor processes are idle. This wastes a lot of compute resources when running a decent number of actors. Many of our contributions help eliminate these shortcomings.

This makes sense. Did I mention that I really like ICLR’s OpenReview system?

The algorithm named, as mentioned earlier, is Ape-X DQfD because it uses ideas from the Ape-X paper, titled “Distributed Prioritized Experience Replay” (ICLR 2018). Yeah, more distributed systems!! I am seeing a clear trend.

What do the experiments show and suggest? Ablation studies are conducted on six carefully-selected games, and then more extensive results are reported with the same 42 (not 57) games from DQfD for which they have demonstrator data. It looks better than prior methods! Still, this is with the usual caveats that it’s near-impossible to reproduce these even if we had the same computational power as DeepMind.

In addition, with DQfD, we assume privileged information in the form of expert transitions, so it might not be fair to evaluate it with other algorithms that only get data from exploration, such as Rainbow DQN or Ape-X DQN.

After reading this paper, there are still some confusing aspects. I agree with one of the reviewer comments that the addition of demonstrations (forming Ape-X DQfD) feels like an orthogonal contribution. Overall, the paper looks like it was a combination of some techniques, but perhaps was not as rigorous as desired? Hopefully DeepMind will expand Ape-X DQfD in its own paper, as I would be interested in seeing their follow-up work.

## Kickstarting Deep Reinforcement Learning

Given the previous discussion on DQfD and combining IL with RL, I thought it would be worthwhile to comment on the Kickstarting Deep Reinforcement Learning paper. It’s closely related; again, we assume we have some pre-trained “expert” policy, and we hope to accelerate the learning process of a new “student” agent via a combination of IL and RL. This paper appears to be the policy gradient analogue to DQfD. Rather than use a shared experience replay buffer, the student incorporates an extra policy distillation loss to the normal RL objective of maximizing reward. Policy distillation (see my earlier blog post about that here) cannot be the sole thing the agent considers. Otherwise, if the agent did not consider the usual environment reward, it would be unlikely to surpass performance of the teacher.

At a high level, here’s how DeepMind describes it:

The main idea is to employ an auxiliary loss function which encourages the student policy to be close to the teacher policy on the trajectories sampled by the student. Importantly, the weight of this loss in the overall learning objective is allowed to change over time, so that the student can gradually focus more on maximising rewards it receives from the environment, potentially surpassing the teacher (which might indeed have an architecture with less learning capacity).

The paper also says one of the two main prior works is “population-based training” but that part is arguably more of an orthogonal contribution, as it is about how to effectively use many agents training in parallel to find better hyperparameters. Thus, I will focus only on the auxiliary loss function aspect.

Here is a generic loss function that a “kickstarted” student would use:

$\mathcal{L}_{\rm kick}^k(\omega, x, t) = \mathcal{L}_{\rm RL}(\omega, x, t) + \lambda_k H(\pi_T(a | x_t) \| \pi_S(a | x_t, \omega))$

where $\mathcal{L}_{\rm RL}(\omega, x, t)$ is the usual RL objective formulated as a loss function, $H(\pi_T(a | x_t) || \pi_S(a | x_t, \omega))$ is the cross entropy between a student and a teacher’s output layers, and $\lambda_k$ is a hyperparameter corresponding to optimization iteration $k$, which decreases over the course of training so that the students gradually focuses less on imitating the teacher. Be careful about the notation: $x$ implies a trajectory generated by the student (i.e., the “behavioral” policy) whereas $x_t$ is a single state. In the context of the above notation, it is a state from a teacher trajectory that was generated ahead of time. We know this because it is inside the cross entropy term.

This is a general framework, so there are more specific instances of it depending on which RL algorithm is used. For example, here is how to use a kickstarted A3C loss:

\begin{align} \mathcal{L}_{\rm a3c-kick}^k(\omega, x, t) &= (r_t + \gamma v_{t+1} - V(x_t | \theta)) \log \pi_S(a_t|x_t, \omega) - \beta H(\pi_S(a|x_t,\omega)) + \\ & \lambda_k H(\pi_T(a | x_t) \| \pi_S(a | x_t, \omega)) \end{align}

where here, $H$ is now used as simply the policy entropy (first time) of the student, and then the cross entropy (second time). As usual, $V$ is a parameterized value function which plays the role of the critic. Unfortunately, the notation for the states can be a bit confusing and conflates the student and teacher states from their respective trajectories. If I were in charge of the notation, I would use $x_t^{S}$ and $x_t^{T}$ for student and teacher states, respectively.

In practice, to get better results, DeepMind uses the Importance Weighted Actor-Learner Architecture (IMPALA) agent, which extends actor-critic methods to a distributed setting with many workers. Does anyone see a trend here?

The experiments are conducted on the DMLab-30 suite of environments. Unfortunately I don’t have any intuition on this benchmark. It would be nice to test this out but the amount of compute necessary to even reproduce the figures they have is insane.

Since it was put on arXiv last year, DeepMind has not updated the paper. There is still a lot to explore here, so I am not sure why — perhaps it was rejected by ICML 2018, and then the authors decided to move on to other projects? More generally, I am interested in seeing if we have explored the limits of accelerating student training via a set of teacher agents.

# Concluding Thoughts

This post reviewed three related works to the Deep Q-Learning from Demonstrations paper. The trends that I can see are this:

• If we have expert (more generally, “demonstrator”) data, we should use it. But we have to take care to ensure that our agents can effectively utilize the expert data, and eventually surpass expert performance.

• Model-free reinforcement learning is incredibly sample inefficient. Thus, distributed architectures — which can collect many samples in parallel — result in the best performance in practice.