Combining Imitation Learning and Reinforcement Learning Using DQfD

Imitation Learning (IL) and Reinforcement Learning (RL) are often introduced as similar, but separate problems. Imitation learning involves a supervisor that provides data to the learner. Reinforcement learning means the agent has to explore in the environment to get feedback signals. This crude categorization makes sense as a start, but as with many things in life, the line between them is blurry.

I am interested in both learning problems, but am probably even more fascinated about figuring out how best to merge the techniques to get the best of both words. A landmark paper in the combination of imitation learning and reinforcement learning is DeepMind’s Deep Q-Learning from Demonstrations (DQfD), which appeared at AAAI 2018. (The paper was originally called Learning from Demonstrations for Real World Reinforcement Learning” in an earlier version, and somewhat annoyingly, follow-up work has cited both versions of the title.) In this post, I’ll review the DQfD algorithm and the most relevant follow-up work.

The problem setting of DQfD is when data is available from a demonstrator (or “expert”, or “teacher”) for some task, and using this data, we want to train a new “learner” or “student” agent to learn faster as compared to vanilla RL. DeepMind argues that this is often the case, or at least more common than when a perfectly accurate simulator is available:

While accurate simulators are difficult to find, most of these problems have data of the system operating under a previous controller (either human or machine) that performs reasonably well.

Fair enough. The goal of DQfD is to use this (potentially suboptimal!) demonstration data to pre-train an agent, so that it can then perform RL right away and be effective. But the straightforward solution of performing supervised learning on the demonstration data and then applying RL is not ideal. One reason is that it makes sense to use the demonstration data continuously throughout training as needed, rather than ignoring it. This implies that the agent still needs to do supervised learning during the RL phase. The RL phase, incidentally, should allow the learner to eventually out-perform the agent who provided the data. Keep in mind that we also do not want to force a supervisor to be present to label the data, as in DAGGER — we assume that we are stuck with the data we have at the start.

So how does this all work? DQfD cleverly defines a loss function $J(Q)$ based on the current $Q$ network, and applies it in its two phases.

In Phase I, the agent does not do any exploration. Using the demonstration data, it performs gradient descent to minimize $J(Q)$. DQfD, like DQN, uses an experience replay buffer which stores the demonstration data. This means DQfD is off-policy as the data comes from the demonstrator, not the agent’s current behavioral policy. (Incidentally, I often refer to “teacher samples in a replay buffer” to explain to people why DQN and Q-Learning methods are off-policy.) The demonstration data is sampled with prioritization using prioritized experience replay.
In Phase II: the agent explores in the environment to collect its own data. It is not explicitly stated in the paper, but upon communication with the author, I know that DQfD follows an $\epsilon$-greedy policy but keeps $\epsilon=0.01$, so effectively there is no exploration and the only reason for “0.01” is to ensure that the agent cannot get stuck in the Atari environments. Key to this phase is that the agent still makes use of the demonstration data. That is, the demonstration data does not go away, and is still occasionally sampled with prioritization along with the new self-generated data. Each minibatch may therefore consist of a mix of demonstration and self-generated data, and these are used for gradient updates to minimize the same loss function $J(Q)$.

The main detail to address is the loss function $J(Q)$, which is:

\[J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L_2}(Q)\]

where:

$J_{DQ}(Q)$ is the usual one-step Double DQN loss.
$J_n(Q)$ is the $n$-step (Double) DQN loss — from communication with the author, they still used the Double DQN loss, even though it doesn’t seem to be explicitly mentioned. Perhaps the Double DQN loss isn’t as important in this case, because I think having longer returns with a discount factor $\gamma$ decreases the need to have a highly accurate bootstrapped Q-value estimate.
$J_E(Q)$ is a large margin classification loss, which I sometimes refer to as “imitation” or “supervised” loss. It is defined as
\[J_E(Q) = \max_{a \in A}\Big[ Q(s,a) + \ell(a,a_E) \Big] - Q(s,a_E)\]
where $\ell(a,a_E)$ is a “margin” loss which is 0 if $a=a_E$ and positive otherwise (DQfD used 0.8). This loss is only applied on demonstrator data, where an $a_E$ actually exists!
$J_{L_2}(Q)$ is the usual $L_2$ regularization loss on the Q-network’s weights.

All these are combined into $J(Q)$, which is then minimized via gradient descent on minibatches of samples. In Phase I, the samples all come from the demonstrator. In Phase II, they could be a mixture of demonstrator and self-generated samples, depending on what gets prioritized. The lambda terms are hyperparameters to weigh different aspects of the loss function. DeepMind set them at $\lambda_1 = 1, \lambda_2 = 1$, and $\lambda_3 = 10^{-5}$, with the obvious correction factor of $\lambda_2=0$ if the sample in question is actually self-generated data.

The large margin imitation loss is simple, but is worth thinking about carefully. Why does it work?

Let’s provide some intuition. Suppose we have a dataset $\{s_0,a_0,s_1,a_1,\ldots\}$ from a demonstrator. For simplicity, suppress the time subscript over $a$ — it’s understood to be $a_t$ if we’re considering state $s_t$ in our Q-values. Furthermore, we’ll suppress $\theta$ in $Q_\theta(s,a)$, which represents the parameters of our Q-network, which tries to estimate the state-action values. If our action space has four actions, $\mathcal{A} = \{0,1,2,3\}$, then the output of the Q-network using the following size-3 minibatch (during training) is this $3\times 4$ matrix:

\[\begin{bmatrix} Q(s_0, 0) & Q(s_0, 1) & Q(s_0, 2) & Q(s_0, 3) \\ Q(s_1, 0) & Q(s_1, 1) & Q(s_1, 2) & Q(s_1, 3) \\ Q(s_2, 0) & Q(s_2, 1) & Q(s_2, 2) & Q(s_2, 3) \\ \end{bmatrix}.\]

You can think of DQfD as augmenting that matrix by the margin $\ell$, and then apply the max operator to get this during the training process:

\[\max\left( \begin{bmatrix} Q(s_0, 0) + \ell & Q(s_0, 1) + \ell & Q(s_0, 2) + \ell & Q(s_0, 3) \\ Q(s_1, 0) + \ell & Q(s_1, 1) & Q(s_1, 2) + \ell & Q(s_1, 3) + \ell \\ Q(s_2, 0) + \ell & Q(s_2, 1) +\ell & Q(s_2, 2) & Q(s_2, 3) + \ell \\ \end{bmatrix} \right) - \begin{bmatrix} Q(s_0, 3) \\ Q(s_1, 1) \\ Q(s_2, 2) \end{bmatrix}\]

where the “max” operator above is applied on each row of the matrix of Q-values. In the above, the demonstrator took action $a_0=3$ when it was in state $s_0$, action $a_1=1$ when it was in state $s_1$, and action $a_2=2$ when it was in state $s_2$. To reiterate, this is only applied on the demonstrator data.

The loss is designed like this to make the Q-values of non-demonstrator state-action values to be below the Q-values of the state-action values corresponding to what the demonstrator actually took. Let’s go through the three possible cases.

Suppose for a given state (e.g., $s_0$ from above), our Q-network thinks that some action (e.g., $a_0=0$) has the highest Q-value despite it not being what the expert actually took. Then
\[Q(s_0,0) + \ell - Q(s_0,3) > \ell\]
and our loss is high.
Suppose for the same state $s_0$, our Q-network thinks that the action the demonstrator took ($a_0=3$ in this case) has the highest Q-value. But, after adding $\ell$ to the other actions’ Q-values, suppose that the maximum Q-value is something else, i.e., $Q(s_0,k)$ but where $k \ne 3$. Then, the loss is in the range
\[Q(s_0,k)+\ell-Q(s_0,3) \in (0,\ell]\]
which is better than the previous case, but not ideal.
Finally, suppose we are in the previous case, but even after adding $\ell$ to the other Q-values, the Q-value corresponding to the state-action pair the demonstrator took is still the highest. Then the imitation loss is
\[Q(s_0,3)-Q(s_0,3)=0\]
which is clearly the minimum possible value.

An alternative to the imitation loss could be the cross entropy loss to try and make the learner choose the action most similar to what the expert took, but the DQfD paper argues that the imitation loss is better.

Besides the imitation loss, the other thing about the paper that I was initially unsure about was how to implement the prioritization for the 1-step and $n$-step returns. After communicating with the author, I see that probably the easiest way is to treated the 1 and $n$-step returns as separate entries in the replay buffer. That way, each of these items has independent sampling priorities.

Said another way, instead of our samples in the replay buffer only coming from tuples of the form:

\[(s_t, a_t, r_t, d_t, s_{t+1})\]

in DQfD, we will also see these terms below which constitute half of the replay buffer’s elements:

\[(s_t, a_{t}, r_t+\gamma r_{t+1}+\ldots + \gamma^{n-1}r_{t+n-1}, d_{t+n-1}, s_{t+n})\]

This makes sense, though it is tricky to implement. Above, I’ve included the “done masks” that are used to determine whether we should use a bootstrapped estimate or not. But keep in mind that if an episode actually ends sooner than $n$ steps ahead, the done mask must be applied at the point where the episode ends!

One final comment on replay buffers: if you’re using prioritized experience replay, it’s probably easiest to keep the demonstrator and self-generated data in the same experience replay buffer class, and then build on top of the prioritized buffer class as provided by OpenAI baselines. To ensure that the demonstrator data is never overwritten, you can always “reserve” the first $K$ indices in a list of the transition tuples to be for the teacher.

The paper’s experimental results show that DQfD learns faster initially (arguably one of the most critical periods) and often has higher asymptotic performance. You can see the results in the paper.

I like the DQfD paper very much. Otherwise, I would not have invested significant effort into understanding it. Nonetheless, I have at least two concerns:

The ablation studies seem very minor, or at least compared to the imitation loss experiments. I’m not sure why? It seems like DeepMind prioritized their other results, which probably makes sense. In theory, I think all components in the loss function $J(Q)$ matter, with the possible exception of the $L_2$ loss (see next point).
The main logic of the $L_2$ loss is to avoid overfitting on the demonstration data. But is that really the best way to prevent overfitting? And in RL here, presumably if the agent gets better and better, it doesn’t need the demonstration data any more. But then we would be effectively applying $L_2$ regularization on a “vanilla” RL agent, and that might be riskier — regularizing RL seems to be an open question.

DeepMind has extended DQfD in several ways. Upon a literature search, it seems like two relevant follow-up works are:

Distributed Prioritized Experience Replay, with the OpenReview link here for ICLR 2018. The main idea of this paper is to scale up the experience replay data by having many actors collect experience. Their framework is called Ape-X, and they claim that Ape-X DQN achieves a new state of the art performance on Atari games. This paper is not that particularly relevant to DQfD, but I include it here mainly because a follow-up paper (see below) used this technique with DQfD.
Observe and Look Further: Achieving Consistent Performance on Atari. This one appeared to have been rejected from ICLR 2019, unfortunately, and I’m a little concerned about the rationale. If the reviewers say that there are not enough experiments, then what does this mean for people who do not have as much compute? In any case, this paper proposes the Ape-X DQfD algorithm, which as one might expect combines DQfD with the distributed prioritized experience replay algorithm. DeepMind was able to increase discount factor to be $\gamma=0.999$, which they argue allows for an order of magnitude more downstream reward data. Some other hyperparameter changes: the large margin value increases to $\sqrt{0.999}$, and there is now no prioritization, just a fixed proportion of student-teacher samples in each minibatch. Huh, that’s interesting. I would like to understand this rationale better.

Combining IL and RL is a fascinating concept. I will keep reading relevant papers to remain at the frontiers of knowledge.

Update May 2019: I have a blog post which describes some follow-up work in more detail.