An interesting paper that I am reading is Off-Policy Deep Reinforcement Learning without Exploration. You can find the latest version on arXiv, where it clearly appears to be under review for ICML 2019. An earlier version was under review at ICLR 2019 under the earlier title Where Off-Policy Deep Reinforcement Learning Fails. I like the research contribution of the paper, as it falls in line with recent work on how to make deep reinforcement learning slightly more practical. In this case, “practical” refers to how we have a batch of data, from perhaps a simulator or an expert, and we want to train an agent to learn from it without exploration, which would do wonders for safety and sample efficiency.

As is clear from the abstract, the paper introduces the batch-constrained RL algorithm:

We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.

This is clear. We want the set of states the agent experiences to be similar to the set of states from the batch, which might be from an expert (for example). This reminded me of the DART paper (expanded in a BAIR Blog post) that the AUTOLAB developed:

• DART is about applying noise to expert states, so that behavior cloning can see a “wider” distribution of states. This was an imitation learning paper, but the general theme of increasing the variety of states seen has appeared in past reinforcement learning research.
• This paper, though, is about restricting the actions so that the states the agent sees match those of the expert’s by virtue of taking similar actions.

Many of the most successful modern (i.e., “deep”) off-policy algorithms use some variant of experience replay, but the authors claim that this only works when the data in the buffer is correlated with the data induced by the current agent’s policy. This does not work if there is what the authors define as extrapolation error, which is when there is a mismatch between the two datasets. Yes, I agree. Though experience replay is actually designed to break correlation among samples, the most recent information is put into the buffer, bumping older stuff out. By definition, that means some of the data in the experience replay is correlated with the agent’s policy.

But more generally, we might have a batch of data where nothing came from the current agent’s policy. The more I think about it, the more an action restriction makes sense. With function approximation, unseen state-action pairs $(s,a)$ might be more or less attractive than seen pairs. But, aren’t there more ways to be bad than there are to be good? That is, it’s easy to get terrible reward in environments, but harder to get the highest reward, which one can verify by mathematically assigning the probabilities of each random sequence of actions. This paper is about restricting the actions so that we keep funneling the agent towards the high-quality states in the batch.

To be clear, here’s what “batch reinforcement learning” means, and its advantages:

Batch reinforcement learning, the task of learning from a fixed dataset without further interactions with the environment, is a crucial requirement for scaling reinforcement learning to tasks where the data collection procedure is costly, risky, or time-consuming.

You can also view this through the lens of imitation learning, because the simplest form, behavior cloning, does not require environment interaction.1 Furthermore, one of the fundamental aspects of reinforcement learning is precisely environment interaction! Indeed, this paper benchmarks with behavior cloning, and freely says that “Our algorithm offers a unified view on imitation and off-policy learning.”2

Let’s move on to the technical and algorithmic contribution, because I’m rambling too much. Their first foray is to try and redefine the Bellman operator in finite, discrete MDPs in the context of reducing extrapolation error so that the induced policy will visit the state-action pairs that more closely correspond with the distribution of state-action pairs from the batch.

A summary of the paper’s theory is that batch-constrained learning still converges to an optimal policy for deterministic MDPs. Much of the theory involves redefining or inducing a new MDP based on the batch, and then deferring to standard Q-learning theory. I wish I had time to go through some of those older classical papers, such as this one.

For example, the paper claims that normal Q-learning on the batch of data will result in an optimal value function for an alternative MDP, $M_{\mathcal{B}}$, based on the batch $\mathcal{B}$. A related and important definition is the tabular extraploation error $\epsilon_{\rm MDP}$, defined as discrepancy between the value function computed with the batch versus the value function computed with the true MDP $M$:

$\epsilon_{\rm MDP}(s,a) = Q^\pi(s,a) - Q_{\mathcal{B}}^\pi(s,a)$

This can be computed recursively using a Bellman-like equation (see the paper for details), but it’s easier to write as:

$\epsilon_{\rm MDP}^\pi = \sum_{s} \mu_\pi(s) \sum_a \pi(a|s) |\epsilon_{\rm MDP}(s,a)|$

By using the above, they are able to derive a new algorithm: Batch-Constrained Q-learning (BCQL) which restricts the possible actions to be in the batch:

$Q(s,a) \leftarrow (1 - \alpha ) Q(s,a) + \alpha\left( r + \gamma \left\{ \max_{a' \;{\rm s.t.}\; (s',a') \in \mathcal{B}} Q(s',a') \right\} \right)$

Next, let’s introduce their practical algorithm for high-dimensional, continuous control: Batch-Constrained deep Q-learning (BCQ). It utilizes four parameterized networks.

• A Generative model $G_\omega(s)$ which, given the state as input, produces an action. Using a generative model this way assumes we pick actions using:

$\operatorname*{argmax}_{a} \;\; P_{\mathcal{B}}^G(a|s)$

or in other words, the most likely action given the state, with respect to the data in the batch. This is difficult to model in high dimensional continuous control environments, so they approximate it with a variational autoencoder. This is trained along with the policy parameters during each for loop iteration.

• A Perturbation model $\xi_\phi(s,a,\Phi)$ which aims to “optimally perturb” the actions, so that they don’t need to sample too much from $G_\omega(s)$. The perturbation applies noise in $[-\Phi,\Phi]$. It is updated via a deterministic policy gradient rule:

$\phi \leftarrow \operatorname*{argmax}_\phi \;\; \sum_{(s,a) \in \mathcal{B}} Q_\theta\Big( s, a+\xi_\phi(s,a,\Phi)\Big)$

The above is a maximization problem over a sum of Q-function terms. The Q-function is differentiable as we parameterize it with a deep neural network, and stochastic gradient descent methods will work with stochastic inputs. I wonder, is the perturbation model overkill? Is it possible to do a cross entropy method, like what two of these papers do for robotic grasping?

• Two Q-networks $Q_{\theta_1}(s,a)$ and $Q_{\theta_2}(s,a)$, to help push their policy to select actions that lead to “more certain data.” This is based on their ICML paper last year, which proposed the (now popular!) Twin-Delayed DDPG (TD3) algorithm. OpenAI’s SpinningUp has a helpful overview of TD3.

All networks other than the generative model also consist of target networks, following standard DDPG practices.

All together, their algorithm uses this policy:

$\pi(s) = \operatorname*{argmax}_{a_i+\xi_\phi(s,a_i,\Phi)} \;\; Q_\theta\Big(s, a_i+\xi_\phi(s,a_i,\Phi)\Big) \quad \quad \{a_i \sim G_\omega(s) \}_{i=1}^n$

To be clear, they approximate this maximization by sampling $n$ actions each time step, and picking the best one. The perturbation model, as stated earlier, increases the diversity of the sampled actions. Once again, it would be nice to confirm that this is necessary, such as via an experiment that shows the VAE collapses to a mode. (I don’t see justification in the paper or the appendix.)

There is a useful interpretation of how this algorithm is a continuum between behavior cloning (if $n=1$ and $\Phi=0$) and Q-learning ($n\to \infty$ and $\Phi \to a_{\rm max}-a_{\rm min}$).

All right, that was their theory and algorithm — now let’s discuss the experiments. They test with DDPG under several different conditions. They assume that there is a “behavioral DDPG” agent which generates the batch of data, for which an “off-policy DDPG” agent learns from, without exploration. Their goal is to improve the learning of the “off-policy DDPG.” (Don’t get confused with the actor-critic framework of normal DDPG … just think of the behavioral DDPG as the thing that generates the batch in “batch-constrained RL.”)

• Final Buffer. They train the behavioral DDPG agent from scratch for 1 million steps, adding more noise than usual for extra exploration. Then all of its experience is pooled inside an experience replay. That’s the “batch”. Then, they use it to train the off-policy DDPG agent. That off-policy agent does not interact with the environment — it just draws samples from the buffer. Note that this will result in widespread state coverage, including potentially the early states when the behavioral agent was performing poorly.

• Concurrent. This time, as the behavioral DDPG agent learns, the off-policy one learns as well, using data from the behavioral agent. Moreover, the original behavioral DDPG agent is also learning from the same data, so both agents learn from identical datsets. To be clear: this means the agents have almost identical training settings. The only differences I can think of are: (a) noise in initial parameters and (b) noise in minibatch sampling. Is that it?

• Imitation. After training the behavioral DDPG agent, they run it for 1 million steps. Those experiences are added to the buffer, from which the off-policy DDPG agent learns. Thus, this is basically the imitation learning setting.

• Imperfect Demonstrations. This is the same as the “imitation” case, except some noise is added to the data, through Gaussian noise on the states and randomness in action selection. Thus, it’s like adding more coverage to an expert data.

The experiments use … MuJoCo. Argh, we’re still using it as a benchmark. They test with HalfCheetah-v1, Hopper-v1, and Walker2d-v1. Ideally there would be more, at least in the main part of the paper. The Appendix has some limited Pendulum-v0 and Reacher-v1 results. I wonder if they tried on Humanoid-v1.

They actually performed some initial experiments before presenting the theory, which justifies the need to correct for extrapolation error. The most surprising fact there was that the off-policy DDPG agent failed to match the behavioral agent even in the concurrent learning paradigm. That’s surprising!

This was what motivated their Batch-Constrained deep Q-learning (BCQ) algorithm, discussed above.

As for their results, I am a little confused after reading Figure 2. They say that:

Only BCQ matches or outperforms the performance of the behavioral policy in all tasks.

Being color-blind, the BCQ and VAE-BC colors look indistinguishable to me. (And the same goes for the DQN and DDPG baselines, which look like they are orange and orange, respectively.) I wish there was better color contrast, perhaps with light purple and dark blue for the former, and yellow and red for the latter. Oh well. I assume that their BCQ curve is the highest one on the rewards plot … but this means it’s not that much better than the baselines on Hopper-v1 except for the imperfect demonstrations task. Furthermore, the shaded area is only half of a standard deviation, rather than one. Finally, in the imitation task, simple behavior cloning was better. So, it’s hard to tell if these are truly statistically significant results.

While I wish the results were more convincing, I still buy the rationale of their algorithm. I believe it is a valuable contribution to the research community.

1. More advanced forms of imitation learning might require substantial environment interaction, such as Generative Adversarial Imitation Learning. (My blog post about that paper is here.)

2. One of the ICLR reviewers brought up that this is more of an imitation learning algorithm than it is a reinforcement learning one …