Batch Constrained Deep Reinforcement Learning
An interesting paper that I am reading is OffPolicy Deep Reinforcement Learning without Exploration. You can find the latest version on arXiv, where it clearly appears to be under review for ICML 2019. An earlier version was under review at ICLR 2019 under the earlier title Where OffPolicy Deep Reinforcement Learning Fails. I like the research contribution of the paper, as it falls in line with recent work on how to make deep reinforcement learning slightly more practical. In this case, “practical” refers to how we have a batch of data, from perhaps a simulator or an expert, and we want to train an agent to learn from it without exploration, which would do wonders for safety and sample efficiency.
As is clear from the abstract, the paper introduces the batchconstrained RL algorithm:
We introduce a novel class of offpolicy algorithms, batchconstrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to onpolicy with respect to a subset of the given data.
This is clear. We want the set of states the agent experiences to be similar to the set of states from the batch, which might be from an expert (for example). This reminded me of the DART paper (expanded in a BAIR Blog post) that the AUTOLAB developed:
 DART is about applying noise to expert states, so that behavior cloning can see a “wider” distribution of states. This was an imitation learning paper, but the general theme of increasing the variety of states seen has appeared in past reinforcement learning research.
 This paper, though, is about restricting the actions so that the states the agent sees match those of the expert’s by virtue of taking similar actions.
Many of the most successful modern (i.e., “deep”) offpolicy algorithms use some variant of experience replay, but the authors claim that this only works when the data in the buffer is correlated with the data induced by the current agent’s policy. This does not work if there is what the authors define as extrapolation error, which is when there is a mismatch between the two datasets. Yes, I agree. Though experience replay is actually designed to break correlation among samples, the most recent information is put into the buffer, bumping older stuff out. By definition, that means some of the data in the experience replay is correlated with the agent’s policy.
But more generally, we might have a batch of data where nothing came from the current agent’s policy. The more I think about it, the more an action restriction makes sense. With function approximation, unseen stateaction pairs $(s,a)$ might be more or less attractive than seen pairs. But, aren’t there more ways to be bad than there are to be good? That is, it’s easy to get terrible reward in environments, but harder to get the highest reward, which one can verify by mathematically assigning the probabilities of each random sequence of actions. This paper is about restricting the actions so that we keep funneling the agent towards the highquality states in the batch.
To be clear, here’s what “batch reinforcement learning” means, and its advantages:
Batch reinforcement learning, the task of learning from a fixed dataset without further interactions with the environment, is a crucial requirement for scaling reinforcement learning to tasks where the data collection procedure is costly, risky, or timeconsuming.
You can also view this through the lens of imitation learning, because the simplest form, behavior cloning, does not require environment interaction.^{1} Furthermore, one of the fundamental aspects of reinforcement learning is precisely environment interaction! Indeed, this paper benchmarks with behavior cloning, and freely says that “Our algorithm offers a unified view on imitation and offpolicy learning.”^{2}
Let’s move on to the technical and algorithmic contribution, because I’m rambling too much. Their first foray is to try and redefine the Bellman operator in finite, discrete MDPs in the context of reducing extrapolation error so that the induced policy will visit the stateaction pairs that more closely correspond with the distribution of stateaction pairs from the batch.
A summary of the paper’s theory is that batchconstrained learning still converges to an optimal policy for deterministic MDPs. Much of the theory involves redefining or inducing a new MDP based on the batch, and then deferring to standard Qlearning theory. I wish I had time to go through some of those older classical papers, such as this one.
For example, the paper claims that normal Qlearning on the batch of data will result in an optimal value function for an alternative MDP, $M_{\mathcal{B}}$, based on the batch $\mathcal{B}$. A related and important definition is the tabular extraploation error $\epsilon_{\rm MDP}$, defined as discrepancy between the value function computed with the batch versus the value function computed with the true MDP $M$:
\[\epsilon_{\rm MDP}(s,a) = Q^\pi(s,a)  Q_{\mathcal{B}}^\pi(s,a)\]This can be computed recursively using a Bellmanlike equation (see the paper for details), but it’s easier to write as:
\[\epsilon_{\rm MDP}^\pi = \sum_{s} \mu_\pi(s) \sum_a \pi(as) \epsilon_{\rm MDP}(s,a)\]By using the above, they are able to derive a new algorithm: BatchConstrained Qlearning (BCQL) which restricts the possible actions to be in the batch:
\[Q(s,a) \leftarrow (1  \alpha ) Q(s,a) + \alpha\left( r + \gamma \left\{ \max_{a' \;{\rm s.t.}\; (s',a') \in \mathcal{B}} Q(s',a') \right\} \right)\]Next, let’s introduce their practical algorithm for highdimensional, continuous control: BatchConstrained deep Qlearning (BCQ). It utilizes four parameterized networks.

A Generative model $G_\omega(s)$ which, given the state as input, produces an action. Using a generative model this way assumes we pick actions using:
\[\operatorname*{argmax}_{a} \;\; P_{\mathcal{B}}^G(as)\]or in other words, the most likely action given the state, with respect to the data in the batch. This is difficult to model in high dimensional continuous control environments, so they approximate it with a variational autoencoder. This is trained along with the policy parameters during each for loop iteration.

A Perturbation model $\xi_\phi(s,a,\Phi)$ which aims to “optimally perturb” the actions, so that they don’t need to sample too much from $G_\omega(s)$. The perturbation applies noise in $[\Phi,\Phi]$. It is updated via a deterministic policy gradient rule:
\[\phi \leftarrow \operatorname*{argmax}_\phi \;\; \sum_{(s,a) \in \mathcal{B}} Q_\theta\Big( s, a+\xi_\phi(s,a,\Phi)\Big)\]The above is a maximization problem over a sum of Qfunction terms. The Qfunction is differentiable as we parameterize it with a deep neural network, and stochastic gradient descent methods will work with stochastic inputs. I wonder, is the perturbation model overkill? Is it possible to do a cross entropy method, like what two of these papers do for robotic grasping?

Two Qnetworks $Q_{\theta_1}(s,a)$ and $Q_{\theta_2}(s,a)$, to help push their policy to select actions that lead to “more certain data.” This is based on their ICML paper last year, which proposed the (now popular!) TwinDelayed DDPG (TD3) algorithm. OpenAI’s SpinningUp has a helpful overview of TD3.
All networks other than the generative model also consist of target networks, following standard DDPG practices.
All together, their algorithm uses this policy:
\[\pi(s) = \operatorname*{argmax}_{a_i+\xi_\phi(s,a_i,\Phi)} \;\; Q_\theta\Big(s, a_i+\xi_\phi(s,a_i,\Phi)\Big) \quad \quad \{a_i \sim G_\omega(s) \}_{i=1}^n\]To be clear, they approximate this maximization by sampling $n$ actions each time step, and picking the best one. The perturbation model, as stated earlier, increases the diversity of the sampled actions. Once again, it would be nice to confirm that this is necessary, such as via an experiment that shows the VAE collapses to a mode. (I don’t see justification in the paper or the appendix.)
There is a useful interpretation of how this algorithm is a continuum between behavior cloning (if $n=1$ and $\Phi=0$) and Qlearning ($n\to \infty$ and $\Phi \to a_{\rm max}a_{\rm min}$).
All right, that was their theory and algorithm — now let’s discuss the experiments. They test with DDPG under several different conditions. They assume that there is a “behavioral DDPG” agent which generates the batch of data, for which an “offpolicy DDPG” agent learns from, without exploration. Their goal is to improve the learning of the “offpolicy DDPG.” (Don’t get confused with the actorcritic framework of normal DDPG … just think of the behavioral DDPG as the thing that generates the batch in “batchconstrained RL.”)

Final Buffer. They train the behavioral DDPG agent from scratch for 1 million steps, adding more noise than usual for extra exploration. Then all of its experience is pooled inside an experience replay. That’s the “batch”. Then, they use it to train the offpolicy DDPG agent. That offpolicy agent does not interact with the environment — it just draws samples from the buffer. Note that this will result in widespread state coverage, including potentially the early states when the behavioral agent was performing poorly.

Concurrent. This time, as the behavioral DDPG agent learns, the offpolicy one learns as well, using data from the behavioral agent. Moreover, the original behavioral DDPG agent is also learning from the same data, so both agents learn from identical datsets. To be clear: this means the agents have almost identical training settings. The only differences I can think of are: (a) noise in initial parameters and (b) noise in minibatch sampling. Is that it?

Imitation. After training the behavioral DDPG agent, they run it for 1 million steps. Those experiences are added to the buffer, from which the offpolicy DDPG agent learns. Thus, this is basically the imitation learning setting.

Imperfect Demonstrations. This is the same as the “imitation” case, except some noise is added to the data, through Gaussian noise on the states and randomness in action selection. Thus, it’s like adding more coverage to an expert data.
The experiments use … MuJoCo. Argh, we’re still using it as a benchmark. They test with HalfCheetahv1, Hopperv1, and Walker2dv1. Ideally there would be more, at least in the main part of the paper. The Appendix has some limited Pendulumv0 and Reacherv1 results. I wonder if they tried on Humanoidv1.
They actually performed some initial experiments before presenting the theory, which justifies the need to correct for extrapolation error. The most surprising fact there was that the offpolicy DDPG agent failed to match the behavioral agent even in the concurrent learning paradigm. That’s surprising!
This was what motivated their BatchConstrained deep Qlearning (BCQ) algorithm, discussed above.
As for their results, I am a little confused after reading Figure 2. They say that:
Only BCQ matches or outperforms the performance of the behavioral policy in all tasks.
Being colorblind, the BCQ and VAEBC colors look indistinguishable to me. (And the same goes for the DQN and DDPG baselines, which look like they are orange and orange, respectively.) I wish there was better color contrast, perhaps with light purple and dark blue for the former, and yellow and red for the latter. Oh well. I assume that their BCQ curve is the highest one on the rewards plot … but this means it’s not that much better than the baselines on Hopperv1 except for the imperfect demonstrations task. Furthermore, the shaded area is only half of a standard deviation, rather than one. Finally, in the imitation task, simple behavior cloning was better. So, it’s hard to tell if these are truly statistically significant results.
While I wish the results were more convincing, I still buy the rationale of their algorithm. I believe it is a valuable contribution to the research community.

More advanced forms of imitation learning might require substantial environment interaction, such as Generative Adversarial Imitation Learning. (My blog post about that paper is here.) ↩

One of the ICLR reviewers brought up that this is more of an imitation learning algorithm than it is a reinforcement learning one … ↩