Reframing Reinforcement Learning as Sequence Modeling with Transformers?

The Transformer Network, developed by Google and presented in a NeurIPS 2017 paper, is one of the few ideas that can truly claim to have fundamentally transformed (pun intended) the field of Artificial Intelligence. Transformer Networks have become the foundation of some of the most dramatic performance advances in Natural Language Processing (NLP). Two prominent examples are Google’s BERT model, which uses a bidirectional Transformer, and OpenAI’s line of GPT models, which use a unidirectional Transformer. Both lines of work have substantially helped their respective companies’ bottom lines: BERT has boosted Google’s search capabilities to new tiers, and OpenAI uses GPT-3 for automatic text generation in its first commercial product.

For a solid understanding of Transformer Networks, it is probably best to read the original paper and try out sample code. However, the Transformer Network paper has also spawned a seemingly endless series of blog posts and tutorial articles, which can be solid references (though with high variance in quality). Two of my favorite posts are from well-known bloggers Jay Alammar and Lilian Weng, who serve as inspirations for my current blogging habits. Of course, I am also guilty of jumping on this bandwagon, since I wrote a blog post on Transformers a few years ago.

Transformers have changed the trajectory of NLP and other fields such as protein modeling (e.g., the MSA transformer) and computer vision. OpenAI has an ICML 2020 paper which introduces Image-GPT, and the name alone should be self-explanatory. But, what about the research area I focus on these days, robot learning? It seems like Transformers have had less impact in this area. To be clear, researchers have already tried to replace existing neural networks used in RL with Transformers, but this does not fundamentally change the nature of the problem, which is consistently framed as a Markov Decision Process where states follow the Markovian property of being a function of only the prior state and action.

That might now change. Earlier this month, two groups in BAIR released arXiv preprints that use Transformers for RL, and which do away with MDPs and treat RL as one big sequence modeling problem. They propose models called Decision Transformer and Trajectory Transformer. These have not yet been peer-reviewed, but judging from the format, it’s likely that both are under review for NeurIPS. Let’s dive into the papers, shall we?

Decision Transformer

This paper introduces the Decision Transformer, which takes a particular trajectory representation as input, and outputs action predictions at training time, or the actual actions at test time (i.e., evaluation).

First, how is a trajectory represented? In RL, a trajectory is typically a sequence of states, actions, and rewards. In this paper, however, they consider the return-to-go:

\[\hat{R}_t = \sum_{t'=t}^{T} r_{t'}\]

resulting in the full trajectory representation of:

\[\tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_T, s_T, a_T)\]
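
To make the representation concrete, here is a minimal Python sketch that computes the return-to-go for each time step and interleaves it with states and actions. The function names and the toy data are my own, purely for illustration, and are not taken from the paper's code.

```python
from typing import Any, List, Sequence


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted suffix sums: rtg[t] = rewards[t] + ... + rewards[-1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each entry reuses the sum of the later rewards.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(rtg: Sequence[float], states: Sequence[Any],
               actions: Sequence[Any]) -> List[Any]:
    """Flatten into (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    tokens: List[Any] = []
    for R, s, a in zip(rtg, states, actions):
        tokens.extend([R, s, a])
    return tokens


# Toy episode with three time steps (hypothetical data).
rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)
trajectory = interleave(rtg, ["s1", "s2", "s3"], ["a1", "a2", "a3"])
```

The first return-to-go is simply the total episodic return, and the last one equals the final reward.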

This already raises the question of why this representation is chosen. The reason is that at test time, the Decision Transformer must be paired up with a desired performance, which is the cumulative episodic return. Given that as input, after each time step, the agent gets the per-time-step reward from the environment emulator, and decreases the desired performance by that amount. Then, this revised desired performance value is passed again as input, and the process repeats. The immediate question I had after this was whether it would be possible to predict the return-to-go accurately, and whether the Decision Transformer could extrapolate beyond the best return-to-go in the training data. Spoiler alert: the paper reports experiments with this, finding a strong correlation between predicted and actual return, and it is possible to extrapolate beyond the best return in the data, but only by a little bit. That’s fair; it would be unrealistic to assume it could get any return-to-go feasible from the environment emulator.
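
The test-time procedure can be sketched as a short loop. This is a simplified illustration under my own assumptions: `policy` and `env_step` are hypothetical stand-ins for the trained model and the environment emulator, and the policy here only sees the current return-to-go and state rather than a window of recent time steps.

```python
def rollout_with_target(policy, env_step, initial_state, target_return, max_steps):
    """Roll out a return-conditioned policy, shrinking the target after each reward."""
    state = initial_state
    rtg = target_return  # desired performance still to be collected
    history = []
    for _ in range(max_steps):
        action = policy(rtg, state)
        next_state, reward, done = env_step(state, action)
        history.append((rtg, state, action, reward))
        rtg -= reward  # decrease the desired performance by the reward received
        state = next_state
        if done:
            break
    return history


# Hypothetical toy problem: action 1 yields reward 1, action 0 yields nothing,
# and the episode ends once the integer state reaches 4.
def toy_env_step(state, action):
    next_state = state + 1
    return next_state, float(action), next_state >= 4


def toy_policy(rtg, state):
    # Keep collecting reward only while some desired return remains.
    return 1 if rtg > 0 else 0


history = rollout_with_target(toy_policy, toy_env_step, 0, 2.0, max_steps=10)
```

In this toy case the agent stops collecting reward once the remaining target hits zero, which is the behavior the conditioning is meant to induce.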

The input to the Decision Transformer is a subset of the trajectory $\tau$ consisting of the $K$ most recent time steps, each of which consists of a tuple with three items as noted above (the return-to-go, state, and action). Note how this differs from a DQN-style method, which for each time step, takes in 4 stacked game frames but does not take in rewards or prior actions as input. Furthermore, in this paper, Decision Transformers use values such as $K=30$, so they consider a longer history.
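
As a rough sketch of what "the $K$ most recent time steps" means in practice, the helper below slices the last $K$ tuples from a trajectory. The left-padding and the mask are my own assumptions for handling the start of an episode, when fewer than $K$ steps exist; the paper's code may handle this differently.

```python
def recent_context(steps, K, pad_step=(0.0, None, None)):
    """Return the K most recent (return-to-go, state, action) tuples.

    If fewer than K steps exist, pad on the left (an assumption of this
    sketch) and return a mask marking which entries are real.
    """
    if K < 1:
        raise ValueError("K must be at least 1")
    window = list(steps[-K:])
    num_pad = K - len(window)
    padded = [pad_step] * num_pad + window
    mask = [0] * num_pad + [1] * len(window)
    return padded, mask


# Hypothetical three-step trajectory.
steps = [(3.0, "s1", "a1"), (2.0, "s2", "a2"), (2.0, "s3", "a3")]
context, mask = recent_context(steps, K=2)
early_context, early_mask = recent_context(steps[:1], K=3)
```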

During training, the Decision Transformer only needs to predict an action, so it can be trained with the usual cross-entropy or mean squared error loss functions, depending on whether the action is discrete or continuous.
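
For readers who want a reminder of the two losses, here is a minimal pure-Python sketch for a single prediction. It is only an illustration of the formulas; a real implementation would use a deep learning library's batched versions.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target action under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def mean_squared_error(predicted, target):
    """Average squared difference across the action dimensions."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Discrete action (e.g., Atari): logits over actions, with the dataset's action index.
discrete_loss = cross_entropy([0.0, 0.0], target_index=0)

# Continuous action (e.g., D4RL): predicted versus dataset action vector.
continuous_loss = mean_squared_error([1.0, 2.0], [1.0, 4.0])
```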

Now, what is the architecture for predicting or generating actions? Decision Transformers use GPT, an auto-regressive model, meaning that it handles probabilities of the form $p(x_t | x_{t-1}, \ldots, x_1)$ where the prediction of something at a current time is conditioned on all prior data. GPT uses this to generate (that’s what the “G” stands for) by sampling the $x_t$ term. In my notation of the $x_i$ terms, imagine all of those represent data tuples of (return-to-go, state, action). That’s what the GPT model deals with, and it produces the next predicted tuple. Well, technically they only need to predict the action, but I wonder if state prediction could be useful? From communicating with the authors, they didn’t get much performance benefit from predicting states, but it is doable.
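
For completeness, the auto-regressive view corresponds to the standard chain-rule factorization of a sequence's joint probability, written here in my $x_i$ notation:

\[p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t | x_{t-1}, \ldots, x_1)\]

Each factor is one of the conditional probabilities the model handles, so generating a sequence amounts to sampling one term at a time and feeding it back in as context.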

There are also various embedding layers applied on the input before it is passed to the GPT model. I highly recommend looking at Algorithm 1 in the paper, which presents the procedure in nicely written pseudocode. The Appendix also clarifies the code bases that they build upon, and both are publicly available. Andrej Karpathy’s minGPT code looks nice and is self-contained.

That’s it! Notice how the Decision Transformer does not do bootstrapping to estimate value functions.

The paper evaluates on a suite of offline RL tasks, using environments from Atari (discrete control), from D4RL (continuous control), and from a “Key-to-Door” task. Fortunately for me, I had recently done a lot of reading on offline RL, and I even wrote a survey-style blog post about it a few months ago. The Decision Transformer is not specialized towards offline RL. It just happens to be the problem setting the paper considers, because not only is it very important, it is also a nice fit in that (again) the Decision Transformer does not perform bootstrapping, which is known to cause diverging Q-values in many offline RL contexts.

The results suggest that Decision Transformer is on par with state-of-the-art offline RL algorithms. It is a little worse on Atari, and a little better on D4RL. It seems to do a lot better on the Key-to-Door task but I’m not sufficiently familiar with that benchmark. However, since the paper is proposing an approach fundamentally different from most RL methods, it is impressive to get similar performance. I expect that future researchers will build upon the Decision Transformer to improve its results.

Trajectory Transformer

Now let us consider the second paper, which introduces the Trajectory Transformer. As with the prior paper, it departs from the usual MDP assumptions, and it also does not require dynamic programming or bootstrapped estimates. Instead, it directly uses properties from the Transformer to encode all the ingredients it needs for a wide range of control and decision-making problems. As it borrows techniques from language modeling, the paper argues that the main technical innovation is understanding how to represent a trajectory. Here, the trajectories $\tau$ are represented as:

\[\tau = \{ \mathbf{s}_t^0, \mathbf{s}_t^{1}, \ldots, \mathbf{s}_t^{N-1}, \mathbf{a}_t^0, \mathbf{a}_t^{1}, \ldots, \mathbf{a}_t^{M-1}, r_t \}_{t=0}^{T-1}\]

My first reaction was that this looks different than the trajectory representation for Decision Transformers. There’s no return-to-go written here, but this is a little misleading. The Trajectory Transformer paper tests three decision-making settings: (1) imitation learning, (2) goal-conditioned RL, and (3) offline RL. The Decision Transformer paper focuses on applying the framework to offline RL only. For offline RL, the Trajectory Transformer actually uses the return-to-go as an extra component in each data tuple in $\tau$. So I don’t believe there is any fundamental difference in terms of the trajectory consisting of states, actions, and return-to-go, though the Trajectory Transformer seems to also take in the current scalar $r_t$ as input, so that could be one difference, and it also appears to use a discount factor in the return-to-go. Both seem minor.
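
For reference, a discounted return-to-go generally takes the following form. This is the standard definition rather than something I copied from the paper; the discount factor $\gamma \in (0, 1]$ is my notation, and the indexing follows the trajectory above:

\[\hat{R}_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}\]

Setting $\gamma = 1$ recovers the undiscounted sum used for the Decision Transformer, up to the change in indexing.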

Perhaps a more fundamental difference is with discretization. The Decision Transformer paper doesn’t mention discretization, and from contacting the authors, I confirmed they did not discretize. So for continuous states and actions, the Decision Transformer likely just represents them as vectors in $\mathbb{R}^d$ for some suitable $d$ representing the state or action dimension. In contrast, Trajectory Transformers use discretized states and actions as input, and the paper helpfully explains how the indexing and offsets work. While this may be inefficient, the paper states, it allows them to use a more expressive model. My intuition for this phrase comes from histograms: in theory, histograms can represent arbitrarily complex 1D data distributions, whereas a 1D Gaussian must have a specific “bell-shaped” structure.
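
To give a feel for what discretization with offsets could look like, here is a minimal sketch. The uniform binning, the per-dimension bounds, and the specific offset scheme (dimension index times number of bins) are my own assumptions for illustration and may not match the paper's exact scheme.

```python
def discretize(vector, lows, highs, num_bins):
    """Map each continuous dimension to a token id with a per-dimension offset."""
    tokens = []
    for dim, (x, lo, hi) in enumerate(zip(vector, lows, highs)):
        frac = (x - lo) / (hi - lo)
        # Clamp so that out-of-range values fall into the first or last bin.
        bin_index = min(num_bins - 1, max(0, int(frac * num_bins)))
        # The offset gives every dimension its own disjoint range of token ids.
        tokens.append(dim * num_bins + bin_index)
    return tokens


def undiscretize(tokens, lows, highs, num_bins):
    """Map token ids back to the center of their bins (lossy by construction)."""
    values = []
    for dim, token in enumerate(tokens):
        bin_index = token - dim * num_bins
        width = (highs[dim] - lows[dim]) / num_bins
        values.append(lows[dim] + (bin_index + 0.5) * width)
    return values


# Hypothetical 2D state with different bounds per dimension and 4 bins each.
lows, highs = [-1.0, 0.0], [1.0, 1.0]
tokens = discretize([0.0, 0.5], lows, highs, num_bins=4)
recovered = undiscretize(tokens, lows, highs, num_bins=4)
```

The round trip does not return the original values exactly, which is the inefficiency (or rather, the precision loss) one trades for being able to model each dimension with a flexible categorical distribution.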

As with the Decision Transformer, the Trajectory Transformer uses a GPT as its backbone, and is trained to optimize log probabilities of states, actions, and rewards, conditioned on prior information in the trajectory. This enables test-time prediction by sampling from the trained model using what is known as beam search. This is another core difference between the Trajectory Transformer and Decision Transformer. The former uses beam search, the latter does not, and that’s probably because with discretization, it may be easier to do multimodal reasoning.
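
For readers unfamiliar with beam search, here is a generic sketch on a toy two-token model. The `next_log_probs` callback and the probability table are hypothetical, and ranking candidates purely by log-likelihood is an assumption of this sketch rather than a description of the paper's planner.

```python
import math


def beam_search(next_log_probs, start, steps, beam_width):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams = [(0.0, list(start))]  # (cumulative log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical toy model: the next-token distribution depends on the last token.
TABLE = {
    None: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.1, "b": 0.9},
}


def toy_next_log_probs(seq):
    last = seq[-1] if seq else None
    return {tok: math.log(p) for tok, p in TABLE[last].items()}


greedy = beam_search(toy_next_log_probs, [], steps=2, beam_width=1)
wider = beam_search(toy_next_log_probs, [], steps=2, beam_width=2)
```

With a beam width of 1 this reduces to greedy decoding, which commits to "a" first and ends with a sequence of probability 0.3. With a beam width of 2, the search keeps "b" alive and finds the sequence "b", "b" with probability 0.36.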

For quantitative results, they again test on D4RL for offline RL experiments. The results suggest that Trajectory Transformers are competitive with prior state-of-the-art offline RL algorithms. Again, as with Decision Transformers, the results aren’t significant improvements, but the fact that they’re able to get to this performance for the first iteration of this approach is impressive in its own right. They also show a nice qualitative visualization where their Trajectory Transformer can produce a long sequence of predicted trajectories of a humanoid, whereas a popular state-of-the-art model-based RL algorithm known as PETS makes significantly worse predictions.

The project website succinctly summarizes the comparisons between Trajectory Transformer and Decision Transformer as follows:

Chen et al concurrently proposed another sequence modeling approach to reinforcement learning. At a high-level, ours is more model-based in spirit and theirs is more model-free, which allows us to evaluate Transformers as long-horizon dynamics models (e.g., in the humanoid predictions above) and allows them to evaluate their policies in image-based environments (e.g., Atari). We encourage you to check out their work as well.

To be clear, the idea that Trajectory Transformer is model-based and that Decision Transformer is model-free is partly because the former predicts states, whereas the latter only predicts actions.

Concluding Thoughts

Both papers show that we can consider RL as a sequence learning problem, where Transformers can take in a long sequence of data and predict something. The two approaches can get around the “deadly triad” in RL since bootstrapping value estimates is not necessary. The use of Transformers enables building upon an extensive literature for Transformers in other fields. And it’s very extensive, even though Transformers are only 4 years old (the original paper has an absurd 22955 Google Scholar citations as of today)! The models use the same fundamental backbone, and I wonder if there are ways to merge the approaches. Would beam search, for example, be helpful in Decision Transformers, and would conditioning on return-to-go be helpful for Trajectory Transformer?

To reiterate, the results are not “out of this world” compared to current state-of-the-art RL using MDPs, but as a first step, these look impressive. Moreover, I am guessing that the research teams are busy extending the capabilities of these models. These two papers have very high impact potential. Assuming the research community is able to improve upon these models, this approach may even become the standard treatment for RL. I am excited to see what will come.