# Reframing Reinforcement Learning as Sequence Modeling with Transformers?

The **Transformer Network**, developed by Google and presented in a NeurIPS
2017 paper, is one of the few innovations that can truly claim to have
fundamentally transformed (pun intended) the field of Artificial Intelligence.
Transformer Networks have become the foundation of some of the most dramatic
performance advances in Natural Language Processing (NLP). Two prominent
examples are Google’s BERT model, which uses a bidirectional Transformer,
and OpenAI’s line of GPT models, which uses a unidirectional Transformer.
Both models have substantially helped their respective companies’ bottom
lines: BERT has boosted the quality of Google’s search results, and OpenAI
uses GPT-3 for automatic text generation in its first commercial product.

For a solid understanding of Transformer Networks, it is probably best to read the
original paper *and* try out sample code. However, the Transformer Network
paper has also spawned a seemingly endless series of blog posts and tutorial
articles, which can be solid references (though with high variance in quality).
Two of my favorite posts are from well-known bloggers Jay Alammar and
Lilian Weng, who serve as inspirations for my current blogging habits. Of
course, I am also guilty of jumping on this bandwagon, since I wrote a blog
post on Transformers a few years ago.

Transformers have changed the trajectory of NLP and other fields such as protein modeling (e.g., the MSA Transformer) and computer vision. OpenAI has an ICML 2020 paper which introduces Image-GPT, and the name alone should be self-explanatory. But, what about the research area I focus on these days, robot learning? It seems like Transformers have had less impact in this area. To be clear, researchers have already tried to replace existing neural networks used in RL with Transformers, but this does not fundamentally change the nature of the problem, which is still framed as a Markov Decision Process, where the next state depends only on the current state and action.

That might now change. Earlier this month, two groups in BAIR released arXiv
preprints that use Transformers for RL, doing away with MDPs and treating
RL as one big sequence modeling problem. They propose models called *Decision
Transformer* and *Trajectory Transformer*. These have not yet been
peer-reviewed, but judging from the format, it’s likely that both are under
review for NeurIPS. Let’s dive into the papers, shall we?

## Decision Transformer

This paper introduces the *Decision Transformer*, which takes a particular
*trajectory representation* as input, and outputs action predictions at
training time, or the actual actions at test time (i.e., evaluation).

First, how is a trajectory represented? In RL, a trajectory is typically a
sequence of states, actions, and rewards. In this paper, however, they consider
the *return-to-go*, the sum of rewards from the current time step through the
end of the episode:

\[\hat{R}_t = \sum_{t'=t}^{T} r_{t'}\]

resulting in the full trajectory representation of:

\[\tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_T, s_T, a_T)\]

This already raises the question of why this representation is chosen. The
reason is that at test time, the Decision Transformer must be paired up with a
*desired performance*, which is cumulative episodic return. Given that as
input, after each time step, the agent gets the per-time step reward from the
environment emulator, and decreases the desired performance by that amount.
Then, this *revised* desired performance value is passed again as input, and
the process repeats. The immediate question I had after this was whether it
would be possible to predict the return-to-go accurately, *and* if the Decision
Transformer could extrapolate beyond the best return-to-go in the training
data. Spoiler alert: the paper reports experiments with this, finding a strong
correlation between predicted and actual return, and it is possible to
extrapolate beyond the best return in the data, but only by a little bit.
That’s fair; it would be unrealistic to expect it to achieve *any* return-to-go
that is feasible in the environment emulator.
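
To make this return-conditioning loop concrete, here is a minimal sketch in Python. The names `model.predict_action` and the gym-style `env` are hypothetical stand-ins rather than anything from the paper’s code; the point is only how the desired return gets decremented and fed back in at every step.

```python
def evaluate_episode(model, env, target_return, max_steps=1000):
    """Hypothetical test-time rollout for a return-conditioned policy."""
    state = env.reset()
    returns_to_go, states, actions = [target_return], [state], []
    total_reward = 0.0

    for t in range(max_steps):
        # Condition on the recent history of (return-to-go, state, action) tuples.
        action = model.predict_action(returns_to_go, states, actions)
        state, reward, done, info = env.step(action)
        total_reward += reward

        # Key step: decrease the desired performance by the reward just received,
        # and pass this revised target back in at the next time step.
        returns_to_go.append(returns_to_go[-1] - reward)
        states.append(state)
        actions.append(action)
        if done:
            break
    return total_reward
```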

The input to the Decision Transformer is a *subset* of the trajectory $\tau$
consisting of the $K$ most recent time steps, each of which consists of a tuple
with three items as noted above (the return-to-go, state, and action). Note how
this differs from a DQN-style method, which, for each time step, takes in 4
stacked game frames but does not take rewards or prior actions as input.
Furthermore, in this paper, Decision Transformers use values such as $K=30$,
so they consider a longer history.

The output of the Decision Transformer simply requires predicting an action (during training), so it can be trained with the usual cross-entropy or mean squared error loss functions, depending on whether the actions are discrete or continuous.
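
As a rough sketch of that loss choice (my own code, not the authors’), assuming the model outputs logits for discrete actions and raw action values for continuous ones:

```python
import torch.nn.functional as F

def action_loss(predicted, target, discrete_actions):
    """Supervised loss on predicted actions.

    predicted: (batch, num_actions) logits if discrete, else (batch, act_dim) values.
    target:    (batch,) integer labels if discrete, else (batch, act_dim) values.
    """
    if discrete_actions:
        return F.cross_entropy(predicted, target)  # discrete: cross-entropy on logits
    return F.mse_loss(predicted, target)           # continuous: mean squared error
```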

Now, what is the *architecture* for predicting or generating actions? Decision
Transformers use GPT, an auto-regressive model that handles probabilities of
the form $p(x_t | x_{t-1}, \ldots, x_1)$, where the prediction at the current
time step is conditioned on all prior data. GPT uses this
to generate (that’s what the “G” stands for) by sampling the $x_t$ term. In my
notation of the $x_i$ terms, imagine all of those represent data tuples of
(return-to-go, state, action) – that’s what the GPT model deals with, and it
produces the next predicted tuple. Well, technically they only need to predict
the *action*, but I wonder if state prediction could be useful? From
communicating with the authors, they didn’t get much performance benefit from
predicting states, but it is doable.

There are also various embedding layers applied to the input before it is passed to the GPT model. I highly recommend looking at Algorithm 1 in the paper, which lays this out in nicely written pseudocode. The Appendix also clarifies the code bases that they build upon, and both are publicly available. Andrej Karpathy’s minGPT code looks nice and is self-contained.
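
To give a flavor of what those embedding layers do, here is a heavily simplified sketch loosely in the spirit of Algorithm 1. The class and variable names are mine, not the authors’; `gpt_backbone` is assumed to be any causal Transformer that maps a sequence of input embeddings to a sequence of hidden states, and details like layer norm, dropout, and padding are omitted.

```python
import torch
import torch.nn as nn

class DecisionTransformerSketch(nn.Module):
    """Simplified sketch of a Decision-Transformer-style forward pass (not the official code)."""

    def __init__(self, state_dim, act_dim, hidden_dim, gpt_backbone, max_timestep=1000):
        super().__init__()
        # One embedding per modality, plus a learned time-step embedding.
        self.embed_return = nn.Linear(1, hidden_dim)
        self.embed_state = nn.Linear(state_dim, hidden_dim)
        self.embed_action = nn.Linear(act_dim, hidden_dim)
        self.embed_time = nn.Embedding(max_timestep, hidden_dim)
        self.gpt = gpt_backbone  # assumed: causal Transformer over embedded tokens
        self.predict_action = nn.Linear(hidden_dim, act_dim)

    def forward(self, returns_to_go, states, actions, timesteps):
        # returns_to_go: (B, K, 1), states: (B, K, state_dim),
        # actions: (B, K, act_dim), timesteps: (B, K) integer indices.
        B, K = states.shape[0], states.shape[1]
        t = self.embed_time(timesteps)
        r = self.embed_return(returns_to_go) + t
        s = self.embed_state(states) + t
        a = self.embed_action(actions) + t

        # Interleave into (R_1, s_1, a_1, R_2, s_2, a_2, ...), a sequence of length 3K.
        tokens = torch.stack([r, s, a], dim=2).reshape(B, 3 * K, -1)
        hidden = self.gpt(tokens)

        # Predict a_t from the hidden state at the s_t token (every third token, offset 1).
        state_hidden = hidden[:, 1::3, :]
        return self.predict_action(state_hidden)
```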

That’s it! Notice how the Decision Transformer does not do bootstrapping to estimate value functions.

The paper evaluates on a suite of offline RL tasks, using environments from Atari (discrete control), from D4RL (continuous control), and from a “Key-to-Door” task. Fortunately for me, I had recently done a lot of reading on offline RL, and I even wrote a survey-style blog post about it a few months ago. The Decision Transformer is not specialized to offline RL; it just happens to be the problem setting the paper considers. Offline RL is not only very important, it is also a nice fit because (again) the Decision Transformer does not perform bootstrapping, which is known to cause divergent Q-values in many offline RL contexts.

The results suggest that Decision Transformer is on par with state-of-the-art offline RL algorithms. It is a little worse on Atari, and a little better on D4RL. It seems to do a lot better on the Key-to-Door task but I’m not sufficiently familiar with that benchmark. However, since the paper is proposing an approach fundamentally different from most RL methods, it is impressive to get similar performance. I expect that future researchers will build upon the Decision Transformer to improve its results.

## Trajectory Transformer

Now let us consider the second paper, which introduces the *Trajectory
Transformer*. As with the prior paper, it departs from the usual MDP
assumptions, and it also does not require dynamic programming or bootstrapped
estimates. Instead, it directly uses properties from the Transformer to encode
all the ingredients it needs for a wide range of control and decision-making
problems. As it borrows techniques from language modeling, the paper argues
that the main technical innovation is understanding how to represent a
trajectory. Here, the trajectories $\tau$ are represented as:

\[\tau = (s_1, a_1, r_1, \; s_2, a_2, r_2, \; \ldots, \; s_T, a_T, r_T)\]

My first reaction was that this looks different from the trajectory
representation for Decision Transformers. There’s no return-to-go written here,
but this is a little misleading. The Trajectory Transformer paper tests *three*
decision-making settings: (1) imitation learning, (2) goal-conditioned RL, and
(3) offline RL. The Decision Transformer paper focuses on applying the
framework to offline RL only. For offline RL, the Trajectory Transformer
actually uses the return-to-go as an extra component in each data tuple in
$\tau$. So I don’t believe there is any fundamental difference in the
trajectories consisting of states, actions, and return-to-go. The Trajectory
Transformer does also seem to take the current scalar reward $r_t$ as input,
and it appears to use a discount factor in the return-to-go, but both
differences seem minor.

Perhaps a more fundamental difference is with discretization. The Decision Transformer paper doesn’t mention discretization, and from contacting the authors, I confirmed that they did not discretize. So for continuous states and actions, the Decision Transformer likely just represents them as vectors in $\mathbb{R}^d$ for some suitable $d$ representing the state or action dimension. In contrast, Trajectory Transformers use discretized states and actions as input, and the paper helpfully explains how the indexing and offsets work. While this may be inefficient, the paper states that it allows them to use a more expressive model. My intuition for this phrase comes from histograms: in theory, histograms can represent arbitrarily complex 1D data distributions, whereas a 1D Gaussian must have a specific “bell-shaped” structure.
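
Here is a rough sketch (mine, not from the paper’s code) of uniform, per-dimension discretization with token offsets, so that bins from different state or action dimensions map to disjoint token ids:

```python
import numpy as np

def discretize(x, low, high, num_bins=100):
    """Map each dimension of a continuous vector to a token id.

    x, low, high: arrays of shape (d,). Dimension i gets the token range
    [i * num_bins, (i + 1) * num_bins), so dimensions never collide.
    """
    # Per-dimension bin index in [0, num_bins - 1].
    fraction = (x - low) / (high - low)
    bins = np.clip((fraction * num_bins).astype(int), 0, num_bins - 1)
    # Offset each dimension into its own slice of the vocabulary.
    offsets = np.arange(len(x)) * num_bins
    return bins + offsets

# Example: a 3D state in [-1, 1]^3 becomes three tokens, one per dimension.
tokens = discretize(np.array([0.0, 0.5, -1.0]),
                    low=np.array([-1.0, -1.0, -1.0]),
                    high=np.array([1.0, 1.0, 1.0]))
# tokens -> array([ 50, 175, 200])
```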

As with the Decision Transformer, the Trajectory Transformer uses a GPT as its
backbone, and is trained to optimize log probabilities of states, actions, and
rewards, conditioned on prior information in the trajectory. This enables
test-time prediction by sampling from the trained model using what is known as
*beam search*. This is another core difference between the Trajectory
Transformer and the Decision Transformer: the former uses beam search while the
latter does not, probably because discretization makes multimodal reasoning
easier.
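
For intuition, here is a minimal generic beam search over discrete tokens. It is a simplification: for control, the paper modifies the search to score candidate sequences with predicted rewards and values rather than only the log-probabilities used below, and `model` is an assumed interface that returns next-token logits.

```python
import torch
import torch.nn.functional as F

def beam_search(model, prefix, num_steps, beam_width=5):
    """Plain beam search; model(tokens) is assumed to return logits of
    shape (1, seq_len, vocab_size), and prefix is a 1D LongTensor."""
    beams = [(prefix, 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(num_steps):
        candidates = []
        for tokens, score in beams:
            logits = model(tokens.unsqueeze(0))               # (1, len, vocab)
            log_probs = F.log_softmax(logits[0, -1], dim=-1)  # next-token distribution
            top_lp, top_idx = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp, top_idx):
                new_tokens = torch.cat([tokens, idx.view(1)])
                candidates.append((new_tokens, score + lp.item()))
        # Keep only the beam_width highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]  # highest-scoring full sequence
```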

For quantitative results, they again test on D4RL for offline RL experiments.
The results suggest that Trajectory Transformers are competitive with prior
state-of-the-art offline RL algorithms. Again, as with Decision Transformers,
the results aren’t *significant* improvements, but the fact that they’re able
to get to this performance for the first iteration of this approach is
impressive in its own right. They also show a nice *qualitative* visualization
where their Trajectory Transformer can produce a long sequence of predicted
trajectories of a humanoid, whereas a popular state-of-the-art model-based RL
algorithm known as PETS makes significantly worse predictions.

The project website succinctly summarizes the comparisons between Trajectory Transformer and Decision Transformer as follows:

> Chen et al concurrently proposed another sequence modeling approach to reinforcement learning. At a high-level, ours is more model-based in spirit and theirs is more model-free, which allows us to evaluate Transformers as long-horizon dynamics models (e.g., in the humanoid predictions above) and allows them to evaluate their policies in image-based environments (e.g., Atari). We encourage you to check out their work as well.

To be clear, the Trajectory Transformer is viewed as model-based and the Decision Transformer as model-free partly because the former predicts states, whereas the latter only predicts actions.

## Concluding Thoughts

Both papers show that we can consider RL as a sequence learning problem, where
Transformers can take in a long sequence of data and predict something. The two
approaches can get around the “deadly triad” in RL since bootstrapping value
estimates is not necessary. The use of Transformers enables building upon an
extensive literature on Transformers in other fields, and it’s *very*
extensive, even though the original Transformer paper is only 4 years old (it
has an absurd 22,955 Google Scholar citations as of today)! The models use the same
fundamental backbone, and I wonder if there are ways to merge the approaches.
Would beam search, for example, be helpful in Decision Transformers, and would
conditioning on return-to-go be helpful for Trajectory Transformer?

To reiterate, the results are not “out of this world” compared to current state-of-the-art RL using MDPs, but as a first step, these look impressive. Moreover, I am guessing that the research teams are busy extending the capabilities of these models. These two papers have very high impact potential. Assuming the research community is able to improve upon these models, this approach may even become the standard treatment for RL. I am excited to see what will come.