Transformer Networks for State of the Art Natural Language Processing

The Transformer network, introduced by Google researchers in their paper titled “Attention Is All You Need” at NIPS 2017, was perhaps the most groundbreaking result in all of Deep Learning in the year 2017. In part because the CS 182/282A class has a homework assignment on Transformer networks, I finally got around to understanding some of its details, which I will discuss in this post.

Before we proceed, be aware that despite the similar-sounding names, Transformer networks are not the same thing as spatial transformer networks. Transformer networks are deep neural networks that use attention and eschew convolutional and recurrent layers, while spatial transformer networks are convolutional neural networks with a trainable module that enable increased spatial invariance to the input, which might include size and pose changes.

Introduction and Context

Way back in Fall 2014, I took CS 288 at Berkeley, which is the graduate-level Natural Language Processing course we offer. That was happening in tandem with some of the earlier post-AlexNet Deep Learning results, but we didn’t do any Deep Learning in 288. Fast forward five years later, and I’m sure the course is vastly different nowadays. Actually, I am not even sure it has been offered since I took it! This is probably because Professor Klein has been working at Semantic Machines (which recently got acquired by Microsoft, congratulations to them) and because we don’t really have another pure NLP research faculty member.

I did not do any NLP whatsoever since taking CS 288, focusing instead on computer vision and robotics, which are two other sub-fields of AI which, like NLP, heavily utilize Deep Learning. Thus, focusing on those fields meant NLP was my weakest subject of the “AI triad.” I am trying to rectify that so I can be conversationally fluent with other NLP researchers and know what words like “Transformers” (the subject of this blog post!), “ELMO,” “BERT,” and others mean.

In lieu of Berkeley-related material, I went over to Stanford to quickly get up to speed on Deep Learning and NLP. Fortunately, Stanford’s been on top of this by providing CS 224n, the NLP analogue of the fantastic CS 231n course for computer vision. There are polished lecture videos online for the 2017 edition and then (yes!!) the 2019 edition, replete with accurate captions! I wonder if the people who worked on the captions first ran a Deep Learning-based automatic speech recognizer, and then fine-tuned the end result? Having the 2019 edition available is critical because the 2017 lectures happened before the Transformer was released to the public.

So, what are Transformers? They are a type of deep neural network originally designed to address the problem of transforming one sequence into another. A common example is with machine translation (or “transduction” I suppose — I think that’s a more general umbrella term), where we can ask the network to translate a source sentence into a target sentence.

From reviewing the 224n material, it seems like from the years 2012 to 2017, the most successful Deep Learning approaches to machine translation involved recurrent neural networks, where the input source was passed through an encoder RNN, and then the final hidden state would be used as input to a decoder RNN which then provided the output sequentially. One of the more popular versions of this model was from (Sutskever et al., 2014) at NIPS. This was back when Ilya was at Google, before he became Chief Scientist of OpenAI.

One downside is that these methods rely on the decoder being able to work with a fixed-size, final hidden state of the encoder, which must somehow contain all the relevant information for translation. Some work has addressed that by allowing the decoder to use attention mechanisms to “search backwards” at earlier hidden layers from the decoder. This appears to be the main idea from (Loung et al., 2015).

But work like that still uses an RNN-based architecture. RNNs, including more advanced versions like LSTMs and GRUs, are known to be difficult to train and parallelize. By parallelize, we mean within each sample; we can clearly still parallelize across $B$ samples within a minibatch. Transformer networks can address this issue, as they are not recurrent (or convolutional, for that matter!).

(Some) Architectural Details

Though the Transformer is billed as a new neural network architecture, it still follows the standard “encoder-decoder” that are commonly used in seq2seq models, where the encoder produces a representation of the word, and the decoder autoregressively produces the output. The main building blocks of the Transformer are also not especially exotic, but rather clever arrangements of existing operations that we already know how to do.

Probably the most important building block within the Transformer is the Scaled Dot-Product Attention. Each involves:

A set of queries $\{q_1, \ldots, q_{n_q}\}$, each of size $d_k$.
A set of keys $\{k_1, \ldots, k_{n_k}\}$, each of size $d_k$ (same size as the queries).
A set of values $\{v_1, \ldots, v_{n_v}\}$, each of size $d_v$. In the paper, they set $d_k=d_v$ for simplicity. Also, $n_v = n_k$ because they refer to a set of key-value pairs.

Don’t get too bogged down with what these three mean. They are vectors that we can abstract away. The output of the attention blocks are the weighted sum of values, where the weights are determined by the queries and keys from a compatibility function. We’ll go through the details soon, but the “scaled dot-product attention” that I just mentioned is that compatibility function. It involves a dot product between a query and a key, which is why we constrain them to be the same size $d_k$.

Let’s stack these into matrices, which we would have to do for implementation purposes. We obtain:

\[Q = \underbrace{\begin{bmatrix} \cdots & q_1 & \cdots \\ & \vdots & \\ \cdots & q_{n_q} & \cdots \end{bmatrix}}_{n_q \times d_k} ,\quad K = \underbrace{\begin{bmatrix} \cdots & k_1 & \cdots \\ & \vdots & \\ \cdots & k_{n_k} & \cdots \end{bmatrix}}_{n_k \times d_k} ,\quad V = \underbrace{\begin{bmatrix} \cdots & v_1 & \cdots \\ & \vdots & \\ \cdots & v_{n_k} & \cdots \end{bmatrix}}_{n_k \times d_v}\]

where individual queries, keys, and values are row vectors within the arrays. The Transformers paper doesn’t specify the exact row/column orientation because I’m sure it doesn’t actually matter in practice; I found the math to be a little easier to write out with treating elements as row vectors. My LaTeX here won’t be as polished as it would be for an academic paper due to MathJax limitations, but think of the above as showing the pattern for how I portray row vectors.

The paper then states that the output of the self-attention layer follows this equation:

\[{\rm Attention}(Q,K,V) = {\rm softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

For the sake of saving valuable column space on this blog, I’ll use ${\rm Attn}$ and ${\rm soft}$ for short. Expanding the above, and flipping the pattern in the $K$ matrix (due to the transpose), we get:

\[\begin{align} {\rm Attn}(Q,K,V) &= {\rm soft} \left(\frac{1}{\sqrt{d_k}} \underbrace{\begin{bmatrix} \cdots & q_1 & \cdots \\ & \vdots & \\ \cdots & q_{n_q} & \cdots \end{bmatrix}}_{n_q\times d_k} \underbrace{\begin{bmatrix} \vdots & & \vdots \\ k_1 & \cdots & k_{n_k} \\ \vdots & & \vdots \end{bmatrix}}_{d_k \times n_k} \right) \underbrace{\begin{bmatrix} \cdots & v_1 & \cdots \\ & \vdots & \\ \cdots & v_{n_k} & \cdots \end{bmatrix}}_{n_k \times d_v} \\ &= {\rm soft} \left(\frac{1}{\sqrt{d_k}} \underbrace{\begin{bmatrix} q_1^Tk_1 & \cdots & q_1^Tk_{n_k} \\ \vdots & \vdots & \vdots \\ q_{n_q}^Tk_1 & \cdots & q_{n_q}^Tk_{n_k} \end{bmatrix}}_{n_q \times n_k} \right) \underbrace{\begin{bmatrix} \cdots & v_1 & \cdots \\ & \vdots & \\ \cdots & v_{n_k} & \cdots \end{bmatrix}}_{n_k \times d_v} \\ &= \underbrace{\begin{bmatrix} {\rm soft}\Big( \frac{q_1^Tk_1}{\sqrt{d_k}}, \ldots, \frac{q_1^Tk_{n_k}}{\sqrt{d_k}} \Big) \\ \vdots \\ {\rm soft}\Big( \frac{q_{n_q}^Tk_1}{\sqrt{d_k}}, \ldots, \frac{q_{n_q}^Tk_{n_k}}{\sqrt{d_k}} \Big) \\ \end{bmatrix}}_{n_q \times n_k} \underbrace{\begin{bmatrix} \cdots & v_1 & \cdots \\ & \vdots & \\ \cdots & v_{n_k} & \cdots \end{bmatrix}}_{n_k \times d_v} \\ \end{align}\]

In the above, I apply the softmax operator row-wise. Notice that this is where the compatibility function happens. We are taking dot products between queries and keys, which are scaled by $1/\sqrt{d_k}$, hence why we call it scaled dot-product attention! It’s also natural why we call this “compatible” — a higher dot product means the vectors are more aligned in their components.

The last thing to note for this sub-module is where the weighted sum happens. It comes directly from the matrix-matrix product that we have above. It’s easier to think of this with respect to one query, so we can zoom in on the first row of the query-key dot product matrix. Rewriting that softmax row vector’s elements as $\{w_1^{(1)}, \ldots, w_{n_k}^{(1)}\}$ to simplify the subsequent notation, we get:

\[\underbrace{\begin{bmatrix} w_1^{(1)} & \ldots & w_{n_k}^{(1)} \\ \end{bmatrix}}_{1 \times n_k} \underbrace{\begin{bmatrix} \cdots & v_1 & \cdots \\ & \vdots & \\ \cdots & v_{n_k} & \cdots \end{bmatrix}}_{n_k \times d_v} = \sum_{i=1}^{n_k} w^{(1)}_i v_i\]

and that is our weighted sum of values. The $w_i^{(1)}$ are scalars, and the $v_i$ are $d_v$-sized vectors.

All of this is for one query, which takes up one row in the matrix. We extend to multiple queries by adding more queries as rows.

Once you understand the above, the extensions they add onto this are a little more straightforward. For example, they don’t actually use the $Q$, $K$, and $V$ matrices directly, but linearly project them using trainable parameters, which is exactly what a dense layer does. But understanding the above, you can abstract the linear projection part away. They also perform several of these in parallel and then concatenate across one of the axis, but again, that part can be abstracted away easily.

The above happens within one module of an encoder or decoder layer. The second module of it is a plain old fully connected layer:

\[{\rm FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\]

They then wrap both the self-attention part and the fully connected portion with a residual connection and layer normalization. Whew! All of this gets repeated for however many layers we can stomach. They used six layers for the encoder and decoder.

Finally, while I won’t expand upon it in this blog post, they rely on positional embeddings in order to make use of the sequential nature of the input. Recurrent neural networks, and even convolutional neural networks, maintain information about the ordering of the sequence. Positional embeddings are thus a way of forcing the network to recognize this in the absence of these two such layers.

Experiments and Results

Since I do not do NLP research, I am not sure about what benchmarks researchers use. Hence, that’s why I want to understand their experiments and results. Here are the highlights:

They performed two machine translation tasks: English-to-German and English-to-French. Both use “WMT 2014” datasets. Whatever those are, they must presumably be the standard for machine translation research. The former consists of 4.5 million sentence pairs, while the latter is much larger at 36 million sentence pairs.
The metric they use is BLEU, the same one that NLP-ers have long been using for translation. and they get higher numbers than prior results. As I explained in my blog post about CS 288, I was not able to understand the spoken utterances of Professor Klein, but I think I remember him saying something like this about BLEU: “nobody likes it, everybody uses it”. Oh well. You can see the exact BLEU scores in the paper. Incidentally, they average across several checkpoints for their actual model. I think this is prediction averaging rather than parameter averaging but I am not sure. Parameter averaging could work in this case because the parameters are relatively constrained in the same subspace if we’re taking about consecutive snapshots.
They train their model using 8 NVIDIA P100 GPUs. I understand that Transformers are actually far more compute efficient (with respect to floating point operations) than prior state of the art models, but man, AI is really exacerbating compute inequality.
WOW, Table 3 shows results from varying the Transformer model architecture. This is nice because it shows that Google has sufficiently explored variations on the model, and the one they present to us is the best. I see now why John Canny once told me that this paper was one of the few that met his criteria of “sufficient exploration.”
Finally, they generalize the Transformer to the task of English constituency parsing. It doesn’t look like they achieve SOTA results, but it’s close.

Needless to say, the results in this paper are supremely impressive.

The Future

This paper has 1,393 citations at the time of me writing this, and it’s only been on arXiv since June 2017! I wonder how much of the results are still SOTA. I am also curious about how Transformers can be “plugged into” other architectures. I am aware of at least two notable NLP architectures that use Transformers:

BERT, introduced in the 2018 paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. It’s another Google product, but one which at the time of publication had noticeable SOTA results. You can see one of Jay Alammar’s excellent “illustrations” of it here. It looks like I have another blog added to my reading list!
GPT-2, released in February 2019 in the form of a blog post and a preprint in ICML format (but I’m not sure if it was actually submitted to ICML, since OpenAI appears to have changed their recent publication habits). Oh yeah. Oh yeah. I have no direct comment on the controversy this created, apart from worrying that the media will continue with their click-baiting headlines that don’t accurately reveal the context and limitations of results.

I’m sure there are so many other applications of Transformers. I wonder how often Transformers have been used in robotics?