Transformer (machine learning model)
A transformer is a deep learning model distinguished by its adoption of self-attention, which differentially weights the significance of each part of the input data (including the recursive output). It is used primarily in the fields of natural language processing (NLP)[1] and computer vision (CV).[2]
Like recurrent neural networks (RNNs), transformers are designed to process sequential input data, such as natural language, with applications towards tasks such as translation and text summarization. However, unlike RNNs, transformers process the entire input all at once. The attention mechanism provides context for any position in the input sequence. For example, if the input data is a natural language sentence, the transformer does not have to process one word at a time. This allows for more parallelization than RNNs and therefore reduces training times.[1]
Transformers were introduced in 2017 by a team at Google Brain[1] and are increasingly becoming the model of choice for NLP problems,[3] replacing RNN models such as long short-term memory (LSTM).[4] Compared to RNN models, transformers are more amenable to parallelization, allowing training on larger datasets. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and the original GPT (generative pre-trained transformer), which were trained with large language datasets, such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks.[5][6]
Background
Before transformers, most state-of-the-art NLP systems relied on gated RNNs, such as LSTMs and gated recurrent units (GRUs), with added attention mechanisms. Transformers also make use of attention mechanisms but, unlike RNNs, do not have a recurrent structure. This means that provided with enough training data, attention mechanisms alone can match the performance of RNNs with attention.[1]
The terms "query", "key", "value" are borrowed from key–value databases.
Previous work
In 1992, Jürgen Schmidhuber published the fast weight controller as an alternative to RNNs that can learn "internal spotlights of attention,"[7] and experimented with using it to learn variable binding.[8]
In a fast weight controller, a feedforward neural network ("slow") learns by gradient descent to control the weights of another neural network ("fast") through outer products of self-generated activation patterns called "FROM" and "TO", which correspond to the "key" and "value" of the attention mechanism.[9] This fast weight matrix is then applied to queries. The attention mechanism may be obtained by interposing a softmax operator and three linear operators (one for each of query, key, and value).[9][10]
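A minimal sketch of the fast weight idea, under the simplifying assumption that the key ("FROM"), value ("TO"), and query patterns are given directly rather than produced by a slow network (all names here are illustrative):

```python
import numpy as np

def fast_weight_step(W_fast, key, value, query):
    """One step of a simplified fast weight controller.

    A slow network would generate the key ("FROM") and value ("TO") patterns;
    their outer product updates the fast weight matrix, which is then applied
    to the query to produce the output.
    """
    W_fast = W_fast + np.outer(value, key)  # additive outer-product update
    output = W_fast @ query                 # fast weights applied to the query
    return W_fast, output

d = 4
rng = np.random.default_rng(0)
W_fast = np.zeros((d, d))
for _ in range(3):  # a few self-generated activation patterns
    key, value, query = rng.normal(size=(3, d))
    W_fast, out = fast_weight_step(W_fast, key, value, query)
```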
Sequential processing
Gated RNNs process tokens sequentially, maintaining a state vector that contains a representation of the data seen prior to the current token. To process the n-th token, the model combines the state representing the sentence up to token n − 1 with the information of the new token to create a new state, representing the sentence up to token n. Theoretically, the information from one token can propagate arbitrarily far down the sequence, if at every point the state continues to encode contextual information about the token. In practice this mechanism is flawed: the vanishing gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens. The dependency of token computations on the results of previous token computations also makes it hard to parallelize computation on modern deep-learning hardware. This can make the training of RNNs inefficient.
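To illustrate this sequential dependency, a minimal sketch of a recurrent update (a plain tanh cell standing in for a gated RNN):

```python
import numpy as np

def rnn_forward(tokens, W_h, W_x, b):
    """Process tokens strictly in order: the state for token n depends on the
    state for token n - 1, so positions cannot be computed in parallel."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in tokens:
        h = np.tanh(W_h @ h + W_x @ x + b)  # new state from old state + new token
        states.append(h)
    return states
```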
Self-attention
These problems were addressed by attention mechanisms. Attention mechanisms let a model draw from the state at any preceding point along the sequence. The attention layer can access all previous states and weigh them according to a learned measure of relevance, providing relevant information about far-away tokens.
A clear example of the value of attention is in language translation, where context is essential to assign the meaning of a word in a sentence. In an English-to-French translation system, the first word of the French output most probably depends heavily on the first few words of the English input. However, in a classic LSTM model, in order to produce the first word of the French output, the model is given only the state vector after processing the last English word. Theoretically, this vector can encode information about the whole English sentence, giving the model all the necessary knowledge. In practice, this information is often poorly preserved by the LSTM. An attention mechanism can be added to address this problem: the decoder is given access to the state vectors of every English input word, not just the last, and can learn attention weights that dictate how much to attend to each English input state vector.
When added to RNNs, attention mechanisms increase performance. In 2016, a new type of highly parallelizable decomposable attention was successfully combined with a feedforward network.[11] This indicated that attention mechanisms were powerful in themselves and that sequential recurrent processing of data was not necessary to achieve the quality gains of RNNs with attention. Soon Jakob Uszkoreit from Google Research also proposed replacing RNNs with self-attention and started the effort to evaluate that idea. Transformers use an attention mechanism without an RNN, processing all tokens simultaneously and calculating attention weights between them in successive layers. Since the attention mechanism only uses information about other tokens from lower layers, it can be computed for all tokens in parallel, which leads to improved training speed.
Architecture

Input
The input text is parsed into tokens by a byte pair encoding tokenizer, and each token is converted via a word embedding into a vector. Then, positional information of the token is added to the word embedding.
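A rough sketch of this pipeline; the function name, embedding table, and random placeholder encodings are illustrative rather than taken from any particular library:

```python
import numpy as np

# Hypothetical pipeline: token ids from a BPE tokenizer are embedded and
# combined with positional encodings.
def embed_input(token_ids, embedding_table, positional_encoding):
    x = embedding_table[token_ids]                   # (seq_len, d_model) word embeddings
    return x + positional_encoding[:len(token_ids)]  # add position information

vocab_size, d_model, max_len = 1000, 8, 64
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))
positional = rng.normal(size=(max_len, d_model))     # e.g. the sinusoidal encoding described below
x = embed_input(np.array([5, 42, 7]), embedding_table, positional)
```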
Encoder/decoder architecture
Like earlier seq2seq models, the original Transformer model used an encoder/decoder architecture. The encoder consists of encoding layers that process the input iteratively one layer after another, while the decoder consists of decoding layers that do the same thing to the encoder's output.
The function of each encoder layer is to generate encodings that contain information about which parts of the inputs are relevant to each other. It passes its encodings to the next encoder layer as inputs. Each decoder layer does the opposite, taking all the encodings and using their incorporated contextual information to generate an output sequence.[12] To achieve this, each encoder and decoder layer makes use of an attention mechanism.
For each part of the input, attention weighs the relevance of every other part and draws from them to produce the output.[13] Each decoder layer has an additional attention mechanism that draws information from the outputs of previous decoders, before the decoder layer draws information from the encodings.
Both the encoder and decoder layers have a feed-forward neural network for additional processing of the outputs and contain residual connections and layer normalization steps.[13]
Scaled dot-product attention
The transformer's building blocks are scaled dot-product attention units. When a sentence is passed into a transformer model, attention weights are calculated between every pair of tokens simultaneously. The attention unit produces, for every token, a contextual embedding that combines information about the token itself with a weighted combination of other relevant tokens, each weighted by its attention weight.
For each attention unit, the transformer model learns three weight matrices: the query weights W_Q, the key weights W_K, and the value weights W_V. For each token i, the input word embedding x_i is multiplied with each of the three weight matrices to produce a query vector q_i = x_i W_Q, a key vector k_i = x_i W_K, and a value vector v_i = x_i W_V. Attention weights are calculated using the query and key vectors: the attention weight a_{ij} from token i to token j is the dot product between q_i and k_j. The attention weights are divided by the square root of the dimension of the key vectors, √d_k, which stabilizes gradients during training, and passed through a softmax which normalizes the weights. The fact that W_Q and W_K are different matrices allows attention to be non-symmetric: if token i attends to token j (i.e. q_i · k_j is large), this does not necessarily mean that token j will attend to token i (i.e. q_j · k_i could be small). The output of the attention unit for token i is the weighted sum of the value vectors of all tokens, weighted by a_{ij}, the attention from token i to each token j.
The attention calculation for all tokens can be expressed as one large matrix calculation using the softmax function, which is useful for training because optimized matrix-multiplication routines compute it quickly. The matrices Q, K and V are defined as the matrices whose i-th rows are the vectors q_i, k_i, and v_i respectively. Then we can represent the attention as

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\mathsf{T}}{\sqrt{d_k}}\right)V

where softmax is taken over the horizontal axis.
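A minimal NumPy sketch of this calculation for a single attention head, without masking (shapes and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    """X: (seq_len, d_model) token embeddings; returns (seq_len, d_v) outputs."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # query, key and value vectors per token
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # attention logits, scaled by sqrt(d_k)
    A = softmax(scores, axis=-1)             # row i: attention weights from token i
    return A @ V                             # weighted sum of value vectors
```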
Multi-head attention
One set of matrices is called an attention head, and each layer in a transformer model has multiple attention heads. While each attention head attends to the tokens that are relevant to each token, with multiple attention heads the model can do this for different definitions of "relevance". In addition, the influence field representing relevance can become progressively dilated in successive layers. Many transformer attention heads encode relevance relations that are meaningful to humans. For example, some attention heads can attend mostly to the next word, while others mainly attend from verbs to their direct objects.[14] The computations for each attention head can be performed in parallel, which allows for fast processing. The outputs for the attention layer are concatenated to pass into the feed-forward neural network layers.
Concretely, let the multiple attention heads be indexed by i, then we have

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}_i\!\left(\mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)\right) W^O

where the matrices W_i^Q, W_i^K, W_i^V are "projection matrices" owned by individual attention head i, and W^O is a final projection matrix owned by the whole multi-headed attention head.
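A sketch of multi-head attention built on the scaled_dot_product_attention function from the sketch above; representing the heads as a list of weight triples is an illustrative choice:

```python
import numpy as np

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples, one per attention head.
    Each head runs independently (and could be computed in parallel);
    the concatenated head outputs are projected by W_O."""
    head_outputs = [scaled_dot_product_attention(X, W_Q, W_K, W_V)
                    for (W_Q, W_K, W_V) in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_O
```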
Masked attention
It may be necessary to cut out attention links between some word-pairs. For example, the decoder for token position t should not have access to token position t + 1. This may be accomplished before the softmax stage by adding a mask matrix M that is −∞ at entries where the attention link must be cut, and 0 at other places.
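A sketch of such a causal mask in NumPy (the helper name is illustrative):

```python
import numpy as np

def causal_mask(seq_len):
    """M[i, j] = -inf where token i must not attend to token j (here: j > i), else 0."""
    M = np.zeros((seq_len, seq_len))
    M[np.triu_indices(seq_len, k=1)] = -np.inf
    return M

# Usage: add the mask to the scaled logits before the softmax, e.g.
#   A = softmax(Q @ K.T / np.sqrt(d_k) + causal_mask(len(Q)))
```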
Encoder
Each encoder consists of two major components: a self-attention mechanism and a feed-forward neural network. The self-attention mechanism accepts input encodings from the previous encoder and weights their relevance to each other to generate output encodings. The feed-forward neural network further processes each output encoding individually. These output encodings are then passed to the next encoder as its input, as well as to the decoders.
The first encoder takes positional information and embeddings of the input sequence as its input, rather than encodings. The positional information is necessary for the transformer to make use of the order of the sequence, because no other part of the transformer makes use of this.[1]
The encoder is bidirectional. Attention can be placed on tokens before and after the current token. Tokens are used instead of words to account for polysemy.
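A simplified sketch of one encoder layer combining these components, reusing scaled_dot_product_attention from above; the layer normalization here omits the learned scale and bias for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)          # learned scale/bias omitted for brevity

def encoder_layer(X, attn_weights, W1, b1, W2, b2):
    """Self-attention followed by a position-wise feed-forward network,
    each wrapped in a residual connection and layer normalization."""
    W_Q, W_K, W_V = attn_weights
    X = layer_norm(X + scaled_dot_product_attention(X, W_Q, W_K, W_V))
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2   # ReLU feed-forward, applied per token
    return layer_norm(X + ffn)
```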

Positional encoding
A positional encoding is a fixed-size vector representation that encapsulates the relative positions of tokens within a target sequence: it provides the transformer model with information about where the words are in the input sequence.
The positional encoding is defined as a function of type f: ℝ → ℝ^d, where d is a positive even integer. The full positional encoding, as defined in the original paper, is given by the equation:

(f(t))_{2k} = \sin\!\left(\frac{t}{r^k}\right), \qquad (f(t))_{2k+1} = \cos\!\left(\frac{t}{r^k}\right), \qquad k = 0, 1, \ldots, \tfrac{d}{2} - 1

where r = N^{2/d}.
Here, N is a free parameter that should be significantly larger than the biggest t that would be input into the positional encoding function. In the original paper,[1] the authors chose N = 10000.
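A NumPy sketch of this encoding, following the definition above with N = 10000:

```python
import numpy as np

def positional_encoding(seq_len, d, N=10000):
    """Return a (seq_len, d) matrix whose row t is f(t):
    even columns hold sin(t / r^k), odd columns cos(t / r^k), with r = N^(2/d)."""
    t = np.arange(seq_len)[:, None]      # positions 0 .. seq_len - 1
    k = np.arange(d // 2)[None, :]       # index of each sin/cos pair
    angles = t / N ** (2 * k / d)        # t / r^k
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```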
The function is in a simpler form when written as a complex function of type f: ℝ → ℂ^{d/2}:

f(t) = \left(e^{it/r^k}\right)_{k = 0, 1, \ldots, \frac{d}{2} - 1}

where r = N^{2/d}. The main reason the authors chose this as the positional encoding function is that it allows one to perform shifts as linear transformations:

f(t + \Delta t) = \mathrm{diag}\!\left(f(\Delta t)\right) f(t)

where Δt is the distance one wishes to shift. This allows the transformer to take any encoded position, and find the encoding of the position n-steps-ahead or n-steps-behind, by a matrix multiplication. By taking a linear sum, any convolution can also be implemented as linear transformations:

\sum_j c_j\, f(t + \Delta t_j) = \left(\sum_j c_j\, \mathrm{diag}\!\left(f(\Delta t_j)\right)\right) f(t)

for any constants c_j. This allows the transformer to take any encoded position and find a linear sum of the encoded locations of its neighbors. This sum of encoded positions, when fed into the attention mechanism, would create attention weights on its neighbors, much like what happens in a convolutional neural network language model. In the authors' words, "we hypothesized it would allow the model to easily learn to attend by relative position".
In typical implementations, all operations are done over the real numbers, not the complex numbers, but since complex multiplication can be implemented as real 2-by-2 matrix multiplication, this is a mere notational difference.
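A small numerical check of the shift property, written directly with the complex form f(t) = (e^{it/r^k})_k under the definitions above:

```python
import numpy as np

def f_complex(t, d, N=10000):
    k = np.arange(d // 2)
    return np.exp(1j * t / N ** (2 * k / d))   # f(t) = (e^{i t / r^k})_k

d, t, dt = 8, 3.0, 5.0
lhs = f_complex(t + dt, d)
rhs = f_complex(dt, d) * f_complex(t, d)       # diag(f(dt)) applied to f(t), elementwise
assert np.allclose(lhs, rhs)                   # shifting is a linear map on the encoding
```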
Other positional encoding schemes exist.[15]
Decoder
Each decoder consists of three major components: a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted which instead draws relevant information from the encodings generated by the encoders. This mechanism can also be called the encoder-decoder attention.[1][13]
Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. The transformer must not use the current or future output to predict an output, so the output sequence must be partially masked to prevent this reverse information flow.[1] This allows for autoregressive text generation. For all attention heads, attention can't be placed on following tokens. The last decoder is followed by a final linear transformation and softmax layer, to produce the output probabilities over the vocabulary.
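A schematic of the final projection and greedy autoregressive generation loop, assuming a hypothetical decoder_stack function that runs the masked decoder layers (purely illustrative, not the original implementation):

```python
import numpy as np

def generate_greedy(prompt_ids, decoder_stack, W_out, max_new_tokens=20, eos_id=0):
    """Run the causally masked decoder repeatedly, project the last hidden state
    onto the vocabulary, and append the most probable token each step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        h = decoder_stack(np.array(ids))   # (seq_len, d_model) hidden states
        logits = h[-1] @ W_out             # final linear layer over the vocabulary
        next_id = int(np.argmax(logits))   # softmax is monotonic, so argmax suffices
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```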
All members of OpenAI's GPT series have a decoder-only architecture.
Subsequent work
Sub-quadratic transformers
Training transformer-based architectures can be expensive, especially for long inputs.[16] Alternative architectures include the Reformer (which reduces the computational load from O(N²) to O(N ln N)[16]), or models like ETC/BigBird (which can reduce it to O(N)),[17] where N is the length of the sequence. The Reformer achieves this using locality-sensitive hashing and reversible layers.[18][19]
Ordinary transformers require a memory size that is quadratic in the size of the context window. Attention Free Transformers[20] reduce this to a linear dependence while still retaining the advantages of a transformer by linking the key to the value.
Long Range Arena (2020)[21] is a standard benchmark for comparing the behavior of transformer architectures over long inputs.
Random Feature Attention (2021)[22] uses Fourier random features:

\varphi(x) = \frac{1}{\sqrt{D}}\left[\cos\langle w_1, x\rangle, \sin\langle w_1, x\rangle, \ldots, \cos\langle w_D, x\rangle, \sin\langle w_D, x\rangle\right]^\mathsf{T}

where w_1, \ldots, w_D are independent samples from the normal distribution N(0, \sigma^{-2} I). This choice of parameters satisfies E[\langle \varphi(x), \varphi(y)\rangle] = e^{-\|x - y\|^2 / (2\sigma^2)}, or

e^{\langle x, y\rangle / \sigma^2} = E\left[\left\langle e^{\|x\|^2/(2\sigma^2)}\, \varphi(x),\; e^{\|y\|^2/(2\sigma^2)}\, \varphi(y)\right\rangle\right]

Consequently, the one-headed attention, with one query, can be written as

\mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\left(\frac{qK^\mathsf{T}}{\sqrt{d}}\right)V \approx \frac{\varphi(q)^\mathsf{T} \sum_i e^{\|k_i\|^2/(2\sigma^2)}\, \varphi(k_i)\, v_i^\mathsf{T}}{\varphi(q)^\mathsf{T} \sum_i e^{\|k_i\|^2/(2\sigma^2)}\, \varphi(k_i)}

where \sigma = d^{1/4}. Similarly for multiple queries, and for multiheaded attention. This approximation can be computed in linear time, as we can compute the matrix \sum_i e^{\|k_i\|^2/(2\sigma^2)}\, \varphi(k_i)\, v_i^\mathsf{T} first, then multiply it with the query. In essence, we have managed to obtain a more precise version of

\mathrm{Attention}(Q, K, V) \approx \varphi(Q)\, \varphi(K)^\mathsf{T}\, V

Performer (2022)[23] uses the same Random Feature Attention, but w_1, \ldots, w_D are first independently sampled from the same normal distribution, then they are Gram–Schmidt processed.
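A sketch of the random-feature approximation for a single query, following the equations above; the matrix S and the normalizer z depend only on the keys and values, so they can be computed once and reused for every query, which is what makes the method linear in sequence length (all names are illustrative):

```python
import numpy as np

def random_features(x, W):
    """phi(x): rows of W are the sampled frequency vectors w_1 .. w_D."""
    D = W.shape[0]
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

def rfa_single_query(q, K, V, W, sigma):
    """Linear-time approximation of softmax attention for a single query."""
    phi_q = random_features(q, W)
    scale = np.exp((K ** 2).sum(axis=1) / (2 * sigma ** 2))   # e^{||k_i||^2 / (2 sigma^2)}
    phi_K = np.stack([random_features(k, W) for k in K])      # (n, 2D) feature matrix
    S = (scale[:, None] * phi_K).T @ V                        # sum_i ... phi(k_i) v_i^T, reusable
    z = (scale[:, None] * phi_K).sum(axis=0)                  # normalizer, also reusable
    return (phi_q @ S) / (phi_q @ z)

d, n, D = 4, 6, 256
sigma = d ** 0.25
rng = np.random.default_rng(0)
W = rng.normal(scale=1.0 / sigma, size=(D, d))  # w_i ~ N(0, sigma^-2 I)
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
approx = rfa_single_query(q, K, V, W, sigma)
```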
Multimodality
Transformers can also be used or adapted for modalities (input or output) beyond text.
Perceivers by Andrew Jaegle et al. (2021)[24][25] can learn from large amounts of heterogeneous data.
Vision transformers by Jean-Baptiste Cordonnier et al.[26][27] break down input images into a series of patches which, once transformed into vectors, are treated like words in a standard transformer.
Regarding image outputs, Peebles et al. introduced a diffusion transformer (DiT), which facilitates the use of the transformer architecture for diffusion-based image production.[28] Also, Google released a transformer-centric image generator called "Muse" based on parallel decoding and masked generative transformer technology.[29] (Transformers played a less central role in prior image-producing technologies,[30] albeit still a significant one.[31])
Training
Methods for stabilizing training
The plain Transformer architecture has difficulty converging. In the original paper,[1] the authors recommended using learning rate warmup: the learning rate should linearly scale up from 0 to its maximal value for the first part of training (usually recommended to be 2% of the total number of training steps), before decaying.
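A sketch of such a schedule; the linear warmup follows the description above, while the inverse-square-root decay is one common choice rather than something prescribed here:

```python
def learning_rate(step, max_lr, warmup_steps):
    """Linear warmup from 0 to max_lr, then inverse-square-root decay."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (warmup_steps / step) ** 0.5
```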
A 2020 paper found that using layer normalization before (instead of after) multiheaded attention and feedforward layers stabilizes training, not requiring learning rate warmup.[32]
Pretrain-finetune
Transformers typically undergo self-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a larger dataset than fine-tuning, due to the limited availability of labeled training data. Tasks for pretraining and fine-tuning commonly include:
- language modeling[5]
- next-sentence prediction[5]
- question answering[6]
- reading comprehension
- sentiment analysis[33]
- paraphrasing[33]
Applications
The transformer has had great success in natural language processing (NLP), for example the tasks of machine translation and time series prediction. Many pretrained models such as GPT-2, GPT-3, GPT-4, BERT, XLNet, RoBERTa and ChatGPT demonstrate the ability of transformers to perform a wide variety of such NLP-related tasks, and have the potential to find real-world applications. These may include:
- machine translation
- document summarization
- document generation
- named entity recognition (NER)
- biological sequence analysis
- video understanding.
In addition to NLP applications, the transformer has also been successful in other fields, such as computer vision and protein folding applications (such as AlphaFold).
Implementations
The transformer model has been implemented in standard deep learning frameworks such as TensorFlow and PyTorch.
Transformers is a library produced by Hugging Face that supplies transformer-based architectures and pretrained models.[3]
See also
- Perceiver – Machine learning algorithm for non-textual data
- BERT (language model) – Masked neural language model developed by Google
- GPT-3 – 2020 text-generating language model
- GPT-4 – 2023 text-generating language model
- ChatGPT – AI chatbot developed by OpenAI
- Wu Dao – Chinese multimodal artificial intelligence program
- Vision transformer – Machine learning algorithm for vision processing
- BLOOM (language model) – Open-access multilingual language model
References
- Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017-06-12). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
- He, Cheng (31 December 2021). "Transformer in CV". Towards Data Science. Archived from the original on 16 April 2023. Retrieved 19 June 2021.
- Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). "Transformers: State-of-the-Art Natural Language Processing". Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
- Hochreiter, Sepp; Schmidhuber, Jürgen (1 November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
- "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing". Google AI Blog. Archived from the original on 2021-01-13. Retrieved 2019-08-25.
- "Better Language Models and Their Implications". OpenAI. 2019-02-14. Archived from the original on 2020-12-19. Retrieved 2019-08-25.
- Schmidhuber, Jürgen (1993). "Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets". ICANN 1993. Springer. pp. 460–463.
- Schmidhuber, Jürgen (1992). "Learning to control fast-weight memories: an alternative to recurrent nets". Neural Computation. 4 (1): 131–139.
- Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
- Schmidhuber, Juergen (2022). "Annotated History of Modern AI and Deep Learning". arXiv:2212.11279 [cs.NE].
- "Papers with Code - A Decomposable Attention Model for Natural Language Inference". paperswithcode.com.
- "Sequence Modeling with Neural Networks (Part 2): Attention Models". Indico. 2016-04-18. Archived from the original on 2020-10-21. Retrieved 2019-10-15.
- Alammar, Jay. "The Illustrated Transformer". jalammar.github.io. Archived from the original on 2020-10-18. Retrieved 2019-10-15.
- Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (August 2019). "What Does BERT Look at? An Analysis of BERT's Attention". Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence, Italy: Association for Computational Linguistics: 276–286. doi:10.18653/v1/W19-4828. Archived from the original on 2020-10-21. Retrieved 2020-05-20.
- Dufter, Philipp; Schmitt, Martin; Schütze, Hinrich (2022-06-06). "Position Information in Transformers: An Overview". Computational Linguistics. 48 (3): 733–763. doi:10.1162/coli_a_00445. ISSN 0891-2017. S2CID 231986066.
- Kitaev, Nikita; Kaiser, Łukasz; Levskaya, Anselm (2020). "Reformer: The Efficient Transformer". arXiv:2001.04451 [cs.LG].
- "Constructing Transformers For Longer Sequences with Sparse Attention Methods". Google AI Blog. Archived from the original on 2021-09-18. Retrieved 2021-05-28.
- "Tasks with Long Sequences – Chatbot". Coursera. Archived from the original on 2020-10-26. Retrieved 2020-10-22.
- "Reformer: The Efficient Transformer". Google AI Blog. Archived from the original on 2020-10-22. Retrieved 2020-10-22.
- Zhai, Shuangfei; Talbott, Walter; Srivastava, Nitish; Huang, Chen; Goh, Hanlin; Zhang, Ruixiang; Susskind, Josh (2021-09-21). "An Attention Free Transformer". arXiv:2105.14103 [cs.LG].
- Tay, Yi; Dehghani, Mostafa; Abnar, Samira; Shen, Yikang; Bahri, Dara; Pham, Philip; Rao, Jinfeng; Yang, Liu; Ruder, Sebastian; Metzler, Donald (2020-11-08). "Long Range Arena: A Benchmark for Efficient Transformers". arXiv:2011.04006 [cs.LG].
- Peng, Hao; Pappas, Nikolaos; Yogatama, Dani; Schwartz, Roy; Smith, Noah A.; Kong, Lingpeng (2021-03-19). "Random Feature Attention". arXiv:2103.02143 [cs].
- Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Belanger, David; Colwell, Lucy; Weller, Adrian (2020-09-30). "Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers". arXiv:2006.03555 [cs, stat].
- Jaegle, Andrew; Gimeno, Felix; Brock, Andrew; Zisserman, Andrew; Vinyals, Oriol; Carreira, Joao (2021-06-22). "Perceiver: General Perception with Iterative Attention". arXiv:2103.03206 [cs.CV].
- Jaegle, Andrew; Borgeaud, Sebastian; Alayrac, Jean-Baptiste; Doersch, Carl; Ionescu, Catalin; Ding, David; Koppula, Skanda; Zoran, Daniel; Brock, Andrew; Shelhamer, Evan; Hénaff, Olivier (2021-08-02). "Perceiver IO: A General Architecture for Structured Inputs & Outputs". arXiv:2107.14795 [cs.LG].
- Cordonnier, Jean-Baptiste; Loukas, Andreas; Jaggi, Martin (2019). "On the Relationship between Self-Attention and Convolutional Layers". arXiv:1911.03584 [cs.LG].
- Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
- https://www.arxiv-vanity.com/papers/2212.09748/
- https://www.infoq.com/news/2023/01/google-muse-text-to-image/
- https://blog.metaphysic.ai/muse-googles-super-fast-text-to-image-model-abandons-latent-diffusion-for-transformers/
- https://www.marktechpost.com/2022/11/14/how-do-dall%C2%B7e-2-stable-diffusion-and-midjourney-work
- Xiong, Ruibin; Yang, Yunchang; He, Di; Zheng, Kai; Zheng, Shuxin; Xing, Chen; Zhang, Huishuai; Lan, Yanyan; Wang, Liwei; Liu, Tie-Yan (2020-06-29). "On Layer Normalization in the Transformer Architecture". arXiv:2002.04745 [cs.LG].
- Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics: 353–355. arXiv:1804.07461. doi:10.18653/v1/w18-5446. S2CID 5034059.
Further reading
- Hubert Ramsauer et al. (2020), "Hopfield Networks is All You Need" Archived 2021-09-18 at the Wayback Machine, preprint submitted for ICLR 2021. arXiv:2008.02217; see also authors' blog Archived 2021-09-18 at the Wayback Machine
  – Discussion of the effect of a transformer layer as equivalent to a Hopfield update, bringing the input closer to one of the fixed points (representable patterns) of a continuous-valued Hopfield network
- Alexander Rush, The Annotated transformer Archived 2021-09-22 at the Wayback Machine, Harvard NLP group, 3 April 2018