In the last few years, Artificial Intelligence (AI), and especially Natural Language Processing (NLP), have witnessed a revolution driven by one particular neural network architecture: the Transformer. It has become so ubiquitous, a bit like cars and the internal combustion engine, that it is now a constant target for further improvement. Thanks to this popularity, a wide range of Transformer variants now exists, each incorporating advances that address different aspects of the model. In this multi-part blog series, we offer a bird’s-eye view of the different Transformer versions.
Attention Is All You Need
First, we revisit the roots of this groundbreaking neural network architecture. In a seminal 2017 paper, a group of researchers proposed a challenger to the status quo in NLP: the Transformer. They criticized the state of the art of the time, which was dominated by recurrent neural networks (RNNs) and convolutional neural networks (CNNs).
The researchers identified three weaknesses of RNNs and CNNs: 1. the total amount of computation they require; 2. the low degree to which those computations can be parallelized; 3. their limited ability to model long-range connections between the elements in a sequence (e.g., words in a sentence). The last one is particularly critical. An RNN, for example, processes a sentence or a document sequentially, word by word. By the time it reaches the last element, mostly the information from the immediate neighbors has been retained, while information from the beginning of the sentence has largely faded.
The figure below demonstrates how information flows between the words while a Transformer processes a sentence. At each processing step, information reaches a word in parallel from all other words within the context. This approach tackles all three issues that RNNs had.
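The idea that every word receives information from all words in a single step can be illustrated with a minimal sketch of scaled dot-product self-attention, the mechanism introduced in the 2017 paper. This is a simplified illustration, not the full layer: it assumes that queries, keys, and values are the raw input vectors (a real Transformer applies learned projections and several attention heads), and the example embeddings are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Toy self-attention where queries, keys, and values are the inputs themselves.

    Assumption: learned projections and multiple heads are omitted for brevity.
    """
    d = len(vectors[0])
    weights = []
    for q in vectors:
        # Score the current word against every word in the sequence.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in vectors
        ]
        weights.append(softmax(scores))
    # Each output is a weighted average of all input vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical 2-dimensional embeddings for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sentence)
```

Every row of `weights` is a probability distribution over all positions, so each word draws on the whole sentence at once. The rows do not depend on each other, which is why they can be computed in parallel.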
The Transformer layer, visualized in the graphic below, became the core building block of many generations of Transformer architectures to come. The architecture has two main components: the encoder, visualized on the left side, and the decoder, on the right.
BERT (Bidirectional Encoder Representations from Transformers) is one of the first Transformers to demonstrate a breakthrough when applied in a transfer learning context. Transfer learning is an approach where a neural network is first trained on one task and then fine-tuned on another. This approach typically improves performance on the second task.
The key technical innovation of BERT is masked language modeling (MLM), proposed by researchers at Google AI. The technique enables bidirectional training that uses the same information flow as the encoder of the original Transformer. It caused a lot of excitement in the natural language processing community because, at the time, it demonstrated state-of-the-art performance on a variety of benchmarks.
In the figure below, we have a high-level example of how one of the words, w4, is masked. The model is then asked to guess the actual token given its context. Only 15% of the tokens in a sequence are selected for prediction during BERT's training. The selected tokens are then treated in one of the following ways:
- 80% are replaced with a special mask token (“[MASK]”), which signals to the model that the word has been “hidden” from it.
- 10% are replaced with a random word
- 10% keep the original word
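The selection and replacement scheme above can be sketched in a few lines of Python. This is an illustrative sketch rather than BERT's actual preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the whitespace-level tokens are assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Apply the 15% selection and the 80/10/10 replacement rule.

    Returns the corrupted sequence and a list of labels, where a label is
    the original token at selected positions and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: the model is not asked to predict it
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN          # 80%: hide the token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: swap in a random word
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Hypothetical toy vocabulary and sentence.
vocab = ["the", "cat", "sat", "on", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

During training, the loss would be computed only at positions where `labels` is not `None`. Keeping or randomly swapping some of the selected tokens means the model cannot rely on “[MASK]” always marking the positions it must predict.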
In addition, BERT is pre-trained on another task, next sentence prediction (NSP). It is comparable to MLM but operates at the level of whole sentences. BERT is given a pair of sentences and asked to predict whether the second one belongs to the context of the first, that is, whether it actually follows it. In 50% of the cases, the second sentence is replaced with a random one.
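As a rough illustration of how such sentence pairs could be built, here is a minimal sketch. The helper `make_nsp_pairs` and the example sentences are hypothetical, and the sketch assumes the corpus contains at least one sentence other than the two in the current pair.

```python
import random


def make_nsp_pairs(sentences, corpus, rng=None):
    """Build (sentence_a, sentence_b, is_next) examples.

    With probability 0.5 the true next sentence is kept (is_next=True);
    otherwise it is replaced with a random sentence from the corpus.
    """
    if rng is None:
        rng = random.Random()
    pairs = []
    for first, second in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            pairs.append((first, second, True))
        else:
            # Exclude the pair itself so a negative example is truly negative.
            candidates = [s for s in corpus if s not in (first, second)]
            pairs.append((first, rng.choice(candidates), False))
    return pairs


# Hypothetical document and a slightly larger corpus to draw negatives from.
document = ["I went to the store.", "I bought milk.", "Then I went home."]
corpus = document + ["Penguins live in the south.", "The sky is blue."]
pairs = make_nsp_pairs(document, corpus, rng=random.Random(0))
```

The resulting `is_next` flag is the binary label the model is trained to predict for each pair.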
Combining MLM and NSP, BERT can learn a bidirectional representation of the whole sequence that enables state-of-the-art results in benchmarks.
GPT (Generative Pre-trained Transformer) and its successors, GPT-2 and GPT-3, are the other contenders for the most popular Transformer architecture besides BERT. Researchers at OpenAI proposed it at roughly the same time as BERT and reported benchmark results comparable to BERT's.
Unlike BERT, GPT uses the decoder part of the Transformer. Hence, it is pre-trained via causal language modeling (CLM): GPT learns to predict the next word given the preceding context. This type of language modeling tends to yield weaker performance on downstream tasks such as classification. However, GPT excels at generating very natural-sounding text that sometimes tricks people into believing a human being wrote it.
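To make the contrast with masked language modeling concrete, the following sketch shows the two ingredients of causal language modeling: a mask that lets each position see only itself and earlier positions, and the next-word training examples that result. The function names are made up for this illustration and do not come from any GPT codebase.

```python
def causal_mask(n):
    """Row i is True for positions 0..i: a token sees only itself and the past."""
    return [[j <= i for j in range(n)] for i in range(n)]


def next_token_examples(tokens):
    """Pair every left-hand context with the word that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


mask = causal_mask(3)
# mask == [[True, False, False],
#          [True, True,  False],
#          [True, True,  True]]

examples = next_token_examples(["the", "cat", "sat"])
# examples == [(["the"], "cat"), (["the", "cat"], "sat")]
```

Because the model never sees words to the right of the one it predicts, the same setup can be used at generation time: predict a word, append it to the context, and repeat.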
Citing ethical and security concerns, the research team at OpenAI did not initially release the resources needed to reproduce their work and only did so at a later stage. The most recent version is GPT-3, a behemoth with a total of 175 billion parameters, about which we also wrote in our article GPT-3 – the next level of AI.
This concludes the first part of our blog series. We presented an overview of the first Transformers, compared them to earlier approaches such as RNNs, and distinguished them from each other. Stay tuned for the second part, where we will present the second wave of Transformers and the architectural additions that bring further improvements.