Transformers have achieved great success in machine learning, especially in NLP with large language models (and, more recently, in image processing). They are one of the hottest topics in the field right now, so it is not surprising that big technology companies such as Google, Microsoft, and Facebook are investing heavily in the technology. We recently reported in our GPT-3 blog article on OpenAI’s 175-billion-parameter GPT-3, which has been exclusively licensed to Microsoft.
In January, Google published a new, comprehensive paper, “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” In it, a research team proposes a new method to increase the performance of Transformers significantly: it allows the number of parameters in a model to scale while keeping the number of mathematical operations (the standard metric for ML computational cost) constant.
The Switch Transformer sets new standards. The largest model has 1.6 trillion parameters and makes GPT-3 look like a toy. In this article, we report on the main features of the Switch Transformer.
What is being “switched”?
In a way, the architecture of a data center served as a template for the model. There, so-called switches ensure that incoming data packets are forwarded only to the devices for which they are intended; all other components remain untouched. The idea seems trivial, and yet it has arrived in machine learning only recently. Until now, during the training of a neural network, the input data activated all parameters in all layers.
A Switch Transformer works in a very similar way. The input data are propagated through the model, and they activate only particular layers instead of all of them. The implicit assumption is that not all information stored in the model is relevant to a specific input. ”So what? What is the big deal?” The answer is quite simple, and yet groundbreaking:
The method decouples the computational cost from the overall size of the model.
This form of data processing within a neural network is known as a “Mixture of Experts” (MoE): a machine learning technique that uses multiple expert networks to partition the problem space into homogeneous regions. The underlying idea goes back to the early 1990s; the sparsely gated variant used here was introduced in the 2017 paper “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” which showed a significant improvement over standard models.
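The decoupling of compute from model size is easy to see in a back-of-the-envelope calculation. The sketch below uses made-up layer sizes (`d_model` and `d_ff` are illustrative, not the paper’s configuration) and counts only the FFN weight matrices:

```python
# Hypothetical sizes, for illustration only (not the paper's configuration).
d_model, d_ff = 1024, 4096

# One expert FFN: an up-projection and a down-projection (biases ignored).
ffn_params = d_model * d_ff + d_ff * d_model

for n_experts in (1, 8, 64):
    total = n_experts * ffn_params   # stored parameters grow with the experts...
    active = ffn_params              # ...but each token still uses exactly one expert
    print(f"{n_experts:3d} experts: {total:>12,} total params, {active:,} active per token")
```

The total parameter count grows linearly with the number of experts, while the parameters touched per token (and hence the per-token FLOPs) stay constant.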
The “Router” in Switch Transformers
To understand the principle in more detail, it helps to have a standard Transformer in mind. Its main element is the so-called “attention mechanism”: an attention layer recognizes which input data (e.g., which words in a sentence) are relevant to the task at hand. A conventional Transformer is a deep stack of layers, each of which runs several attention heads in parallel (“multi-head attention”).
In the standard architecture, each of these layers ends in a feed-forward network (FFN) that reassembles the outputs of the different “heads.” And this is precisely where the Switch Transformer comes in: it replaces this single FFN with several FFNs. These are the “experts.” When data is sent through the model, exactly one expert is activated for each element of the input. In other words, during a forward pass, a Switch Transformer uses about as many parameters as a standard Transformer with the same number of layers, although it contains many times more. On top of that come the routing parameters, but these are negligible in terms of the computing power required.
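The layer described above can be sketched in a few lines of NumPy. The function below is our own illustrative implementation, not the paper’s code: it applies a top-1 switch FFN layer to a batch of token vectors, so each token only ever touches the weights of the single expert it is routed to:

```python
import numpy as np

def switch_ffn(X, W_r, expert_weights):
    """Top-1 switch FFN layer (illustrative sketch).

    X: (n_tokens, d_model) token representations
    W_r: (d_model, n_experts) learnable routing matrix
    expert_weights: list of (W_in, W_out) pairs, one FFN per expert
    """
    logits = X @ W_r                                   # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # softmax per token
    top = probs.argmax(axis=1)                         # one expert per token
    out = np.zeros_like(X)
    for e, (W_in, W_out) in enumerate(expert_weights):
        mask = top == e                                # tokens routed to expert e
        if mask.any():
            h = np.maximum(X[mask] @ W_in, 0.0)        # expert FFN: ReLU(x W_in) W_out
            out[mask] = probs[mask, e, None] * (h @ W_out)
    return out

# Toy usage with random weights.
rng = np.random.default_rng(1)
n_tok, d, d_ff, n_exp = 5, 8, 16, 4
X = rng.normal(size=(n_tok, d))
W_r = rng.normal(size=(d, n_exp))
experts = [(rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))) for _ in range(n_exp)]
Y = switch_ffn(X, W_r, experts)
```

Note that only the rows selected by `mask` are ever multiplied with an expert’s weights; the other experts’ parameters sit idle for that token.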
It remains to explain how the experts are selected. The process is based on a simple sequence of operations:
- The numerical representation of each word, x, is multiplied by a routing matrix Wᵣ (a matrix of learnable parameters trained along with the rest of the model) to obtain a score for each expert: scores = x · Wᵣ.
- The scores are normalized into a probability distribution that sums to one across all experts: p = softmax(scores).
- x is routed to the expert i with the highest probability. Finally, the output (i.e., the updated token representation) is the activation produced by that expert, weighted by its probability score: x′ = pᵢ · Eᵢ(x).
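The three steps above can be written out directly. The sketch below uses NumPy with toy dimensions and random weights; the experts are stand-in linear maps rather than real FFNs:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 8, 4

x = rng.normal(size=d_model)                  # token representation
W_r = rng.normal(size=(d_model, n_experts))   # learnable routing matrix

# Step 1: one score per expert.
scores = x @ W_r
# Step 2: softmax normalizes the scores into probabilities that sum to one.
p = np.exp(scores - scores.max())
p /= p.sum()
# Step 3: pick the top-1 expert and weight its output by its probability.
i = int(np.argmax(p))
E = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # toy experts
x_new = p[i] * (E[i] @ x)
```

Because only expert i is evaluated, the cost of this step is independent of n_experts.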
What are the benefits?
To show the advantages of the Switch Transformer, the Google researchers compare it with another of their models, the Text-To-Text Transfer Transformer (T5). A Switch Transformer with only one expert is equivalent to a T5.
First, they show that the model gets better with each additional expert; two heads (or experts) are better than one, after all. They experimented with mixtures of up to 256 experts and kept seeing improvement. However, the effect of additional experts flattens out as their number grows and eventually converges: once this saturation is reached, further experts bring no benefit.
In a further step, the researchers compare the learning speed of the Switch Transformer with that of T5. They show that the model with MoE can learn two to seven times faster than its predecessor, i.e., it can achieve the same results with two to seven times less training data.
Last but not least, the Switch Transformer also achieves improvements on many NLP benchmarks, such as text classification and question answering. Most notably, there is a massive improvement on the Winogrande benchmark, which measures reasoning ability.
The Mixture of Experts proves to be a reliable method for scaling Transformers up enormously. Switch Transformers can be trained with much less effort and, as a result, set new standards for neural network size and for challenging NLP benchmarks. Decoupling computational cost from parameter count leaves room for even larger models and further improvement.