Transformers have made a big splash in machine learning, especially in natural language processing with language models (and, more recently, in image processing). They are among the most popular topics at the moment, and it is not surprising that big tech companies such as Google, Microsoft and Facebook are investing heavily in this technology. In our blog article "GPT-3 - The next level of AI" we already reported on OpenAI's 175-billion-parameter model GPT-3, which was exclusively licensed to Microsoft.
In January, Google published a comprehensive new paper, "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity". In it, the group proposes a new method to significantly increase the capacity of transformers. It makes it possible to multiply the number of parameters in a model while keeping the number of mathematical operations (the standard metric for ML computational cost) constant.
The Switch Transformer sets new standards: the model has 1.6 trillion parameters and makes GPT-3 look like a toy. In this article, we report on the main features of Switch Transformers.
What is "changed"?
In a way, the architecture of a data centre served as a model for the model. There, so-called switches ensure that incoming data packets are forwarded only to the devices for which they are intended; the other components remain unaffected. The idea seems banal, but it had not yet arrived in machine learning: when a standard neural network processes an input, the data activates all parameters in every layer.
A Switch Transformer works in a very similar way. The input data is propagated through the model, activating only certain parts of each layer rather than all of them. The implicit assumption is that not all the information stored in the model is relevant to a particular input. "So what?" you may now be thinking, "what's the big deal?" The answer is quite simple, and yet groundbreaking: the method decouples the computational cost from the overall size of the model.
The pioneers of this form of data processing within a neural network called their approach "Mixture of Experts" (MoE). It refers to a machine learning technique in which several experts divide the problem space into homogeneous regions. For modern deep learning, the technique was introduced in the 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", which showed a significant improvement over standard models.
The "router" of the Switch Transformers
To understand the principle in more detail, it helps to look at a standard transformer. The key element is the so-called attention mechanism: an attention layer recognises which input data, for example which words in a sentence, are relevant to the task at hand. A conventional transformer is a deep stack of such attention layers; within each layer, several attention heads run in parallel (multi-head attention).
In the standard architecture, each of these layers ends with a feed-forward network (FFN) that reassembles the outputs of the different heads. And this is exactly where the Switch Transformer comes in: it replaces this aggregation module with several FFNs, the "experts". When data is sent through the model, exactly one expert is activated for each element of the input. In other words, during a forward pass a Switch Transformer uses about as many parameters as a standard transformer with the same number of layers, even though it has many times the total parameters. The routing parameters come on top, but they are negligible in terms of the computing power required.
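A quick back-of-the-envelope calculation makes the decoupling of parameter count and compute concrete. The sizes below are purely illustrative (loosely inspired by common transformer configurations, not the exact figures from the paper):

```python
# Illustrative sizes, not the paper's configuration.
d_model = 768        # width of the token representations
d_ff = 3072          # hidden width of each feed-forward network (FFN)
num_experts = 8      # expert FFNs replacing the single FFN in one layer

# Parameters of one FFN: two weight matrices (biases ignored for simplicity)
ffn_params = d_model * d_ff + d_ff * d_model

dense_layer_params = ffn_params                 # standard transformer: one FFN
switch_layer_params = num_experts * ffn_params  # switch layer: one FFN per expert

# Per token, the switch layer still activates only a single expert,
# so the compute per token matches the dense layer.
active_params_per_token = ffn_params

print(dense_layer_params)       # 4718592
print(switch_layer_params)      # 37748736 -> 8x the stored parameters ...
print(active_params_per_token)  # 4718592  -> ... same active parameters per token
```

With eight experts, the layer stores eight times as many parameters, yet each token still touches only one expert's worth of weights per forward pass.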
It remains to explain how the experts are selected. The process is based on a simple sequence of operations:
- The numerical representation of each token, x, is multiplied by a routing matrix Wᵣ (a learnable parameter that is trained along with the rest of the model) to obtain a score for each expert: scores = x · Wᵣ.
- The scores are normalised to a probability distribution so that they sum to 1 across all experts: p = softmax(scores).
- x is routed to the expert i with the highest probability. The output (i.e. the updated token representation) is the activation produced by that expert, weighted by its probability score: x' = pᵢ · Eᵢ(x).
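The three steps above can be sketched in a few lines of NumPy. This is a minimal toy illustration of top-1 routing, with made-up sizes and randomly initialised weights, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, num_experts = 16, 4   # illustrative sizes

# Routing matrix W_r and the expert FFNs (each a small two-layer network)
W_r = rng.normal(size=(d_model, num_experts))
experts = [
    {"W_in": rng.normal(size=(d_model, 4 * d_model)),
     "W_out": rng.normal(size=(4 * d_model, d_model))}
    for _ in range(num_experts)
]

def switch_layer(x):
    """Route a single token representation x to exactly one expert."""
    scores = x @ W_r                           # step 1: one score per expert
    scores = scores - scores.max()             # for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()  # step 2: softmax -> probabilities
    i = int(np.argmax(p))                      # step 3: pick the top-1 expert
    expert = experts[i]
    h = np.maximum(x @ expert["W_in"], 0.0)    # expert FFN with ReLU activation
    return p[i] * (h @ expert["W_out"]), i     # output weighted by p_i

x = rng.normal(size=d_model)
x_new, chosen_expert = switch_layer(x)
```

Only the chosen expert's weights are ever multiplied with x; the other three experts stay idle for this token, which is exactly what keeps the per-token compute constant as experts are added.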
What are the advantages?
To show the advantages of Switch Transformers, the Google researchers compare them to another model, the Text-To-Text Transfer Transformer (T5). A Switch Transformer with only one expert is equivalent to T5.
First, they show that the model gets better with each additional expert, so two heads (or experts) are better than one after all. They experimented with mixtures of up to 256 experts and observed improvements throughout, although the effect of additional experts flattens out with increasing numbers and eventually saturates.
In a further step, the learning speed of the Switch Transformer was compared with that of T5. The researchers show that the MoE model can learn two to seven times faster than its predecessor, i.e. it can reach the same results with a fraction of the training.
Last but not least, the Switch Transformer also achieves improvements on many NLP benchmarks, such as text classification and question answering. Most notably, there is a very large improvement on the Winogrande benchmark, which measures common-sense reasoning.
The mixture of experts proves to be a reliable method for enormously scaling up transformers. Switch Transformers can be trained with much less effort and thus set new standards not only for the size of neural networks but also on challenging NLP benchmarks. The significant reduction in computational cost allows for a larger number of parameters and thus offers further potential for improvement.