What is Google GLaM (Generalist Language Model)?

The Generalist Language Model (GLaM for short) was introduced by Google as an efficient method for scaling language models using a Mixture-of-Experts (MoE) architecture. GLaM is a model with over a trillion weights that can be trained and served efficiently thanks to sparsity, while achieving competitive performance on a range of few-shot learning tasks. It was evaluated on 29 public natural language processing (NLP) benchmarks across seven categories, ranging from language completion to open-ended question answering and natural language inference.

To develop GLaM, Google created a dataset of 1.6 trillion tokens representing a wide range of use cases for the model. A filter was then built to assess the quality of web page content; it was trained on text from reputable sources such as Wikipedia and books. This filter was used to select a subset of web pages, which was combined with content from books and Wikipedia to produce the final training dataset.
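
Google's actual quality classifier is not described in detail here; the snippet below is only a minimal sketch of the filtering idea, with a placeholder quality_score() heuristic standing in for a model trained on Wikipedia and book text.

    from dataclasses import dataclass

    @dataclass
    class WebPage:
        url: str
        text: str

    def quality_score(text: str) -> float:
        # Placeholder heuristic standing in for a trained quality classifier:
        # longer, calmer sentences score higher than noisy, shouty text.
        sentences = [s for s in text.split(".") if s.strip()]
        avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
        noise = text.count("!") / max(len(text), 1)
        return min(avg_len / 20.0, 1.0) - noise

    def filter_web_pages(pages, threshold=0.5):
        """Keep only pages whose estimated quality exceeds the threshold."""
        return [p for p in pages if quality_score(p.text) >= threshold]

    pages = [
        WebPage("a.example", "The model is trained on a filtered corpus of web text."),
        WebPage("b.example", "WOW!!! click here!!! free!!!"),
    ]
    print([p.url for p in filter_web_pages(pages)])  # -> ['a.example']

The selected pages would then be merged with the book and Wikipedia content to form the final training corpus.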

Features and capabilities

The MoE model consists of different sub-models, or experts, each of which specialises in different inputs. A gating network in each MoE layer selects the two most suitable experts to process the data for each token. The full version of GLaM has 1.2T total parameters spread over 64 experts per MoE layer and 32 MoE layers in total, but activates only a subnetwork of 97B parameters per token prediction during inference.
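
Using only the figures quoted above, a quick back-of-the-envelope calculation shows how sparse the activation is per token:

    # Figures quoted above for the full GLaM configuration.
    total_params = 1.2e12        # 1.2T total parameters
    active_params = 97e9         # ~97B parameters activated per token
    experts_per_layer = 64
    experts_used_per_token = 2   # top-2 routing by the gating network

    print(f"Active fraction of the full model: {active_params / total_params:.1%}")
    print(f"Experts used per MoE layer per token: "
          f"{experts_used_per_token}/{experts_per_layer} "
          f"({experts_used_per_token / experts_per_layer:.1%})")

Per token, only about 8% of the full model's parameters are involved in a prediction, which is what makes training and serving the 1.2T-parameter model tractable.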

GLaM activates different experts for different types of input, which gives E x (E-1) possible feed-forward network combinations per MoE layer and thus greater computational flexibility. The final learned representation of a token is the weighted combination of the outputs of the two selected experts. To allow scaling to larger models, each expert within the GLaM architecture can span multiple computational units.
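
With 64 experts per layer, E x (E-1) corresponds to 64 x 63 = 4,032 possible expert pairings. GLaM itself is not reproduced here; the NumPy sketch below only illustrates the top-2 routing described above, with tiny made-up layer sizes: a gating network scores every expert, the two best experts process the token, and their outputs are combined using the renormalised gate weights.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_hidden, n_experts = 8, 16, 4   # toy sizes, not GLaM's real ones

    # Each expert is a small feed-forward network; the gate is a linear layer.
    experts = [(rng.normal(size=(d_model, d_hidden)),
                rng.normal(size=(d_hidden, d_model))) for _ in range(n_experts)]
    gate_w = rng.normal(size=(d_model, n_experts))

    def moe_layer(x):
        """Route one token through the two highest-scoring experts (top-2 gating)."""
        logits = x @ gate_w
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        top2 = np.argsort(probs)[-2:]               # indices of the two best experts
        weights = probs[top2] / probs[top2].sum()   # renormalise their gate scores
        out = np.zeros_like(x)
        for w, idx in zip(weights, top2):
            w_in, w_out = experts[idx]
            out += w * (np.maximum(x @ w_in, 0) @ w_out)  # ReLU feed-forward expert
        return out

    token = rng.normal(size=d_model)
    print(moe_layer(token).shape)   # (8,) -- same dimensionality as the input token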

GLaM was evaluated in zero-shot and one-shot settings, in which the tasks are never seen during training. It performed competitively on 29 public NLP benchmarks, ranging from cloze and completion tasks to open-ended question answering, Winograd-style tasks, commonsense reasoning, in-context reading comprehension, SuperGLUE tasks and natural language inference. Across these benchmarks, GLaM's performance is comparable to that of a dense language model such as GPT-3 (175B), with significantly better learning efficiency. When each MoE layer has only one expert, GLaM reduces to a basic dense, Transformer-based language model architecture. Its performance and scaling properties were investigated and compared with baseline dense models trained on the same datasets.
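
For illustration only (the prompts below are invented and not taken from the GLaM evaluation suite), the difference between the two settings is whether a single solved example is prepended to the prompt:

    # Zero-shot: the model sees only the question itself.
    zero_shot_prompt = (
        "Question: Which country has Paris as its capital?\n"
        "Answer:"
    )

    # One-shot: a single solved example is prepended; the task itself was
    # still never part of the training data.
    one_shot_prompt = (
        "Question: Which country has Rome as its capital?\n"
        "Answer: Italy\n\n"
        "Question: Which country has Paris as its capital?\n"
        "Answer:"
    )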