What is a Transformer?

Transformers in machine learning are a form of neural network that makes use of a so-called "attention mechanism". Here, one part of an input (for example, a word or word syllable of a sentence, or a pixel of an image) is related to the remaining parts of that input.

The aim of this method is for each particular word or pixel to contribute to the understanding of the overall data by combining it with the remaining components. With search queries on the internet, for example, understanding pronouns and prepositions in connection with nouns is essential, as only in this way can the meaning of the overall query be grasped.
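To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core computation behind such an attention mechanism. The toy sequence, dimensions, and helper name are illustrative assumptions, not the API of any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Relate every query position to every key position and mix
    the value vectors according to the resulting attention weights."""
    d_k = Q.shape[-1]
    # Similarity of each query with each key, scaled for numerical stability.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted combination of all value rows.
    return weights @ V

# Toy self-attention: 4 tokens, each an 8-dimensional vector; queries,
# keys and values all come from the same input sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```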

Transformers are predominantly applied in the field of deep learning, for example for text recognition, text processing, or image recognition.

Architecture in Deep Learning

The structure of a transformer in machine learning is basically divided into an encoder and a decoder. The data passes through the encoder and decoder in the following sequence.

In the first step of the process, the input is "embedded", i.e. transferred into processable data in the form of vectors. In the next step, the position of the vectors (or words in a sentence) is communicated to the transformer by "positional encoding" in the form of an indexing. This is followed by the first attention mechanism: in this multi-head attention layer, the transformer compares the data currently being processed (e.g. a word) with all other data (e.g. the remaining words in the sentence) and determines their relevance. Because of this self-comparison, this layer is also called "self-attention".
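As an illustration of the "positional encoding" step, the following sketch implements the sinusoidal encoding proposed in the original transformer paper ("Attention Is All You Need"); the sequence length and model dimension are arbitrary example values:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: each position receives a unique
    pattern of sine/cosine values that is added to the token embeddings."""
    pos = np.arange(seq_len)[:, None]         # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]      # index of each dimension pair
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)               # odd dimensions: cosine
    return pe

# Hypothetical example: 10 token embeddings of dimension 16.
embeddings = np.random.default_rng(1).normal(size=(10, 16))
x = embeddings + positional_encoding(10, 16)  # inject order information
```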

Now follows the "Add & Norm" step, in which a copy of the original data, taken before it passed through the multi-head attention layer, is added to the processed data from that layer and the sum is normalised. The last layer of the encoder is the feed-forward layer, a neural network with an input, a hidden and an output layer, whose non-linear activation function maps the values to a range from 0 to infinity. The encoder processing is completed with a repetition of the "Add & Norm" step described above.
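A minimal sketch of this encoder tail, i.e. "Add & Norm" around a sub-layer plus the position-wise feed-forward net (here with a ReLU activation, which maps values to the range 0 to infinity); all dimensions and weights are made-up examples:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(
        x.var(axis=-1, keepdims=True) + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection: the unchanged copy of the input is added
    to the sub-layer output, then the sum is normalised."""
    return layer_norm(x + sublayer_out)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward net: linear -> ReLU -> linear."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Hypothetical dimensions: 4 tokens of size 8, hidden layer of size 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
attn_out = x                                    # stand-in for attention output
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
h = add_and_norm(x, attn_out)                   # first Add & Norm
out = add_and_norm(h, feed_forward(h, W1, b1, W2, b2))  # FFN + Add & Norm
```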

In the next process step, the decoder starts by initialising and positioning an output sequence. This is done analogously to the encoder by "output embedding" and "positional encoding". This step is followed by the "masked multi-head attention" layer, which is particularly relevant in the training phase of the model: here, the decoder learns to generate or approximate a target output from actual training data. Due to the parallel mode of operation of the transformer, all positions of the output sequence are already available to the decoder in training mode, so the future positions of the output sequence are masked, i.e. obscured. This masking gives the layer its name.
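The masking itself can be sketched as an upper-triangular matrix of -inf values added to the attention scores before the softmax, so that each position can only attend to itself and earlier positions. This mirrors the attention sketch shown earlier and is again only an illustrative assumption:

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal: position i may only see positions <= i."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_attention(Q, K, V):
    """Self-attention in which future positions are masked out."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # masked entries become 0
    return weights @ V

# 5 output positions of dimension 8; position i ignores positions > i.
rng = np.random.default_rng(0)
y = rng.normal(size=(5, 8))
out = masked_attention(y, y, y)
```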

This layer is followed by another "Add & Norm" step before the data is passed on to the multi-head attention layer of the decoder. This layer is also referred to as "encoder-decoder attention", as it establishes a connection between the encoder and the decoder: it relates the input sequences processed in the encoder to the previously generated output sequences, and is therefore also called "cross-attention". This mechanism is required, for example, when translating a text into another language, in order to calculate which word of the target language should come next in the sentence. This layer is followed by another "Add & Norm" step, a feed-forward layer analogous to the one in the encoder, and a final "Add & Norm" step.

In the penultimate step of the transformer, the linear layer transfers the vectors processed so far into a larger vector in order to be able to represent, for example, the entire vocabulary of a target language in the context of a translation. The final softmax function then calculates a probability between 0 and 1 for each possible output, and thus the most probable final output.
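These final two steps can be sketched as a single matrix multiplication up to vocabulary size followed by a softmax; the vocabulary size and all weights here are invented for illustration:

```python
import numpy as np

def softmax(z):
    """Turn scores into probabilities between 0 and 1 that sum to 1."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: decoder output of dimension 16, target vocabulary
# of 1000 words.
rng = np.random.default_rng(2)
decoder_out = rng.normal(size=(1, 16))    # one decoding step
W_vocab = rng.normal(size=(16, 1000))     # linear layer into vocabulary size
logits = decoder_out @ W_vocab            # one score per vocabulary word
probs = softmax(logits)
next_word = int(probs.argmax())           # index of the most probable word
```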