What is a Vision Transformer (ViT)?

A Vision Transformer is a machine learning model for image processing that is used primarily for image recognition. It belongs to computer vision, the branch of computer science that analyses and processes photos and images in such a way that the information they contain can also be understood and "seen" by computers. This creates the basis for the further processing of the photos and images.

The Vision Transformer became widely known in 2020 through the paper "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale". Before that, transformers were used in the area of neural networks primarily for speech and text recognition in Natural Language Processing (NLP). Since the paper appeared, this idea has also been applied, in slightly modified form, to image processing and image recognition.

Vision Transformers are also implemented in program libraries such as PyTorch or Keras. Both are open-source libraries that are used in machine learning and deep learning and provide interfaces for the programming language Python and, in the case of PyTorch, also C++.
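As a small sketch of what this looks like in practice, the following snippet loads a pre-trained ViT-Base model from torchvision and classifies a single image. It assumes a recent torchvision version (0.13 or later, which ships ViT weights) and an image file "example.jpg" on disk; both are illustrative assumptions, not requirements from the article.

```python
# Minimal sketch: classify one image with a pre-trained ViT from torchvision.
# Assumes torchvision >= 0.13 and an RGB image file "example.jpg" (hypothetical path).
import torch
from torchvision import models
from torchvision.io import read_image

weights = models.ViT_B_16_Weights.DEFAULT          # pre-trained ImageNet weights
model = models.vit_b_16(weights=weights).eval()    # ViT-Base with 16x16 patches

preprocess = weights.transforms()                  # resizing / normalisation pipeline
img = read_image("example.jpg")                    # uint8 tensor of shape (3, H, W)
batch = preprocess(img).unsqueeze(0)               # add batch dimension

with torch.no_grad():
    logits = model(batch)                          # (1, 1000) class scores
print(weights.meta["categories"][logits.argmax().item()])
```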

How is a Vision Transformer constructed?

A Vision Transformer uses the same computational model, or algorithm, that is applied in text recognition and text processing software such as Google's BERT.

The heart of the Transformer is the so-called "attention" mechanism. Attention describes the relationship of one part of an input (a word, a pixel or similar) to the other parts of that input. The units of the input that are suitable for further processing are called tokens. In text recognition a token can be a word or a syllable; in image recognition it could, for example, be a single pixel. However, since an image consists of a large number of pixels, applying the algorithm to each individual pixel would hardly be efficient in terms of the required memory or time. The image is therefore divided into small sections, or patches (e.g. 14×14 or 16×16 pixels).
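To make the idea of attention concrete, the following sketch relates every token of a sequence to every other token via query/key similarity. The tensor sizes (196 patch tokens plus one class token, 768 dimensions) are illustrative and the projection matrices are random here, whereas in a real model they are learned.

```python
# Minimal sketch of self-attention: each token is weighted against all others.
import torch

tokens = torch.randn(1, 197, 768)   # 196 patch tokens + 1 class token, 768-dim (illustrative)

# In a trained model Wq, Wk, Wv are learned projections; random here for illustration.
Wq, Wk, Wv = (torch.randn(768, 768) for _ in range(3))

q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = q @ k.transpose(-2, -1) / (768 ** 0.5)   # similarity of each token to all others
weights = scores.softmax(dim=-1)                  # attention weights sum to 1 per token
out = weights @ v                                 # weighted mixture of all token values
print(out.shape)                                  # torch.Size([1, 197, 768])
```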

In the next step, the patches are vectorised ("flattening") and converted into "linear embeddings" by a linear transformation. Finally, the patches are given learnable position embeddings, which allow the computational model to learn information about the structure of the image.
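A minimal sketch of these steps for a 224×224 RGB image with 16×16 patches might look as follows; the sizes are illustrative, and the class token that a full ViT prepends to the sequence is omitted for brevity.

```python
# Minimal sketch: cut an image into patches, flatten them, project them linearly
# and add learnable position embeddings (illustrative sizes, not a full ViT).
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                       # (batch, channels, height, width)
patch, dim = 16, 768

# Non-overlapping 16x16 patches, flattened to vectors of length 3*16*16 = 768.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)        # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)  # (1, 196, 768)

embed = nn.Linear(3 * patch * patch, dim)               # learnable linear projection
pos = nn.Parameter(torch.zeros(1, 14 * 14, dim))        # learnable position embeddings

tokens = embed(patches) + pos                           # input tokens for the encoder
print(tokens.shape)                                     # torch.Size([1, 196, 768])
```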

Subsequently, the data is processed in a Transformer encoder. There, based on the existing training data, it is (pre-)classified with the help of attention layers and so-called multi-layer perceptrons (MLPs). The ViT-Base, ViT-Large and ViT-Huge models have between 12 and 32 layers and work with 86 million to 632 million parameters. Finally, an MLP head performs the final classification. Unlike the transformers used by BERT, for example, the Vision Transformer does not have a decoder.
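The following sketch stacks such encoder blocks (attention layer plus MLP, each with a residual connection) and adds an MLP head, roughly in the spirit of ViT-Base with its 12 layers and hidden size 768. The class names, the 1000-class head and other details are illustrative assumptions, not the original implementation.

```python
# Minimal sketch of a ViT-style encoder stack and classification head.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention with residual
        x = x + self.mlp(self.norm2(x))                      # MLP with residual
        return x

encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])  # ViT-Base: 12 layers
head = nn.Linear(768, 1000)                                     # MLP head for 1000 classes

tokens = torch.randn(1, 197, 768)            # class token + 196 patch tokens
logits = head(encoder(tokens)[:, 0])         # classify from the class token
print(logits.shape)                          # torch.Size([1, 1000])
```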

What is the difference to a Convolutional Neural Network?

Convolutional Neural Networks (CNNs) have been used in the field of computer vision for some time. The "convolution" refers to a mathematical operator that is applied when the algorithm is executed.
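As a small illustration of this operator, the snippet below slides a 3×3 filter over an image and produces a feature map; the filter values and image size are purely illustrative.

```python
# Minimal sketch of the convolution operator that gives CNNs their name.
import torch
import torch.nn.functional as F

img = torch.randn(1, 1, 28, 28)            # one grayscale image (illustrative)
kernel = torch.ones(1, 1, 3, 3) / 9.0      # simple 3x3 averaging filter

feature_map = F.conv2d(img, kernel, padding=1)   # same spatial size thanks to padding
print(feature_map.shape)                         # torch.Size([1, 1, 28, 28])
```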

The differences between a Convolutional Neural Network and a Vision Transformer lie primarily in the architecture, even though the two approaches share certain similarities. While CNNs usually consist of several layers that are processed sequentially, a Vision Transformer works largely in parallel. In CNNs, the convolutional and pooling layers play a particularly important role; they can be run through several times in succession and are concluded by one or more fully-connected layers. According to Google, their ViT outperforms a state-of-the-art CNN while using about four times fewer computing resources.
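For contrast with the ViT sketches above, the following is a minimal example of the sequential CNN structure just described: convolutional and pooling layers run through in succession, concluded by a fully-connected layer. The channel counts, input size and number of classes are illustrative assumptions.

```python
# Minimal sketch of the sequential CNN structure: conv -> pool -> ... -> fully connected.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer halves the resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 1000),               # fully-connected classification layer
)

img = torch.randn(1, 3, 224, 224)
print(cnn(img).shape)                            # torch.Size([1, 1000])
```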