Vision Transformer (ViT)

What is a Vision Transformer (ViT)?

A Vision Transformer is a model used in image processing, primarily for image recognition in machine learning. It belongs to the scientific field of computer vision, which analyses and processes photos and images so that the information they contain can be understood and "seen" by computers. This creates the basis for the further processing of those photos and images.

The Vision Transformer became widely known as an image recognition method in 2020 through the paper "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale". Before that, so-called Transformers were used primarily for speech and text recognition in natural language processing (NLP), an area of neural networks. Since the paper appeared, this idea has also been applied, in slightly modified form, to image processing and image recognition.

Vision Transformers are also implemented in some program libraries such as PyTorch or Keras. Both are open-source libraries that are used in machine learning and deep learning and are intended for use with the programming languages Python or C++.

How is a Vision Transformer constructed?

A Vision Transformer uses the same computational model, or algorithm, that is applied in text recognition and text processing software such as Google's BERT.

The heart of the Transformer is the so-called "attention" mechanism. Attention describes the relationship of one part of an input (a word, pixel or similar) to the other parts of that input. Such input units, which are then suitable for further processing, are called tokens. In text recognition a token can be a word or a syllable; in image recognition, for example, a single pixel. However, since an image consists of a large number of pixels, applying the algorithm to each individual pixel would hardly be efficient in terms of the required memory or time. The image is therefore divided into small sections, or patches (e.g. 14×14 or 16×16 pixels).
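The patch-splitting step can be illustrated with a short sketch. This is not code from any particular ViT library; it simply shows, using NumPy, how a 224×224 image divides into 16×16 patches:

```python
import numpy as np

def split_into_patches(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping patch_size x patch_size patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image size must be divisible by patch size"
    # Group the rows and columns of each patch together, then flatten the grid.
    patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size, patch_size, c)

image = np.zeros((224, 224, 3))   # a dummy 224x224 RGB image
patches = split_into_patches(image)
print(patches.shape)              # (196, 16, 16, 3): a 14x14 grid of 196 patches
```

A 224×224 image at patch size 16 yields exactly the 14×14 = 196 tokens that standard ViT configurations work with.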

In the next step, the patches are vectorised ("flattening") and transformed by a linear transformation into "linear embeddings". Finally, the patches are given learnable position embeddings, which allow the computational model to learn information about the structure of the image.
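The flattening and embedding steps above can be sketched as follows. This is a minimal NumPy illustration, not a real implementation: in an actual model the projection matrix and the position embeddings are learned parameters, whereas here they are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_size, channels, embed_dim = 196, 16, 3, 768

patches = rng.normal(size=(num_patches, patch_size, patch_size, channels))

# 1) Flattening: each 16x16x3 patch becomes a vector of length 768.
flat = patches.reshape(num_patches, -1)

# 2) Linear embedding: project each flattened patch into the embedding space
#    (W would be a learnable weight matrix in a real model).
W = rng.normal(size=(flat.shape[1], embed_dim))
embeddings = flat @ W

# 3) Position embeddings: added per position so the model can learn
#    where each patch sits in the image (learnable in a real model).
pos_embed = rng.normal(size=(num_patches, embed_dim))
tokens = embeddings + pos_embed
print(tokens.shape)   # (196, 768)
```

The result is one token vector per patch, ready to be fed into the Transformer encoder.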

Subsequently, the data is processed in a Transformer encoder. Here, based on the available training data, ViT models (pre-)classify the data with the help of attention layers and so-called multi-layer perceptrons (MLPs). The ViT-Base, ViT-Large and ViT-Huge models have between 12 and 32 layers and work with 86 million to 632 million parameters. Finally, an MLP head performs the final classification. Unlike the Transformers used by BERT, for example, the Vision Transformer does not have a decoder.

What is the difference to a Convolutional Neural Network?

Convolutional Neural Networks (CNNs) have been used in the field of computer vision for some time. The "convolution" refers to a mathematical operator that is applied when the algorithm is executed.

The differences between a Convolutional Neural Network and a Vision Transformer lie primarily in the architecture, even if there are certain similarities between the two approaches. While CNNs usually consist of several layers that are processed sequentially, a Vision Transformer works largely in parallel. In CNNs, the convolutional and pooling layers play a particularly important role; they can be run through several times in succession and are concluded with one or more fully connected layers. According to Google, their ViT outperformed a state-of-the-art CNN while requiring four times fewer computing resources.

Distributed File Systems (DFS)

Computers need operating systems (OSes) to function. Operating systems are the basic layer of software that supports a computer's fundamental functions, makes it work and, above all, makes it usable. Everyone knows the most famous operating systems for personal computers, such as Windows, macOS and Linux. One of the most basic functions of an operating system is the file system.

For example, everyone knows the Windows file system, with which Microsoft provides users with a folder structure in which they can store data of any kind, for example documents, music and pictures. Just like ordinary computers, computer clusters also need software that provides basic functions, e.g. coordination between the different nodes of the cluster. One such software environment for operating a computer cluster is Apache Hadoop.

Software environments for operating computer clusters must provide a distributed file system. Just as with ordinary computers, users need a way to store their data in computer clusters. Implementing a file system on a single computer is simple compared to implementing one in a distributed system.

The reason is that if you want to store files and documents across multiple computers, they have to be split up and stored in parallel on multiple nodes, all transparently for the user. This is very hard to do (just think how difficult it is to remember everything you have packed into little boxes when you move). Some examples of distributed file systems are the Google File System (GFS) and the Hadoop Distributed File System (HDFS).
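The core idea of splitting a file into blocks can be shown in a toy sketch. This is only an illustration of the principle, not how HDFS is implemented: real systems use much larger blocks (HDFS defaults to 128 MB per block) and replicate each block across several nodes.

```python
def split_into_blocks(data: bytes, block_size: int = 4):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def reassemble(blocks):
    """Joining the blocks in order recovers the original file."""
    return b"".join(blocks)

payload = b"hello distributed world"            # 23 bytes
blocks = split_into_blocks(payload)             # 6 blocks of at most 4 bytes
assert reassemble(blocks) == payload            # the split is invisible to the user
print(len(blocks))                              # 6
```

In a real distributed file system, each of these blocks could live on a different node, and a name service keeps track of which node holds which block so that reassembly stays transparent to the user.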