Transformer (Machine Learning)

What is a Transformer?

Transformers in the area of machine learning are a form of neural network that makes use of a so-called "attention mechanism". Here, a part of an input (for example, a word, a word syllable of a sentence or a pixel of an image) is related to the remaining parts of that input.

The aim of this method is that each individual word or pixel contributes to the understanding of the overall data by being combined with the remaining components. In search queries on the internet, for example, understanding pronouns and prepositions in connection with nouns is essential, as only in this way can the meaning of the overall search be grasped.

Transformers are predominantly applied in the field of deep learning, for example for text recognition, text processing and image recognition.

Architecture in Deep Learning

The structure of a transformer in machine learning is basically divided into an encoder and a decoder. The data pass through the encoder and decoder in the following sequence.

In the first step of the process, the input is "embedded", i.e. transferred into processable data in the form of vectors. In the next step, the position of the vectors (or words in a sentence) is communicated to the transformer by "positional encoding" in the form of an indexing. This is followed by the first attention mechanism. In this multi-head attention layer, the transformer compares the data currently being processed (e.g. a word) with all other data (e.g. the remaining words in a sentence) and determines their relevance. Because of this self-comparison, this layer is also called "self-attention".
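
As an illustration of the self-attention step described above, here is a minimal numpy sketch of scaled dot-product attention for a single head; the sequence length, dimensions and random projection matrices are purely illustrative assumptions, and a real multi-head layer runs several such heads in parallel and concatenates their outputs.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.

    X          : (seq_len, d_model) embedded input vectors
    Wq, Wk, Wv : projection matrices, here (d_model, d_head)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # compare every position with every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted combination of the values

# toy example: 4 "words", model width 8, one head of width 4 (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (4, 4)
```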

Now follows the "Add & Norm" step, in which a copy of the original data taken before the multi-head attention layer is added to the processed data from that layer and the result is normalised. The last layer of the encoder is the feed-forward layer, a neural network with an input, a hidden and an output layer whose non-linear activation function maps the values to a range from 0 to infinity. The encoder processing is completed with a repetition of the previously described "Add & Norm" step.
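
The following is a minimal numpy sketch of the "Add & Norm" and feed-forward steps just described; the exact normalisation variant (a simple layer normalisation here) and the ReLU activation are common choices rather than the only possible ones. In the encoder, the self-attention output would pass through add_and_norm, then through feed_forward, and through add_and_norm once more.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalise each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(residual, sublayer_out):
    # "Add & Norm": unchanged copy plus sub-layer output, then normalisation
    return layer_norm(residual + sublayer_out)

def feed_forward(x, W1, b1, W2, b2):
    # position-wise feed-forward network; ReLU maps values to [0, infinity)
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2
```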

In the next process step, the decoder starts by initialising and positioning an output sequence. This is done analogously to the encoder by "output embedding" and "positional encoding". This step is followed by the "Masked Multi-Head Attention" layer, which is particularly relevant in the training phase of the model. Here, the decoder learns to generate or approximate a target output from actual training data. Due to the parallel mode of operation of the transformer in machine learning, all positions of the output sequence are already available to the decoder during training; therefore, the future positions of the output sequence are masked, i.e. obscured. This layer also gets its name from this.
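
A hedged sketch of how such masking can be realised: the attention scores for future positions are set to minus infinity before the softmax, so that they receive zero weight. The concrete masking scheme may differ between implementations.

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal = positions that lie in the future
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_scores(scores):
    # set attention scores for future positions to -inf so that
    # the subsequent softmax assigns them zero weight
    masked = scores.copy()
    masked[causal_mask(scores.shape[0])] = -np.inf
    return masked

print(masked_scores(np.zeros((3, 3))))   # upper triangle becomes -inf
```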

This layer is followed by another "Add & Norm" step before the data is passed on to the multi-head attention layer. This layer is also referred to as "encoder-decoder attention", as it establishes a connection between the encoder and the decoder. It connects the input sequences passed through the encoder with the previously generated output sequences and is therefore also called "cross-attention". This mechanism is required, for example, when translating a text into another language, in order to calculate which word of the target language should come next in the sentence. This layer is followed by another "Add & Norm" step, a feed-forward layer analogous to the one in the encoder, and a further "Add & Norm" step.

In the penultimate step of the transformer in deep learning, the vectors processed so far are transferred into a larger vector in the linear layer, in order to be able to represent, for example, the entire vocabulary of a target language in the context of a translation. In the final softmax function, a probability between 0 and 1 is calculated for each possible output, and the most probable output is thus determined.
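
To illustrate the final linear layer and softmax, the following sketch projects a single decoder vector onto a toy vocabulary and turns the result into probabilities between 0 and 1; the sizes and random weights are assumptions.

```python
import numpy as np

def output_distribution(decoder_state, W_vocab, b_vocab):
    """Linear projection to vocabulary size followed by softmax.

    decoder_state : (d_model,) vector produced by the decoder for one position
    W_vocab       : (d_model, vocab_size) projection matrix
    """
    logits = decoder_state @ W_vocab + b_vocab
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()            # probabilities between 0 and 1, summing to 1
    return probs

# toy vocabulary of 5 "words" (illustrative sizes and random weights)
rng = np.random.default_rng(1)
probs = output_distribution(rng.normal(size=8), rng.normal(size=(8, 5)), np.zeros(5))
print(probs, probs.argmax())        # index of the most probable next word
```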

Temporal Difference Learning

What is Temporal Difference Learning?

Temporal Difference Learning (also called TD learning) describes a variant of reinforcement learning, which, along with supervised learning and unsupervised learning, is one of the three learning methods of machine learning.

As with other reinforcement learning methods, Temporal Difference Learning does not require training data to be available to the learning algorithm at the start. The system, or a software agent, learns through a trial-and-error process in which it receives a reward for a sequence of decisions/actions and aligns and adjusts its future strategy accordingly. The model of the algorithm is based on the Markov decision problem, in which the benefit for a software agent results from a sequence of actions.

Unlike other learning methods, in TD learning the evaluation function is updated with the corresponding reward after each individual action rather than after a completed sequence of actions. In this way, the strategy iteratively approaches the optimal function. This process is called bootstrapping and aims to reduce the variance in finding a solution.
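
A minimal sketch of such a per-action update, here for the simplest variant TD(0) with a tabular state-value function; the learning rate and discount factor are illustrative assumptions.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """One TD(0) update of the state-value function V (a dict).

    The value estimate is adjusted after every single action using the
    immediate reward plus the discounted estimate of the next state
    (bootstrapping), instead of waiting for the end of an episode.
    """
    td_target = reward + gamma * V.get(next_state, 0.0)
    td_error = td_target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return V

V = {}
V = td0_update(V, state="s0", reward=1.0, next_state="s1")
print(V)   # {'s0': 0.1}
```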

What algorithms exist in TD learning?

Within Temporal Difference Learning, several algorithms exist to implement the method.

In Q-learning, the software agent evaluates the utility of an action to be performed instead of the utility of a state and, based on the current evaluation function, chooses the action with the greatest increase in utility. For this reason, Q-learning is said to use an "action-value function" instead of a "state-value function".

SARSA (short for "state-action-reward-state-action") is likewise an algorithm with an action-value function. Beyond this commonality, SARSA differs from Q-learning in that Q-learning is an off-policy algorithm, whereas SARSA is an on-policy algorithm. An off-policy algorithm takes only the next state into account when determining the update, whereas an on-policy algorithm considers both the next state and the action actually chosen in it; the agent thus remains true to its strategy when calculating the subsequent action. The algorithms considered so far only take into account the immediate reward of the next action.
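
To make the off-policy/on-policy distinction concrete, here is a hedged sketch of the two update rules with a tabular action-value function Q; the learning rate and discount factor are illustrative assumptions.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # off-policy: bootstrap with the best action in the next state,
    # regardless of which action the agent will actually take
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    return Q

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # on-policy: bootstrap with the action the current strategy actually chose
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q
```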

With so-called TD n-step methods, on the other hand, the rewards of the next n steps are included.

TD-Lambda, written TD(λ), is an extension of the temporal difference learning algorithm. Here, not only a single state leads to an adjustment of the evaluation function; instead, the values of several states within a sequence can be adjusted. The decay rate λ regulates the extent of the possible change for each individual state: the further a state lies back from the state currently under consideration, the more its influence decreases, exponentially with each step. TD-Lambda can also be applied to Q-learning and SARSA.
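
A minimal sketch of a TD(λ) update with eligibility traces, which is one common way to realise the decaying adjustment of earlier states described above; the parameter values are illustrative assumptions.

```python
def td_lambda_update(V, traces, state, reward, next_state,
                     alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) update with eligibility traces.

    Every previously visited state keeps an eligibility trace that decays
    by gamma * lambda per step, so the current TD error also adjusts the
    values of earlier states, with exponentially decreasing weight.
    """
    td_error = reward + gamma * V.get(next_state, 0.0) - V.get(state, 0.0)
    traces[state] = traces.get(state, 0.0) + 1.0          # mark the visited state
    for s in list(traces):
        V[s] = V.get(s, 0.0) + alpha * td_error * traces[s]
        traces[s] *= gamma * lam                           # exponential decay
    return V, traces
```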

What are these algorithms used for in practice?

The areas of application of Temporal Difference Learning within reinforcement learning are manifold. A striking example of its use is TD-Gammon, a program based on the game Backgammon that was developed using a TD-Lambda algorithm. The same applies to AlphaGo, which is based on the Japanese board game Go.

One application of Q-learning can be found in autonomous driving in road traffic, where the system independently learns collision-free overtaking manoeuvres and lane changes and then maintains a constant speed.

SARSA, on the other hand, can be used, for example, to detect credit card fraud. The SARSA method provides the algorithm for detecting fraud, while the classification and regression method of a random forest optimises the accuracy of the credit card default prediction.

Text recognition (Optical Character Recognition)

What is text recognition?

Optical Character Recognition (OCR) converts analogue text into editable digital text. For example, a printed form is scanned and converted by the OCR software into a text document on the computer, which can then be searched, edited and saved.

Modern OCR text recognition is able to correctly recognise over 99 % of the text information. Words that are not recognised are marked by the programme and corrected by the user.

To further improve the results, OCR text recognition is often supplemented with methods of context analysis (Intelligent Character Recognition, ICR for short). For example, if the text recognition software has recognised "2immer", the "2" is corrected to a "Z", resulting in the output of the German word "Zimmer" ("room"), which makes sense in context.

There is also Intelligent Word Recognition (IWR), which is intended to solve the problem of recognising cursive handwriting.

Some examples of free and paid optical character recognition software (in alphabetical order):

  • ABBYY FineReader PDF
  • ABBYY FlexiCapture
  • Adobe Acrobat Pro DC
  • Amazon Textract
  • Docparser
  • FineReader
  • Google Document AI
  • IBM Datacap
  • Klippa
  • Microsoft OneNote
  • Nanonets
  • OmniPage Ultimate
  • PDF Reader
  • Readiris
  • Rossum
  • SimpleOCR
  • Softworks OCR
  • Soda PDF
  • Veryfi

Write an OCR text recogniser yourself with Python or C#

It is also possible to incorporate text recognition into your own scripts using the programming languages Python or C#. This requires the free OCR engine Tesseract, which is available for Linux and Windows.

This approach provides a customisable text recognition solution for both scans and photos.
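
A minimal Python sketch of this approach, assuming the Tesseract engine and the pytesseract and Pillow packages are installed; the file name and language code are placeholders.

```python
# Recognise text in an image with Tesseract from Python.
# Prerequisites (assumption): Tesseract engine installed plus
# pip install pytesseract pillow
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")                 # scan or photo of the document
text = pytesseract.image_to_string(image, lang="eng")  # run OCR on the image
print(text)
```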

How does Optical Character Recognition software work?

The basis is the raster graphic (an image copy of the text), which is created with the help of a scanner or a camera from the physically existing text, for example a book page. Text recognition from a photo is usually more difficult than from a scan, where the image copy offers consistently good conditions. With a photo, the exposure and the angle at which the document was taken can cause problems, but these can be corrected through the use of AI.

After that, the OCR software works in 3 steps:

1. Recognition of the page and outline structure

The scanned graphic is analysed for dark and light areas. Normally, the dark areas are identified as characters to be recognised and the light areas as background.

2. Pattern or feature recognition

This is followed by further processing of the dark areas to find alphabetic letters or numeric digits. The approach of the various OCR solutions differs in whether only one character, one word or a text block is recognised at a time. The characters are identified using pattern or feature recognition:

Pattern recognition: The OCR programme compares the characters to be checked with its database of text examples in different fonts and formats and recognises identical patterns.

Feature recognition: The OCR programme applies rules regarding the features of a particular letter or number. Features can be, for example, the number of angled lines, crossed lines or curves in a character.

For example, the feature information for the letter "F" consists of a long vertical line and two short lines meeting it at right angles.

3. Coding in the output format and error control

Depending on the area of application and the software used, the document is saved in different formats. For example, it is output as a Word or PDF file, or saved directly in a database.

In addition, the last step also involves error checking by the user to manually correct words or characters that are not recognised.

How does AI support text recognition?

On the one hand, artificial intelligence (AI) supports text recognition already during the optimisation of the raster graphic, especially with photos. If the document to be read in is bent or creased, the text can be slanted or distorted, which causes problems for the OCR software during processing. With photos, poor exposure and an unsuitable shooting angle can also create bad conditions for the OCR software.

With the help of AI, the document can be "smoothed" in its structure, the lighting optimised and the angle corrected, so that it again offers good conditions for text recognition.
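
As an illustration of such preprocessing, here is a minimal sketch using the Pillow library that converts an image to grayscale, spreads the brightness range and reduces noise before the OCR step; it is a classical stand-in for the AI-based optimisation described above, and the filter choices and file name are assumptions.

```python
from PIL import Image, ImageOps, ImageFilter

def prepare_for_ocr(path):
    img = Image.open(path).convert("L")              # grayscale
    img = ImageOps.autocontrast(img)                 # spread the brightness range (lighting)
    img = img.filter(ImageFilter.MedianFilter(3))    # reduce noise and speckles
    return img

prepared = prepare_for_ocr("photo_of_page.jpg")      # then hand the result to the OCR engine
```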

On the other hand, AI improves the results of text recognition itself. Artificial intelligence learns with every text and every corrected error. In this way, the errors in text recognition are constantly minimised and the OCR software constantly delivers better results.

Technological Singularity

What is a singularity?

The term singularity is used in different contexts. Singularities are singular, i.e. one-off, occurrences. The root word singulus comes from Latin and means "single".

In mathematics as well as in physics and astrophysics there are singularities, which in astrophysics are described as a place of infinite curvature of space-time. In systems theory, on the other hand, a singularity denotes a context in which, practically speaking, a small cause produces a large effect. Geography describes it as an object that clearly differs from its surroundings but is not essential to the landscape. Furthermore, the technological singularity describes theories of futurology.

When does technological singularity occur?

What is of interest in the technological or technical singularity is the point in time from which artificial intelligence surpasses human intelligence. From this point on, rapid technological improvements are expected, up to and including the self-reproduction of technology.

As a result, technical progress could proceed irreversibly and at an ever accelerating rate. Technological breakthroughs are expected in the computer industry, in nanomaterials (graphene), optical computers and quantum computers, and exponential growth in IT is anticipated. In robotics, too, the singularity is synonymous with a rapid further development of industry and machines. Robots could facilitate or even completely take over work in many industries and sectors.

The future of humanity would no longer be predictable after the occurrence of a technological singularity. If a technological singularity occurred in artificial intelligence, it would not only be able to reproduce and optimise itself, but could also develop its own consciousness. Accordingly, such a higher intelligence would be the last invention of mankind, since later inventions would largely be produced by machines.

Many futurologists have so far forecast estimates of the technological singularity and have had to postpone them several times by decades into the future. It is likely that the technological singularity will occur quite unexpectedly and cannot be fully predicted even by those involved in its development. The concept of the technological singularity is closely connected with the theories and ideas of transhumanism and posthumanism. It is assumed that technological development can significantly increase the duration of human life expectancy and practically realise biological immortality.

What are singularities in an FEM simulation?

An FEM singularity can occur when the numerical equations are solved: if the structure to be calculated is defined with insufficient boundary conditions, the system of equations cannot be solved and the calculation produces no technically meaningful results. An FEM singularity can be avoided by modelling the parts as they are actually manufactured.

Training data

What is training data?

Within the framework of artificial intelligence and machine learning, training data is indispensable for training the system. In unsupervised learning, no labelled examples are needed and the AI system can be trained directly with suitable input data. Supervised learning, on the other hand, requires sample data for which the target variable is given. This data set is called the sample data set.

In supervised learning, the data set is divided into different subsets: training, validation and test data. These three data sets are created from the "machine learning flat file" (the sample data set). A possible division is as follows (a minimal split sketch follows the list):

  • 70% training data set
  • 10% test data set
  • 20% validation data set
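
A minimal sketch of this division, assuming scikit-learn is available; the generated sample data set and the random seeds are placeholders.

```python
# Split a sample data set into 70 % training, 20 % validation and 10 % test data.
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=42)   # stand-in sample data set

# first split off 70 % training data
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
# then divide the remaining 30 % into 20 % validation and 10 % test data
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=1/3, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 700, 200, 100
```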

The training data set is a data set filled with examples, i.e. records for which the target variable is given. This data set is used for learning patterns and correlations. The weights of the algorithm are adjusted during training via the training data set; the algorithm thus learns from this data. Training data with corresponding examples are needed for regression and classification problems. Algorithms tend to over-adapt to the patterns learned from the training data: correlations and relationships can be internalised too strongly, with the consequence that these rules no longer apply with a high degree of accuracy in general.

Test data are independent of the training data but should have the same probability distribution. The test data are not used during training, so the algorithm does not know them. Since examples and target variables are available for the test data, the quality of the model can be measured on them. As soon as the trained model fits the test data well and predicts the example data with good quality, it can be applied to unknown data for evaluation.

The validation data set can also be regarded as an example data set. It is used for tuning the hyperparameters of a model; above all, overfitting of the model to the training data is to be avoided.

Why do you need training data?

In general, training data is needed to set up machine learning and artificial intelligence correctly. The training of systems is supported with requirement-specific training data sets. The required data sets can be newly and individually provided, with the data labelled and annotated. Existing training data and system results can also be validated.

One of the most difficult tasks in the development of a machine learning system is collecting large amounts of high-quality AI training data. Service providers offer unique, newly created AI training data for individual projects, supplying photos, audio and video recordings as well as texts, which then support the programming of learning-based algorithms.

What training data do artificial intelligence and machine learning need?

Artificial intelligence is used in route planning, in quality controls in production and in the analysis of X-ray images. Training data for machine learning in particular is becoming increasingly important.

AI systems are trained with suitable data. The patterns recognised in the training data and the information can then be transferred by the systems to unknown data sets after the training process is complete. The need for such training data will increase greatly in the years ahead.

Companies that develop or use AI frequently also draw on records containing personal data. Legal requirements must always be observed and complied with when working with training data in machine learning systems. Data sovereignty and careful data stewardship must replace data minimisation as the guiding principle in order to meet the major challenges of the future.