Data (information theory)

What are data?

Data are digital information which can be read, edited and stored by a computer or other electronic device. They are available in different formats, whose coding follows a specific syntax.

In computer science, data is almost exclusively represented in binary form. The bit is the unit of measurement for the amount of data. Examples of digital formats include text, images, numbers, and audio and video files. Exactly what information data represents must be inferred from the respective context. For example, the sequence of digits 12345678 could be a telephone number or a credit card number. It only acquires its concrete meaning through processing by programs or algorithms.
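
This context dependence can be illustrated with a short sketch. The following Python snippet (purely illustrative; the digit string is taken from the example above) interprets the same raw bytes once as text and once as a number:

```python
# The same raw bytes carry no meaning by themselves; a program assigns the meaning.
raw = b"12345678"

as_text = raw.decode("ascii")   # interpreted as a character string
as_number = int(as_text)        # interpreted as an integer value
as_binary = bin(as_number)      # the binary representation a computer works with

print(as_text)    # '12345678' - could be a phone number
print(as_number)  # 12345678   - could be an amount or an ID
print(as_binary)  # the same value written in binary form
```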

What types of data exist?

Structured data

Structured data are organised in a defined way so that all entries have a similar form. Examples are records, fields or lists. Structured data is used above all in relational databases. Information is sorted and formatted before being stored in the corresponding fields. It can then be queried and edited via a database language such as SQL.
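
As a minimal sketch of how structured data is stored and queried, the following Python snippet uses the built-in sqlite3 module; the table and column names are invented for illustration:

```python
import sqlite3

# In-memory database purely for demonstration purposes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Information is stored in predefined, typed fields ...
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Alice", "Berlin"))
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Bob", "Hamburg"))
conn.commit()

# ... and can then be queried and edited via SQL.
for row in cur.execute("SELECT name, city FROM customers WHERE city = ?", ("Berlin",)):
    print(row)  # ('Alice', 'Berlin')
```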

Semi-structured data

In contrast to structured data, semi-structured data has no fixed schema. It is hierarchically structured and can be extended by nested information. In addition, the order of the attributes is unimportant for semi-structured data. An entity of a class can have several attributes at the same time.
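
A typical example of semi-structured data is a JSON document: hierarchical, extensible through nesting, and without a fixed schema. The records below are invented for illustration:

```python
import json

# Two records of the same "class" with different attributes and nesting depth -
# there is no fixed schema, and the order of the attributes does not matter.
people = [
    {"name": "Alice", "contact": {"email": "alice@example.com"}},
    {"name": "Bob", "languages": ["de", "en"],
     "contact": {"email": "bob@example.com", "phone": "12345678"}},
]

print(json.dumps(people, indent=2))
```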

Unstructured data

Unstructured data is not available in a standardised format. A structure must first be derived from it before it can be stored in a database. Examples of unstructured data are images, texts, and video and audio recordings. They often contain a great deal of relevant information, which is of particular importance in the area of Big Data.
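
How a structure can first be derived from unstructured content can be hinted at with a small sketch; the regular expressions and field names are illustrative assumptions, not a general-purpose solution:

```python
import re

# Unstructured input: free text without a standardised format.
text = "Meeting with Alice on 2024-05-03, contact: alice@example.com"

# Derive a minimal structure (here: date and email address) so that the
# information could be stored in database fields.
record = {
    "date": re.search(r"\d{4}-\d{2}-\d{2}", text).group(0),
    "email": re.search(r"[\w.]+@[\w.]+", text).group(0),
}
print(record)  # {'date': '2024-05-03', 'email': 'alice@example.com'}
```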

What is data management?

Data management refers to all technical and organisational measures for handling data in order to use it efficiently and improve business processes. Companies should therefore have a comprehensive data strategy that defines the goals of data management.

Consolidation is one of the essential methods of data management and plays a particularly important role for companies. It uses aggregation to bring together data from different systems or departments into a single source. This creates a central view and reduces redundancies. Optimal consolidation requires a suitable data architecture and high data quality.
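
A minimal sketch of consolidation by aggregation, assuming the pandas library is available and using invented example data for two departments:

```python
import pandas as pd

# Order data from two separate systems/departments (invented example data).
sales_web = pd.DataFrame({"customer": ["Alice", "Bob"], "revenue": [120.0, 80.0]})
sales_shop = pd.DataFrame({"customer": ["Alice", "Carol"], "revenue": [50.0, 200.0]})

# Consolidation: bring both sources together into a single view ...
combined = pd.concat([sales_web, sales_shop], ignore_index=True)

# ... and aggregate it to remove redundancy (one row per customer).
consolidated = combined.groupby("customer", as_index=False)["revenue"].sum()
print(consolidated)
```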

How can data be stored?

Electronic data storage

Electronic data memories consist of semiconductor components whose circuits are almost exclusively based on silicon. They are divided into volatile (e.g. RAM), permanent (e.g. SSDs) and semi-permanent memories (e.g. memory cards, USB sticks).

Magnetic data storage

Magnetisable material such as tapes or disks is used for this type of storage. A distinction is made between rotating and non-rotating storage media. With rotating disks, data is read or overwritten with the help of a read-write head. Non-rotating storage media such as magnetic tapes or cards are pulled past a fixed head.

Optical data storage

A laser beam is used to read and write data on optical data carriers. The reflective properties of the medium are used for storage. Examples of optical data storage media are CDs or DVDs.

Cloud-based storage

Cloud

With cloud computing, data is stored and managed externally via the internet. The files can thus be accessed from any location. In addition, cloud storage is highly scalable.

Edge

Edge computing is a form of decentralised data processing that takes place close to the data source or the user. This allows data to be processed faster and more securely.

Fog

Fog Computing is a cloud concept in which data can be managed decentrally in local mini-data centres. Fog nodes are switching nodes that decide whether data must be forwarded to central or decentralised end points. This reduces the communication path and saves computing power.

What is personal data?

Personal data is information that can be assigned to an identifiable person. This includes, for example, name, address, date of birth, telephone number, email address, national insurance number or IP address.

Likewise, special categories of data, such as medical data or political and religious beliefs, can be personal. The term has been legally defined since the entry into force of the European General Data Protection Regulation (GDPR) on 25 May 2018.
Companies must comply with legal regulations when processing personal data. This includes technical and organisational measures to minimise, protect and be transparent about the collection, processing and disclosure of personal data.

Data Augmentation

What is Data Augmentation?

Data augmentation is a process in which new data is artificially generated on the basis of an existing data set in order to increase the total amount of data. The technique is applied as a preparatory step in the field of machine learning. The functionality can be implemented by means of ready-made libraries in Python, such as PyTorch.

Benefits and challenges

One advantage of data augmentation is that it can reduce overfitting. This over-adaptation occurs, for example, when a model cannot generalise sufficiently from the training data, for instance because the amount of training data is too small. Generating augmented data can counteract the problem of overfitting, as it increases the amount of data.

A further benefit of artificial data generation is that potential data protection problems can be avoided, because the data only comes into existence through data augmentation. Furthermore, this technique can be used to collect and label data in a cost-effective way.

There are also challenges. Once generated, the augmented data must be subjected to a qualitative assessment through a rating system in order to capture the added value of the data expansion. Biases in the original data cannot be eliminated by this method, but are carried over. To reduce this problem, an optimal augmentation strategy can be developed.

How it works

The standard data augmentation procedure works as follows: the original data (e.g. an image) is loaded into the data augmentation pipeline. In this pipeline, so-called transformation functions are applied to the input data with a certain probability. These can be, for example, rotating or mirroring (flipping) the image. After passing through the pipeline, the generated results are evaluated by a human expert. If the generated data passes this inspection, it flows into the training data population as augmented data.
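
A minimal sketch of such a pipeline, assuming torchvision is installed and a sample image "input.jpg" exists; the chosen transformations and probabilities are illustrative:

```python
from PIL import Image
from torchvision import transforms

# Transformation functions are applied to the input with a certain probability.
pipeline = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # mirroring (flipping)
    transforms.RandomRotation(degrees=15),    # rotating by up to +/- 15 degrees
])

original = Image.open("input.jpg")                   # hypothetical input image
augmented = [pipeline(original) for _ in range(4)]   # several augmented variants
# The generated variants would then be reviewed before entering the training set.
```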

What are data augmentation techniques?

Within the framework of image classification and segmentation, several techniques can be used to expand the training data. After loading the original image into the pipeline, the image can, for example, be extended by a frame, mirrored horizontally or vertically, rescaled, moved along the x- or y-axis, rotated, cropped or zoomed into. In addition to these geometric modifications, there are also those that concern colour or contrast, such as brightening or darkening the image, converting it to greyscale, changing the contrast, adding noise or deleting parts of the image. Each of these operations is applied to the original image with a certain probability, ultimately creating augmented data.
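
The colour- and contrast-related modifications mentioned above can likewise be expressed with standard torchvision transforms; the parameters in this sketch are chosen arbitrarily for illustration:

```python
from torchvision import transforms

colour_pipeline = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4),  # brighten/darken, change contrast
    transforms.RandomGrayscale(p=0.2),                     # occasionally convert to greyscale
    transforms.ToTensor(),                                 # RandomErasing operates on tensors
    transforms.RandomErasing(p=0.3),                       # delete parts of the image
])
```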

In addition to image classification and segmentation, the technique is also applied in the field of Natural Language Processing (NLP). Since NLP deals with the processing of natural language, meaningful data generation is more difficult. Applicable techniques are synonym substitution and the insertion, exchange or deletion of words, which can be summarised under the term Easy Data Augmentation (EDA). Another method is back-translation: a text is translated into a target language and back into the original language, and the result expands the training data set. Augmented data can also be created by so-called contextualised word embeddings.
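
Two of the EDA operations (random word deletion and word swapping) can be sketched in plain Python; this is a simple illustration, not a reference to a specific library:

```python
import random

def random_deletion(words, p=0.1):
    """Delete each word with probability p (keep at least one word)."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words, n=1):
    """Swap the positions of two randomly chosen words n times."""
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

sentence = "data augmentation expands the training data".split()
print(" ".join(random_deletion(sentence)))
print(" ".join(random_swap(sentence)))
```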

Where is Data Augmentation used?

The process is particularly well represented in the medical imaging sector, for example in the segmentation of tumours or in the identification of diseases on X-ray images. Since only a limited data set is available for rare diseases, it can be expanded through data augmentation. Another use case can be found in the area of autonomous driving, where data augmentation is used to extend the simulation environment. Data augmentation is also used in the field of Natural Language Processing, for example to augment the training data of NLP applications.

Deduction

What is deduction?

Deduction is a term from logic and comes from the Latin word deductio, which means "derivation". It denotes a logical conclusion from the general to the particular. It is also understood as the path from theory to empiricism.

The basis is the inheritance of properties from higher-level elements to their subsets. Through a general theory, statements can thus be made about concrete individual cases. The precondition or assumption is also called a premise. From one or more premises, the logical consequence follows with the help of inference rules; such a conclusion is compellingly or deductively valid. The truth of the premises leads to the truth of the conclusion: no false conclusion may arise from true premises.

Deductive inferences, like other scientific methods, are not verifiable, but only falsifiable. That is, their validity is assumed as long as there is no counter-evidence or new knowledge. In the field of artificial intelligence, deduction plays an essential role in logic programming and automated theorem proving.

What are examples of deduction?

A classic example of deductive reasoning comes from Aristotle:

All human beings are mortal. Socrates is a human being. It follows that Socrates is mortal.

The premises "all human beings are mortal" and "Socrates is a human being" are true. The property "mortal" of the superordinate category human being is transferred to the concrete example of Socrates.

Another example of deductive reasoning is:

Pilots have a quick reaction time. He is a pilot. He has a quick reaction time.

The premise here is that the characteristic of a quick reaction time applies to pilots in general. According to the premise, a concrete representative of the category therefore possesses a fast reaction capability, otherwise he would not be a pilot. The statement is therefore true.

The deductive method is also present in the Sherlock Holmes detective stories. In The Blue Carbuncle, Holmes infers the socio-economic background of the wearer of an old hat from general observations. The size and quality of the found hat indicate an intellectual and wealthy person. However, since the hat is worn out and full of dust, Holmes logically concludes that its owner is no longer financially well off and rarely leaves the house.

What are the differences between induction and abduction?

Deduction vs. induction

Induction (Latin inducere, "to lead in") is the reverse process to deduction. Here, a general conclusion is formed from a concrete observation or phenomenon; the path therefore leads from empiricism to theory. Collecting data on individual elements leads to the recognition of properties that all representatives of a group or category possess.

Example:

The little sparrow lays eggs. The sparrow is a bird. All birds lay eggs.

The specific premise here is the egg-laying sparrow, which belongs to the group of birds. From the observation of the sparrow follows the abstract conclusion about the behaviour of all birds.

Induction and deduction never occur in pure form. The premises used in deductive reasoning are closely linked to empirical findings, and induction is linked to already established theory. The procedures differ essentially in whether an existing regularity is to be verified (deduction) or a new one formed (induction).

Deduction vs. abduction

A third method of logical reasoning is abduction (Latin abducere, "to lead away"). The term was introduced by the American philosopher Charles Sanders Peirce. It differs from induction and deduction in that it extends knowledge: an unknown case or cause is inferred from a known result and a known rule.

Example:

These apples are red. All the apples from this basket are red. These apples are from this basket.

From the result, the rule "all apples from this basket are red" is used to infer the case "these apples are from this basket". Abductive reasoning is a presumption based on circumstantial evidence.

DALL-E

What is DALL-E?

DALL-E is a neural network that is based on artificial intelligence and creates images from text descriptions. It was unveiled by OpenAI in early 2021 after years of preceding work. OpenAI is a company dedicated to the research and development of artificial intelligence; its backers include Elon Musk and Microsoft. The name is a combination of WALL-E, the title of a Pixar science fiction film, and the name of the surrealist artist Salvador Dalí.

Function of the algorithm

DALL-E uses a 12-billion-parameter version of the GPT-3 Transformer model. The abbreviation GPT stands for Generative Pre-trained Transformer, and the "3" for the third generation. GPT-3 is an autoregressive language model. It uses deep learning methods to produce human-like text. The quality is now so high that it is not always easy to tell whether a text was written by a machine or a human.

DALL-E interprets input in natural language and generates images from it. It uses a database of image-text pairs and works with the zero-shot learning method: it generates a pictorial output from a description without further training. It works together with CLIP (Contrastive Language-Image Pre-training), which was also developed by OpenAI. CLIP is a separate neural network that evaluates the generated images and ranks how well they match the text input.

Text and image come from a single data stream containing up to 1280 tokens. The model is trained by maximum likelihood to generate all tokens one after another. The training data enables the neural network to create images from scratch as well as to revise existing images.
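
Training by maximising the probability of generating all tokens one after another corresponds to the usual autoregressive maximum-likelihood objective, sketched here in general notation (the symbols are chosen for illustration):

```latex
\max_{\theta} \; \sum_{i=1}^{N} \log p_{\theta}\!\left(t_i \mid t_1, \ldots, t_{i-1}\right)
```

where t_1, ..., t_N are the up to 1280 text and image tokens of a training example and θ are the model parameters.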

What are the capabilities of DALL-E?

DALL-E has a wide range of capabilities. It can display photorealistic images of both real and non-real objects, or output paintings and emojis. It can also manipulate or rearrange images.

In addition, in many cases the neural network is able to fill in gaps and display details on images that were not explicitly mentioned in the description. For example, the algorithm has already converted the following representations from text descriptions:

  • a blue rectangular circle within a green square
  • the cross-section of a cut apple
  • a painting of a cat
  • the façade of a shop with a certain lettering

Deep Generative Models

What are Deep Generative Models?

A Deep Generative Model (DGM) is a neural network in the subdomain of deep learning that follows the generative modelling approach. The opposite of this approach is discriminative modelling, which aims to identify the best possible decision boundaries on the basis of the existing training data and to classify the input accordingly.

The generative approach, on the other hand, follows the strategy of learning the data distribution from the training data and, true to its name, creating new data points based on the learned or approximated distribution. While discriminative modelling is attributed to supervised learning, generative modelling is usually based on unsupervised learning.

Deep generative models thus ask how data is generated in a probability model, while discriminative models aim to make classifications based on the existing training data. Generative models try to understand the probability distribution of the training data and generate new or similar data on this basis. For this reason, one area of application of deep generative models is image generation based on sample images, as in the neural network DALL-E.

What are Flow-based Deep Generative Models?

A flow-based deep generative model is a generative model that is able to interpret and model a probability distribution of data. This can be illustrated with the help of the so-called "normalising flow".

The normalising flow describes a statistical method with which density functions of probability distributions can be estimated. In contrast to other types of generative models, such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs), flow-based deep generative models construct the "flow" through a sequence of invertible transformations. This allows the likelihood function to be evaluated and thus the true probability distribution to be learned.
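
The central idea can be summarised by the change-of-variables formula on which normalising flows are based (standard notation, added here for illustration): for an invertible transformation f and a simple base distribution p_Z,

```latex
\log p_X(x) \;=\; \log p_Z\big(f(x)\big) \;+\; \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

Because f is invertible, this log-likelihood can be evaluated exactly and maximised directly on the training data.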

In Generative Adversarial Networks, by contrast, the method consists of a generator and a discriminator, which can be seen as opponents. The generator produces data which the discriminator tries to identify as fake (i.e. as not being part of the given, real distribution). The goal of the generator, on the other hand, is to ensure that the generated data is not identified as fake, so that through training the generator's distribution approximates the real distribution. In the Variational Autoencoder, the distribution is optimised by maximising the Evidence Lower Bound (ELBO).
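
The two optimisation goals mentioned here can be sketched in their standard textbook form (the notation is added for illustration and not taken from the original text). The GAN minimax game between generator G and discriminator D is

```latex
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

while the VAE maximises the evidence lower bound

```latex
\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```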

Where are these models applied?

Deep Generative Models have extensive applications in the field of Deep Learning.

For example, they are used in image generation. New, artificial faces with human facial features are created from the human faces in the training data. This method can also be used in the film and computer games sector. A special application of generative models are so-called deepfakes: artificially created media content that gives the appearance of being real.

Genuine-looking handwriting can also be created by means of generative models. Likewise, a photo can be generated on the basis of a textual description.

The achievements of deep generative models can also be used in medicine. The paper "Disease variant prediction with deep generative models of evolutionary data", for example, points out that previously unknown disease variants can be predicted with the help of generative models. Specifically, the article refers to the detection of protein variants in disease-related genes that have the ability to cause disease. The disadvantage of previous methods (primarily based on supervised learning) was that the models relied on known disease labels and could not predict new ones. This is to be improved with deep generative models.