Regular Expressions (RegEx)

What are regular expressions?

Regular expressions, also known as "regex" or "regexp", are sequences of characters that define a search pattern. They are used to match, extract and manipulate text within a larger body of text and are common in programming and data mining. When working with text, RegEx can be used for data validation, search and replace, and text analysis, among other things.

Character selection and character classes in RegEx

Symbol    Explanation
[abc]     Matches any single character in the set (a, b or c)
[^abc]    Matches any single character that is not in the set (a, b or c)
{n}       Matches exactly n occurrences of the preceding character or group
{n,}      Matches n or more occurrences of the preceding character or group
{n,m}     Matches at least n and at most m occurrences of the preceding character or group
^         Matches the beginning of a line
$         Matches the end of a line
.         Matches any single character except a newline
*         Matches zero or more occurrences of the preceding character or group
+         Matches one or more occurrences of the preceding character or group
?         Matches zero or one occurrence of the preceding character or group
\d        Matches any digit (equivalent to [0-9])
\D        Matches any non-digit
\w        Matches any word character (alphanumeric characters and underscores)
\W        Matches any non-word character
\s        Matches any whitespace character (including spaces and tabs)
\S        Matches any character that is not whitespace
|         Matches either the expression before it or the expression after it
()        Groups the enclosed characters so that a following quantifier applies to the whole group
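
A few of the symbols above can be tried out with Python's built-in re module. The following is a minimal sketch; the sample sentence is made up purely for illustration.

```python
import re

# Hypothetical sample text, used only for this illustration.
text = "Order 66 shipped on 2024-05-01 to unit_7."

# \d+ : one or more consecutive digits
digit_runs = re.findall(r"\d+", text)

# [aeiou] : any single character from the set
vowels = re.findall(r"[aeiou]", "regex")

# {n} : exact repetition counts, here a date written as YYYY-MM-DD
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# ^ : anchors the pattern to the beginning of the string
starts_with_order = re.search(r"^Order", text) is not None

print(digit_runs, vowels, dates, starts_with_order)
```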

Examples of regular expressions from practice

Validation of e-mail addresses

A regular expression can be used to check whether a given string is a valid email address by comparing it to a pattern that defines the structure of a valid email address.

Example syntax:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
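
As a rough sketch, the pattern above could be applied in Python as follows. The function name looks_like_email is an assumption made for this example, and the check only covers the structure of the string, not whether the mailbox exists.

```python
import re

# The pattern from the text, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def looks_like_email(candidate):
    """Return True if the whole string has the shape local-part@domain.tld."""
    # fullmatch requires the entire string to match the pattern.
    return EMAIL_PATTERN.fullmatch(candidate) is not None

print(looks_like_email("jane.doe@example.com"))
print(looks_like_email("jane.doe@example"))
```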

Parsing URLs

A regular expression can be used to extract the different parts of a URL, e.g. the protocol, the host name and the path.

Example syntax:

^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
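
To show how the individual parts can be read out, here is a sketch in Python. It deliberately uses a simplified variant of the pattern above with named groups, which is an assumption of this example rather than the pattern from the text; it also avoids the nested quantifier, which can become slow on long non-matching input.

```python
import re

# Simplified, hypothetical variant of the pattern above with named groups.
URL_PATTERN = re.compile(
    r"^(?:(?P<protocol>https?)://)?"     # optional protocol
    r"(?P<host>[\da-z.-]+\.[a-z]{2,6})"  # host name ending in a top-level domain
    r"(?P<path>/[\w./-]*)?$"             # optional path
)

def split_url(url):
    """Return a dict with protocol, host and path, or None if nothing matches."""
    match = URL_PATTERN.match(url)
    return match.groupdict() if match else None

parts = split_url("https://www.example.com/docs/index.html")
print(parts)
```

For production code, a dedicated URL parser such as Python's urllib.parse is usually the more robust choice.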

Extracting phone numbers from text

A regular expression can be used to identify and extract phone numbers from a block of text.

Example syntax:

^(?:\+\d{1,3}|0\d{1,3}|00\d{1,2})?(?:\s?\d){9,12}$
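
The pattern above is anchored with ^ and $, so it checks whether an entire string is a phone number. To pull numbers out of running text, the anchors can be dropped. The sketch below assumes exactly that; the function name and the sample numbers are invented for illustration.

```python
import re

# The pattern from the text without the ^ and $ anchors,
# so that it can match anywhere inside a longer string.
PHONE_PATTERN = re.compile(r"(?:\+\d{1,3}|0\d{1,3}|00\d{1,2})?(?:\s?\d){9,12}")

def extract_phone_numbers(text):
    """Return all substrings that look like phone numbers."""
    # A match may start with a whitespace character, so it is stripped.
    return [m.group().strip() for m in PHONE_PATTERN.finditer(text)]

sample = "Call +49 170 1234567 or 030 12345678 today."
print(extract_phone_numbers(sample))
```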

Removing HTML tags from a string

A regular expression can be used to remove all HTML tags from a string, leaving only the plain text content.

Example syntax:

<\/?[^>]+>
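
A minimal sketch of tag removal in Python follows. It assumes that a tag is everything between an opening and a closing angle bracket; the function name strip_tags is hypothetical. This works for simple fragments, while real-world HTML (comments, scripts, attribute values containing ">") is better handled by an HTML parser.

```python
import re

# Assumption: a tag starts with "<", ends with ">" and contains no ">" in between.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(html):
    """Remove everything that looks like an HTML tag and keep the text content."""
    return TAG_PATTERN.sub("", html)

print(strip_tags("<p>Hello <b>world</b>!</p>"))
```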

Search for specific words or patterns in a character string

A regular expression can be used to quickly search for a specific word or pattern in a string.

Example syntax:

\b(Word1|Word2|Word3)\b
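
In the sketch below, the placeholders Word1 to Word3 are replaced by three hypothetical search terms. The \b word boundaries make sure that only whole words are found, so "errors" does not count as a hit for "error".

```python
import re

# Hypothetical search terms standing in for Word1, Word2 and Word3.
KEYWORDS = re.compile(r"\b(error|warning|failed)\b", flags=re.IGNORECASE)

log_line = "Warning: disk check failed, but no errors were logged."
hits = KEYWORDS.findall(log_line)
print(hits)
```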

Replace text

A regular expression can be used to replace all occurrences of a particular word or pattern in a string with another word or pattern.

Example syntax:

(Word1|Word2|Word3)
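
Replacement can be sketched with re.sub. The names used here are placeholders chosen for the example. The second call shows that the replacement text may refer back to the matched group with \1.

```python
import re

sentence = "Alice met Bob and Alicia."

# Replace every occurrence of one of the listed words with fixed text.
masked = re.sub(r"(Alice|Bob)", "[name]", sentence)

# Reuse the matched word in the replacement via the backreference \1.
marked = re.sub(r"(Alice|Bob)", r"<\1>", sentence)

print(masked)
print(marked)
```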

Extract data from structured files

A regular expression can be used to extract specific data from structured files such as CSV, JSON and logs.

Example syntax:

(\w+)=(\d+)
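
Applied to a log line, the pattern yields pairs of key and numeric value. The log line below is invented for illustration; note that user=anna is skipped because its value is not a number.

```python
import re

PAIR_PATTERN = re.compile(r"(\w+)=(\d+)")

# Hypothetical log line in key=value format.
log_line = "status=200 bytes=5120 user=anna"

# findall returns one (key, value) tuple per match.
pairs = PAIR_PATTERN.findall(log_line)
metrics = {key: int(value) for key, value in pairs}
print(metrics)
```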

Validation of credit card numbers

A regular expression can be used to ensure that a given string is a valid credit card number by comparing it to a pattern that defines the structure of a valid credit card number.

Example syntax:

^(?:4[0-9]{12}(?:[0-9]{3})?|[25][1-7][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$
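
The following sketch applies the pattern above in Python. The function name and the removal of spaces and hyphens before matching are assumptions of this example. The check covers the structure only, not whether a number was actually issued; the inputs in the usage lines are publicly documented test numbers.

```python
import re

# The pattern from the text, split across two string literals for readability.
CARD_PATTERN = re.compile(
    r"^(?:4[0-9]{12}(?:[0-9]{3})?|[25][1-7][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}"
    r"|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$"
)

def has_card_number_format(raw):
    """Return True if the input has the structure of a card number."""
    digits = re.sub(r"[ -]", "", raw)  # drop spaces and hyphens used for grouping
    return CARD_PATTERN.fullmatch(digits) is not None

print(has_card_number_format("4111 1111 1111 1111"))
print(has_card_number_format("1234567890123456"))
```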

Tokenisation of a sentence

Regular expressions can be used to break down a sentence into words and punctuation marks.

Example syntax:

\w+|[^\w\s]+
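
With re.findall, the pattern splits a sentence into runs of word characters and runs of punctuation. A small sketch with a made-up sentence:

```python
import re

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

tokens = TOKEN_PATTERN.findall("Regex is fun, isn't it?")
print(tokens)
```

The apostrophe in "isn't" becomes a token of its own, which shows the limits of such a simple tokeniser.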

Data extraction from natural language text

A regular expression can be used to extract specific information such as names, dates, prices, etc. from natural text.

Example syntax:

\b

Redundancy (information theory)

What is redundancy?

The term redundancy comes from the Latin word "redundare" and means "overflowing" or "present in excess". In computer science, redundancy refers to excess data whose absence would not cause a loss of information. Basically, a distinction is made between intended and unintended redundancy.

What are examples of redundancies in computer science?

Information transmission

In the transmission of information and messages, redundancy is used for the detection of errors. The part of the message that does not contain any relevant information is referred to as redundant. It consists of additional bits that can, for example, represent functions in the message. Higher redundancy also allows errors to be corrected, so that information lost in a transmission can be restored under certain circumstances. How much redundancy is needed depends on the fault tolerance of the application. For example, IP telephony is more fault-tolerant than transactions at a bank. How many errors a code can detect or correct is determined by its Hamming distance, which counts the positions at which two character strings of equal length differ. For binary coded numbers, for example, the two values are combined with an XOR operation and the digits that are set to 1 in the result are counted.
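
The Hamming distance can be sketched in a few lines of Python. Both function names are chosen for this example; the first follows the XOR description for binary numbers, the second compares character strings position by position.

```python
def hamming_distance_bits(a, b):
    """Hamming distance of two non-negative integers via XOR."""
    # XOR sets a bit to 1 exactly where the two numbers differ.
    return bin(a ^ b).count("1")

def hamming_distance_str(s, t):
    """Hamming distance of two strings of equal length."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(s, t))

print(hamming_distance_bits(0b1011101, 0b1001001))
print(hamming_distance_str("karolin", "kathrin"))
```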

The redundancy of the code is calculated from the difference between the average source code word length L(C) and the entropy H(X) of the information.

The redundancy of the source is determined from the difference of maximum entropy Hmax(X) and entropy H(X).
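
Both quantities can be computed directly from the symbol probabilities. The sketch below assumes a memoryless source, measures everything in bits and takes Hmax(X) as the entropy of equally likely symbols, i.e. log2 of the alphabet size. The probabilities and code word lengths are example values.

```python
import math

def entropy(probabilities):
    """Entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def source_redundancy(probabilities):
    """Hmax(X) - H(X), with Hmax(X) = log2(number of symbols)."""
    return math.log2(len(probabilities)) - entropy(probabilities)

def code_redundancy(probabilities, code_lengths):
    """L(C) - H(X), where L(C) is the average code word length."""
    average_length = sum(p * n for p, n in zip(probabilities, code_lengths))
    return average_length - entropy(probabilities)

probs = [0.5, 0.25, 0.125, 0.125]              # example source with four symbols
print(entropy(probs))                          # H(X)
print(source_redundancy(probs))                # redundancy of the source
print(code_redundancy(probs, [2, 2, 2, 2]))    # fixed-length 2-bit code
print(code_redundancy(probs, [1, 2, 3, 3]))    # variable-length code
```

For this source, the fixed-length code carries a quarter of a bit of redundancy per symbol, while the variable-length code has none.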

Coding

In coding theory, a distinction is made between distribution redundancy and binding redundancy. Distribution redundancy refers to the different probabilities of occurrence of the characters of an alphabet. Binding redundancy, on the other hand, means that certain characters are more likely to occur after certain other characters. For example, the letters "c" and "h" occur less often than other characters, but when they do occur, it is usually in combination.

The aim of source coding is to eliminate superfluous data in order to make maximum use of the information channel. However, the relevant information of a message must be preserved. An example of a low-redundancy coding is Huffman coding. Here, characters that occur more frequently in a source are represented by fewer bits than rarer symbols. With the help of a code tree, the characters are assigned to their code words. Decoding is done bit by bit, starting at the root. This enables lossless compression and transmission.
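
The construction can be sketched in Python with a priority queue. This is one possible implementation under simple assumptions (character frequencies taken from the text itself, ties broken by insertion order); the function names are chosen for this example. Instead of walking an explicit tree, decoding collects bits until they form a complete code word, which is equivalent for a prefix-free code.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix-free code table {character: bit string} for text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}  # a single symbol still needs one bit
    # Heap entries: (weight, tie breaker, {symbol: code word so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # the two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, table):
    return "".join(table[ch] for ch in text)

def decode(bits, table):
    inverse = {code: sym for sym, code in table.items()}
    result, current = [], ""
    for bit in bits:  # read bit by bit until a complete code word is found
        current += bit
        if current in inverse:
            result.append(inverse[current])
            current = ""
    return "".join(result)

message = "abracadabra"
table = huffman_code(message)
bits = encode(message, table)
print(table, len(bits), decode(bits, table))
```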

Databases and database structures

In database systems, redundancies are undesirable because they lead to data anomalies. If several identical data sets exist, it may not be clear which one should be accessed. Redundancy also makes it harder to keep the database consistent and to maintain it. In addition, redundant data can consume a lot of storage space.

An example is the contact details of a person when buying from an online shop. If name, address and customer number occur with every order, these are redundant data records.

Excess information is reduced by normalising the database schema. Relational database systems represent data in tables. Data sets from different tables can be linked to each other by their attributes. In normalisation, the data is put into atomic form and each table column is constructed to contain similar values. In addition, every non-key attribute must depend only on the primary key and not on other non-key attributes.
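
The online shop example can be sketched with Python's built-in sqlite3 module. The table and column names as well as the sample data are assumptions made for this illustration. Customer details are stored once and orders refer to them by key, so an address change has to be made in only one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    address     TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    item        TEXT NOT NULL
);
INSERT INTO customers VALUES (1, 'Ada Example', '1 Sample Street');
INSERT INTO orders VALUES (100, 1, 'keyboard'), (101, 1, 'monitor');
""")

# The address is stored once, so a single UPDATE keeps all orders consistent.
conn.execute(
    "UPDATE customers SET address = ? WHERE customer_id = ?", ("2 New Road", 1)
)

# A join reassembles the full order view from the two tables.
rows = conn.execute("""
    SELECT o.order_id, c.name, c.address, o.item
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows)
```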

However, sometimes redundant data in a database is necessary, such as key redundancies. Keys are identifiers that uniquely identify data sets. Redundant information is also deliberately preserved when the effort of normalisation would be too great. Denormalisation then serves to improve the running time.

Random Forest

What is Random Forest?

Random Forest is an algorithm from the field of machine learning, a branch of artificial intelligence, that can be applied to classification or regression tasks. Classification is about assigning a data object to a particular class. Regression, on the other hand, aims to estimate the value of a variable based on its dependence on other variables.

The term Random Forest was introduced by statistician Leo Breiman and is based on the use of decision trees. By creating many random decision trees, a "random forest" of trees is created.

How does a Random Forest work?

To create a forest of trees, many individual decision trees must first be generated. They are created in a randomised way so that they are as uncorrelated as possible. Each tree consists of several branches/nodes that lead, after several levels, to an end point (leaf), which represents a class. At each node, a decision rule passes the data object on to one of the branches, and this is repeated until the object reaches an end point.

To prevent the decision trees from correlating with each other, the principle of bagging (short for bootstrap aggregation) is applied. For this purpose, each decision tree is trained on a different random sample drawn from the training data, so that the samples have different distributions. The resulting variation in the decision nodes is intended to prevent the trees from being correlated with each other.

After creating the defined number of decision trees, the algorithm works based on the ensemble method by considering multiple decision trees for prediction. This method has the advantage over using a single decision tree that the decisions of a large number of predictors can counteract outliers and thus increase the reliability of the result. Thus, the prediction of a random forest regressor corresponds to the average of the predictions of the individual decision trees.
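
The interplay of bagging and averaging can be sketched in plain Python. This is a strongly simplified illustration, not a full Random Forest: each "tree" is only a decision stump (a single split) on one input variable, and all function names and the toy data are assumptions of this example. Libraries such as Scikit-learn grow deeper trees and additionally randomise the features considered at each split.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean value per side."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict a constant
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Bagging: train every stump on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    """Regression forest: average the predictions of all trees."""
    return mean(tree(x) for tree in trees)

xs = list(range(10))        # toy data: y jumps from 0 to 10 at x = 5
ys = [0] * 5 + [10] * 5
trees = fit_forest(xs, ys)
print(predict(trees, 1), predict(trees, 8))
```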

Random Forest belongs to the category of supervised learning. In this type of machine learning, the algorithm's training data is labelled, meaning that the input data is already mapped to the correct target data. Based on this, the system is supposed to learn to predict new data correctly.

In which software can a random forest be implemented?

Among other things, the method can be used in Scikit-learn, the R programming language, H2O or Weka.

  • Scikit-learn is a Python library that is mainly used for classification and regression algorithms as well as visualisations in the field of machine learning.
  • The programming language R is an interpreted language that was developed for statistical calculations and is very widely used for them in both science and business. The name R can be traced back to the first letter of the first names of its creators Ross Ihaka and Robert Gentleman as well as to the programming language S, on which the syntax of R is strongly based.
  • H2O is open-source software from the company H2O.ai and is mainly used for algorithms in the field of statistics and machine learning. The software can also be operated in Microsoft Excel via an API, for example. During the calculation of the algorithm, approximate results are displayed so that parameters can still be changed during the calculation process. The visualisation of the method is generally one of its advantages.
  • Weka (Waikato Environment for Knowledge Analysis) was developed by the University of Waikato in New Zealand and offers solutions for classification and cluster analysis as well as for neural networks, which can be combined with the application of Random Forest.

Reasoning System

What is a Reasoning System?

A reasoning system is a software system that generates conclusions from an available knowledge base and uses logical techniques such as deduction and induction. Reasoning systems play an extraordinarily large role in the implementation of artificial intelligence and in knowledge-based systems. In principle, all existing computer systems are such systems, because they all automate certain types of logic or decisions.

Normally, however, the term is reserved for systems that perform a more complex type of reasoning. For example, systems that apply direct rules, such as calculating VAT or a customer discount, are not considered reasoning systems in the strict sense, whereas systems that make logical inferences about medical diagnoses or mathematical theorems are. There are two modes in which reasoning systems operate: interactive mode and batch mode. Both modes can perform the reasoning process with user guidance to determine the best answer.

Types of Reasoning Systems

There are different reasoning systems that have become established in different areas:

Clinical or professional reasoning

In clinical reasoning, the following areas can be distinguished:

  • Scientific Reasoning (SR): subject-specific, profession-specific background knowledge
  • Interactive Reasoning (IR): takes place in interaction with other individuals, with thinking happening on the relational level.
  • Conditional Reasoning (CR): this concerns ideas about the future and also conditions under which possible futures could occur.
  • Narrative Reasoning (NR): here, thinking takes place in stories and in relation to persons and institutions.
  • Pragmatic Reasoning (PR): the ability to act according to pragmatic considerations.
  • Ethical Reasoning (ER): reasoning determined by attitudes, stances or values.

Case-based Reasoning System

Case-based reasoning (CBR) works with a case base (case memory) and imitates human behaviour: the solution to a given problem is guided by the solution to a similar, previously solved problem. Case-based reasoning is an approach to modelling human thinking. With this approach, intelligent systems can be built. For this purpose, past experiences are stored as cases. These cases are used to solve new tasks. The task classes of CBR systems include the analytical tasks of classification, diagnosis, evaluation, decision support and prediction, as well as the synthetic tasks of configuration, design and planning.
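
The retrieval step of case-based reasoning can be sketched as a search for the most similar stored case. The case base, the attribute names and the similarity measure (share of matching attributes) below are invented for illustration; real systems typically use weighted, domain-specific similarity measures and adapt the retrieved solution.

```python
# Hypothetical case base: (problem description, known solution).
CASE_BASE = [
    ({"symptom": "no_power", "device": "laptop"}, "replace charger"),
    ({"symptom": "no_sound", "device": "laptop"}, "reinstall audio driver"),
    ({"symptom": "no_power", "device": "printer"}, "check power cable"),
]

def similarity(a, b):
    """Share of attributes on which the two problem descriptions agree."""
    keys = set(a) | set(b)
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)

def solve(problem, case_base):
    """Reuse the solution of the most similar stored case."""
    _, solution = max(case_base, key=lambda case: similarity(problem, case[0]))
    return solution

new_problem = {"symptom": "no_sound", "device": "tablet"}
proposal = solve(new_problem, CASE_BASE)
print(proposal)

# Once the proposal has been confirmed, the new case can be retained.
CASE_BASE.append((new_problem, proposal))
```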

Machine learning systems

Machine learning deals with the computer-based methods for acquiring new knowledge and new skills as well as novel ways of organising existing knowledge. Both symbol-oriented and connectionist methods are understood under the term machine learning. The task of learning systems is to enable the system to perform the set tasks (global or concrete targets) progressively better after repetition than before. The improvement of the system's performance can be achieved by applying new or modified methods and knowledge. The tasks can finally be performed with improved quality (faster, more accurate, safer and more robust).

Rule-based system

What is a rule-based system?

A rule-based system is a knowledge-based system that allows rule-based reasoning. Such systems consist of a database of facts (fact base), a set of rules (rule base) and a control system, which is equipped with a rule interpreter (inference engine or business rule engine).

The rules are constructed according to the if-then-else principle. The IF part is called the premise and the THEN part is the conclusion. The control system identifies the appropriate rules, applies the selected rules and updates the database of facts. The selection mechanisms are data-driven or goal-driven.

Rule-based systems form the basis of expert systems. Rules are managed in a business rule repository, which is part of a business rule management system.

What are the applications for rule-based systems?

Rule-based systems are increasingly used in production planning and production control. These systems are used in particular in industries with a wide variety of consumer and investment goods. They are used in the furniture industry, in mechanical engineering, in the automotive industry and in the electrical industry.

Product configurators know what dependencies there are and inform about them. For example, certain combinations of features are excluded, while others are required. A customer can order a convertible, but it cannot have a "sunroof". When ordering a fully automatic "air conditioning system", the vehicle also needs a stronger "battery" at the same time.
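
The two dependencies from the vehicle example can be written down as data and checked by a small function. The following is a sketch with freely chosen identifiers, not the rule format of any particular configurator.

```python
# Combinations that must not occur together, with a message for the customer.
EXCLUSIONS = [
    ({"convertible", "sunroof"}, "A convertible cannot have a sunroof."),
]
# Options that automatically require another option: (trigger, required option).
REQUIREMENTS = [
    ("automatic_air_conditioning", "stronger_battery"),
]

def configure(selected):
    """Return the completed set of options and a list of rule violations."""
    options = set(selected)
    errors = [message for combo, message in EXCLUSIONS if combo <= options]
    for trigger, needed in REQUIREMENTS:
        if trigger in options:
            options.add(needed)
    return options, errors

options, errors = configure({"convertible", "sunroof", "automatic_air_conditioning"})
print(sorted(options))
print(errors)
```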

Rule-based systems are also used for the distribution of worldwide vehicle orders. For example, there are rules for vehicle and aggregate plants of car manufacturers. The rulebook of a car manufacturer has thousands of product and production rules.

How are rule-based systems structured?

Rule-based systems are the most common type of knowledge-based systems (expert systems). The components are a rule base (set of rules) and an inference mechanism (inference engine). The inference mechanism determines which of the rules to apply and there are several possible strategies that can be used. For example, there is either forward chaining or backward chaining of corresponding rules.
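
Forward chaining can be sketched as a loop that keeps applying rules to the known facts until nothing new can be derived. The representation of a rule as a pair of premises and conclusion, and the example facts, are assumptions made for this illustration.

```python
def forward_chain(facts, rules):
    """Derive all facts that follow from the initial facts.

    Each rule is a pair (premises, conclusion): if every premise is a
    known fact, the conclusion is added to the fact base.
    """
    known = set(facts)
    changed = True
    while changed:  # repeat until a full pass adds no new fact
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical rule base.
rules = [
    (("flu_suspected",), "recommend_rest"),
    (("has_fever", "has_cough"), "flu_suspected"),
]
derived = forward_chain({"has_fever", "has_cough"}, rules)
print(sorted(derived))
```

Backward chaining would instead start from a goal such as "recommend_rest" and work backwards through the rules to check whether its premises can be established.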

What are the rules in a rule-based system?

The rules are simply formalised conditional sentences. They have the form:

If A, then B

This means:

If A is true (fulfilled, proved),
then conclude that B is also true.

A and B are statements. The "if" part of a rule is called the premise or antecedent of the rule. The "then" part is called the conclusion or consequent. As soon as the premise of a rule is fulfilled, the rule is applied.

If this rule always applies, it is called a deterministic rule. If the consequence of a rule is connected with a corresponding action, then we have a production rule. These rules are often used in corresponding production systems for control purposes.

The rules are a good compromise between an understandable representation of knowledge and formal requirements. In cognitive science, rules are seen as components of information-processing processes.