Entity (computer science)

What is an entity in computer science?

An entity in computer science refers to a Unique and identifiable information object. These can be both real and abstract objects. The term "entity" comes from Latin and means "something that exists". In the IT industry, the term is often used in areas such as database management, programming and Search engines is used. Entities are represented by a combination of attributes and relationships and can be represented graphically in entity-relationship models. Together with entity types, attributes and relationships between entities, they are important components of data modelling.

Types and representation

An entity is usually represented by a Combination of attributes and relationships displayed. Attributes describe the properties of an entity, such as the first name, surname or place of residence of a person. The actual values of these attributes would be, for example, first name "Max" and last name "Mustermann". The interaction of attributes and attribute values creates information about the properties of an entity. The Entity typealso referred to as entity class, describes a specific group of entities. For example, a person with the attribute values "Max Mustermann" can belong to the entity type "customer" or "patient".

There are different types of entities used in computer science. One of them is the weak entitythat is dependent on another entity. An example of this is an order line that depends on an order.

Another type of entity is the strong entitythat is independent of other entities. An example of this is a customer that is not dependent on an order.

The Entity Relationship Model

The entity-relationship model is a widely used instrument in the generation of a data model for a Database. It abstracts the types of entities, their attributes and the relationships between them to create a graphical entity-relationship diagram that forms the basis for the actual data model. The model is composed of the diagram and the description of the elements used.

Relational databases use the entity-relationship model by equating entity types with tables and table rows with individual entities. The model is used in the conceptual phase of database development, as well as in the implementation of relational databases.

The Entity Relationship Model was introduced by Peter Chen in 1976 and has been an established de facto standard for data modelling ever since. It is an important part of database development and helps developers create a clear and understandable data model that captures the properties, attributes and relationships of entities in a database.

Examples for entities

An example of an entity in a Online shop application is a product. The product entity has attributes such as product name, description, price and stock. It also has relationships with other entities such as category and manufacturer.

An example of an entity in a Human Resources Management Application is a staff member. The employee entity has attributes such as first name, last name, address, date of birth and position. It also has relationships with other entities such as department and supervisor.

An example of an entity in a Financial management system is an account. The account entity has attributes such as account number, account name, current balance and currency. It also has relationships with other entities such as customer and transactions.

Decision tree

What is a decision tree?

A decision tree represents a Procedure for Classification From objects on the basis of which decisions can be made. Decision trees are mainly used in the field of stochastics for the representation of conditional probabilities, but also in the area of machine learning and in decision theory.

The graphical illustration of decision trees can be implemented with so-called tree diagrams, which, starting from the root node up to the leaf, can clearly display all decision options.

The clear interpretability and the transparency of the presentation options are among the advantages of decision trees. In addition, they can be expanded relatively flexibly. The good clarity of small trees can, however, turn into the opposite for human observation in the case of large and complex trees. Large trees, especially with a great depth, can become confusing relatively quickly and increase the computational effort for analysis. Furthermore, it is often difficult to represent and evaluate all attributes or decision options of each individual node.

Methodology

Every decision tree always has its origin in the root node mentioned above. Starting from the root node, the tree branches out to further nodes, which finally end in a leaf or the final classification. A distinction is made between decision nodes and chance nodes. Decision nodes are shown as rectangles in the graphical illustration and represent a query about an attribute that decides on the further course along the tree and the subsequent node.
Chance nodes are represented as a circle and describe the further course of the tree with a certain probability. The end of the decision tree is called a leaf, often represented as a triangle, and represents the result of the respective classification task. If each node of the tree has exactly two attributes or two possible branches, it is called a binary decision tree. Each decision tree can also be represented as a binary tree.

In the context of the analysis or classification of data objects, the process always starts at the root node. Subsequently, an input vector is queried for each decision node and compared with the attributes of the decision node. If the vector matches an attribute, this represents the path to the next subsequent node in the decision tree.
In the case of random nodes, the further path and thus the next successor node is determined on the basis of the existing or assumed probabilities. According to this methodology, the process continues until the end of the decision tree or the last leaf is reached. This immediately represents the result of the tree or the respective classification of the data object. This can be a monetary, numerical, cardinal or nominal value, which can be recoded beforehand for better mathematical processing if necessary.

Example of a decision tree in machine learning

A concrete example of the application of decision trees in the field of machine learning is the methodology of the Random Forest represents. In this algorithm, founded by the statistician Leo Breiman, decision trees are used to solve Classification and regression tasks applied.

As the name Random Forest suggests, not just one decision tree is used, but a large number. This creates a "forest" of decision trees. With Random Forest, the trees are generated randomly and uncorrelated. During the creation of the trees, so-called Bagging care was taken to ensure that the individual trees do not correlate with each other and thus negatively influence the quality of the result.

The input data is applied to several decision trees when applying the algorithm for the prediction by the so-called ensemble method or ensemble learning in order to achieve a high-quality result. This procedure has the advantage over considering only a single decision tree that outliers in the individual trees are counteracted and the result thus corresponds to the average of the individual trees.

Entropy (information theory)

What is entropy?

In information theory, an entropy is a measure that indicates an average information content of the output messages for a certain message source. The information-theoretical understanding of the term entropy goes back to Claude Shannon. In general, the more characters received from a given source, the more information is collected. Entropy, according to Shannon's original intention, should be used as the measure of a required bandwidth of a transmission channel. However, he generalised the findings and devised a state of entropy that is generally accepted as the measure of information content. If this is small, then the information text contains many redundancies or even statistical regularities.

The main branches of Shannon's information theory include the encoding of information, the quantitative measure that applies to the redundancy of a text, data compression and cryptography. Information theory is a theory that aims to quantify and qualify the information content of a data set.

What aspects arise in computer science?

Shannon understood the Entropy in computer science as a measure of information and thus he could combine thermodynamics with information theory. This resulted in new aspects and methods:

  • The cross entropy is a measure of model quality. It calculates the total entropy between the distributions. The cross entropy is usually used in the Machine learning used as a loss function. The cross entropy can be understood as a measure that originates from the field of information theory and is based on entropy states.
  • The Kullback-Leibler divergence is a certain distance measure between two different models. The intrinsic measure of difficulty and quality is applied in machine learning.
  • The entropy change (information gain) is used as a criterion in feature engineering.
  • Cross-entropy minimisation is used as a method of model optimisation.
  • Another role is played by conditional entropy, which clarifies how much entropy remains in one random variable given knowledge of another random variable.
  • The joint entropy makes statements about how many bits are needed to correctly encode both random variables.
  • Ultimately, there is a whole entropy algebra where one can calculate back and forth between marginal, conditional and joint entropy states.

How is entropy used in machine learning?

Entropy in machine learning is the most commonly used measure of impurity in all of computer science. The entropy state is maximum when two classes reach 1.00 and these classes occur within a superset with identical frequency. If one class gains quantitative dominance, the probability of such a class increases equally and the entropy state decreases. In machine learning, entropy tells us how difficult it is to predict an event.

Efficiency of an algorithm

What is the efficiency of an algorithm?

An algorithm is then particularly efficient if it requires as few resources as possible, i.e. hardly any computing time and memory, to solve a specified problem. The efficiency of an algorithm can be measured in various ways: as a rule, the runtime efficiency and the storage efficiency are used as the most important key figures. This can be measured particularly well in absolute and relative terms and can therefore also be interpreted and evaluated.

In a concrete implementation of a particular programming language, efficiency can be determined as a function of hardware and the input data. However, it is also possible to create an efficiency evaluation independent of the evaluation variable and the implementation. The Landau notation can be used to determine the efficiency of an algorithm.

On what concept is this efficiency based?

For precise mathematical efficiency analyses, the underlying data and a calculation model are needed. It is clarified which data types exist, how the data is stored and which operations are permitted. Then it is determined what time a certain operation on a defined data type requires. A random access machine with simple but "unlimited" memory is used for this purpose. Efficiency is represented as storage efficiency (storage effort), subjective efficiency (acquisition effort) and runtime efficiency (time effort).

Algorithms can be fully analysed in terms of correctness and efficiency. Sorting, searching and optimising can also be considered. Data structures have different efficiencies in storage and access. Decisive for the quality of an algorithm is above all the consideration of large inputs. For practical use, it must always be considered whether the input is large or rather small. In theory, however, asymptotic behaviour is important.

Efficiency is specified independently of the machine in the number of calculation steps. One can break this down according to the number of comparisons and additions or other arithmetic operations. The number of memory locations can also be specified. Thus, the concept of the efficiency of an algorithm can be considered machine-independent. In theory, a Turing machine is often used that has a tape on which input and output can be recorded. The tape is infinitely long and the individual calculation steps are considered.

Edge Case

What is an edge case?

An edge case is also called a boundary case or edge case. It describes a scenario in which a parameter is in an extreme situation. These extremes can be either in the minimum or maximum range.

In most cases, this results in problems and malfunctions, so an edge case should be found, tested and fixed. Unit tests are usually used for this in software development. These test the boundary conditions of a method, a function or algorithms.

By controlling the correct operation within the extremes (the edges), it can be assumed that the states in between will also work. This means that not every case that might occur needs to be controlled. If, for example, a function is to add 2 numbers, it is sufficient to test it with numbers that are as small and as large as possible. If everything works without errors, it can be assumed that the function will also work without problems for all numbers in between. This approach keeps the process of testing as efficient as possible.

What is the difference between an Edge Case and a Corner Case?

In contrast to the Edge Case, where an extreme situation only occurs with one parameter, the Corner Case defines a situation where several conditions are in extreme ranges (minimum or maximum).

The difference therefore lies in the number of operating parameters, which are at the extremes and thus cause problems.

An example of an edge case: A printer has a low ink level and no longer prints the text cleanly. The ink level property is at the minimum and the printer cannot do its job properly.

In the corner case, on the other hand, several conditions are in the extreme: the printer has a low ink level, the brightness is set to maximum and the printer is standing on a shaky surface. It is only the interaction of these 3 parameters that causes the printer not to function properly. If each condition occurred individually, the printer would have no problem. The initial situations in both cases are therefore different, but they usually lead to the same result, namely a malfunction. Since it is often not clear whether one parameter alone or several in combination cause a problem, edge case and corner case are often used as synonyms for an error that occurs.