Ontology

What is ontology?

In computer science, the term ontology describes a means for the unambiguous representation and communication of knowledge in a domain. In addition to a uniform terminology, this knowledge also covers relationships, hierarchies, rules and concepts.

The aim of an ontology in computer science is to provide information and knowledge clearly and without room for interpretation through this "common language". This network of relationships is implemented primarily in information systems, artificial intelligence and databases.

As early as the beginning of the 1990s, the concept of ontology was used in connection with artificial intelligence, and from there it spread to many areas of computer science.

Term development

The term ontology has its origins in philosophy, where it refers to the "doctrine of being". In the philosophical sense, one of the questions asked is how so-called entities (a being, or a concrete or abstract object) can be categorised or related to each other. The term metaphysics is often used synonymously with ontology. It goes back to the Greek philosopher Aristotle and literally describes that which comes "after physics".

On this definitional basis, questions arise about being, nothingness, finiteness and infinity, among others, which are also addressed in all religions. In addition to Aristotle, the German philosopher Immanuel Kant also dealt extensively with metaphysics. While the term has its origins in philosophy, other scientific disciplines, such as psychology, sociology and medicine, are increasingly taking it up in their research.

Examples of ontologies

An example of the application of knowledge representation in computer science is the so-called Semantic Web. This idea from World Wide Web founder Tim Berners-Lee is based on extending the conventional World Wide Web so that the meaning of information can be assigned unambiguously.

A further motivation for this endeavour is to facilitate communication and collaboration between people and machines. In addition to uniform rules, data models and syntax, the development of the ontology language Web Ontology Language (OWL) provides a remedy. A concrete application example is the unambiguous, conflict-free interpretation of the word "Washington" in its respective context. Since Washington can denote a city, a federal state, a person's name or a warship, among other things, a more precise definition is necessary.
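
The following minimal sketch shows how such a distinction could be expressed as typed resources, assuming the Python library rdflib is available; the example.org namespace and all resource names are illustrative rather than taken from any real ontology.

    # Minimal sketch of disambiguating "Washington" with typed resources.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/ontology#")  # illustrative namespace
    g = Graph()

    # Three different "Washington" entities, each with an explicit type
    g.add((EX.Washington_DC, RDF.type, EX.City))
    g.add((EX.Washington_State, RDF.type, EX.FederalState))
    g.add((EX.George_Washington, RDF.type, EX.Person))

    # Class hierarchy: cities and federal states are both places
    g.add((EX.City, RDFS.subClassOf, EX.Place))
    g.add((EX.FederalState, RDFS.subClassOf, EX.Place))

    # Ask only for the "Washington" that is a city
    for resource in g.subjects(RDF.type, EX.City):
        print(resource)  # http://example.org/ontology#Washington_DC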

Another application of ontologies in computer science lies in the field of artificial intelligence, primarily in machine-interpretable knowledge representation. With the help of the ontology's normalisation, rules and specifications, an inference engine can draw logical conclusions.
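
As an illustration of what such an inference engine does, the following plain-Python sketch applies a simple forward-chaining rule to derive new facts; the facts and the subclass hierarchy are invented for the example.

    # Minimal sketch of forward-chaining inference over ontology-style facts.
    facts = {("Washington_DC", "type", "City")}
    subclass_of = {("City", "Place"), ("Place", "Thing")}

    # Rule: if x has type A and A is a subclass of B, then x also has type B.
    # Repeat until no new facts can be derived.
    changed = True
    while changed:
        changed = False
        for subject, _, cls in list(facts):
            for sub, sup in subclass_of:
                new_fact = (subject, "type", sup)
                if cls == sub and new_fact not in facts:
                    facts.add(new_fact)
                    changed = True

    # Now also contains ("Washington_DC", "type", "Place") and (..., "Thing")
    print(facts)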

In medicine, ontologies are used, for example, in the Gene Ontology. Its aim is to provide and further develop databases that offer uniform information on the function of genes in biomedicine.

In psychology, the representation of relations is particularly widespread in the sub-field of psychosociology. It is used to grasp and categorise social phenomena such as groups, families, bonds and also personalities with uniform terms, and to describe their interactions.

Differences from taxonomy and epistemology

Ontology vs. taxonomy

While an ontology focuses on a network of connections and relations, a taxonomy describes strictly hierarchical relationships. The term taxonomy is derived from ancient Greek and roughly translates as "law of ordering". Taxonomy originated in the natural sciences and is still used extensively there to describe species, genera and orders. In computer science, taxonomies are used to represent hierarchical relationships and inheritance, as the sketch below illustrates.
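
In code, a taxonomy often appears as a class hierarchy. The following minimal Python sketch uses purely illustrative class names to show the "is-a" relationships a taxonomy encodes.

    # Minimal sketch: a taxonomy expressed as a class hierarchy.
    class Animal:            # most general level
        pass

    class Mammal(Animal):    # intermediate level inherits from Animal
        pass

    class Dog(Mammal):       # most specific level
        pass

    print(Dog.__mro__)               # strict hierarchy: Dog -> Mammal -> Animal -> object
    print(issubclass(Dog, Animal))   # True: hierarchical "is-a" relationship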

Ontology vs. epistemology

The term epistemology also comes from ancient Greek and describes the study of knowledge. Epistemology, often paraphrased as the theory of knowledge, asks how knowledge comes about and how it is justified. While epistemology deals with the acquisition and justification of knowledge, ontology focuses on the nature of being or reality.

Open Data

What is Open Data?

Open data is data that can be used, shared and further processed by the general public. Open data is often formulated as a demand and is strongly promoted by the Open Knowledge Foundation. In summary, the foundation defines open data as follows:

  • Duplicating the data must not incur any costs. This means, for example, that the data formats in which the files are saved are chosen appropriately and data sets are essentially complete; in short, it is "made easy" for data consumers to share the content.
  • In addition to sharing, the type of reuse also plays a role. In order to merge data with other sources, the data must be provided in an interoperable format (CSV or JSON have become established for this). At best, the provider offers interfaces based on common protocols such as SOAP or REST so that the databases are readable by both humans and machines (see the sketch after this list).
  • While the technical criteria above apply to open data, no social conditions may be attached to it: everyone must be able to use, share or further process the data, and certain groups of persons or fields of application must not be excluded.
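
As a rough illustration of such an interface, the following Python sketch reads a CSV data set from a REST endpoint; the URL and column names are hypothetical placeholders, not a real open data service.

    # Minimal sketch: consuming an open data set via a REST interface.
    import csv
    import io

    import requests

    # Hypothetical endpoint serving a CSV file
    response = requests.get("https://opendata.example.org/api/datasets/air-quality.csv")
    response.raise_for_status()

    # Parse the machine-readable CSV payload
    reader = csv.DictReader(io.StringIO(response.text))
    for row in reader:
        print(row["station"], row["pm10"])  # column names are assumptions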

It is not without reason that the federal government and the Länder are committed to initiatives such as Open Government for Open Data. Public offices can work more efficiently, private companies benefit from easy access to knowledge, and society gains from greater information transparency.

Legal backing through the Open Data Act

So that this progressive concept is actually implemented in practice, the first Open Data Act came into force in 2017. This legal basis obliges authorities to provide their data in machine-readable form. In connection with the Federal Government's Open Data Strategy, Germany thus laid the foundation for a solid ecosystem. The Open Data Act is being further developed together with Austria and Switzerland and is intended to ensure even more responsible, innovative and public-benefit-oriented data use in the future.

The most important German Open Data databases

  • The nationwide metadatabase GovData contains, in addition to the administrative data itself, a lot of information about the data, e.g. who created it, when and where.
  • GENESIS-Online, the database of the Federal Statistical Office, covers a broad range of topics of official statistics and is deeply structured by category. For GENESIS-Online and the related databases Regional Database Germany and Municipal Education Database, various interfaces are available to process the data efficiently.
  • The Open Data Platform Open.NRW is intended to serve as an information portal and, under the guiding principle of "Open Government in North Rhine-Westphalia", provides a constantly growing database consisting of administrative data on projects in the state.
  • The Ministry of Regional Development and Housing in Baden-Württemberg relies on geodata for Open Government and, with the Geoportal Baden-Württemberg, provides a comprehensive tool for interested associations, business representatives and citizens. A further point of contact for geographic reference data is the Geoportal Hesse. Geodata are also the main focus in the state of Lower Saxony, where the State Office for Geoinformation and Land Surveying of Lower Saxony (LGLN) provides the platform Open Geo Data.
  • Interesting information about the German capital can be found on the portal Berlin Open Data, where data sets from thematic fields such as education, health and transport can be viewed. The state of Schleswig-Holstein follows a similar approach: the web application Open Data Schleswig-Holstein offers open data on socially relevant topics such as culture, energy and the economy.

Open Source

What is Open Source?

The term open source (OS or OSS for short) denotes software whose source code is publicly accessible and can be viewed, changed and used by anyone. Most open source software can be used free of charge.

The opposite of open source is closed source. The source code of such software is not publicly accessible and may not be modified, freely used or passed on. It is distributed commercially with the help of licences.

Difference between open source and freeware

Freeware refers to software that is made available free of charge by its author. However, the source code is not freely accessible and may not be modified or distributed. That is the difference from OSS.

Advantages of Open Source

Low costs

Most open source software is accessible free of charge, and even paid OSS is for the most part much cheaper than comparable closed source alternatives.

Independence from commercial providers

Software from commercial providers serves their corporate goals. This creates a certain dependency, which can cause problems especially when the software no longer brings in enough profit. In such cases the manufacturer eventually stops supporting or offering the software, and the customer has to look for alternatives.

With open source, this problem does not exist to the same extent because there is little or no monetisation involved.

Individuality

Since the code can be edited at any time, it can also be customised at any time: unnecessary functions can be removed and missing ones added. In this way, individually suitable solutions can be developed and refined.

This is possible either with one's own expertise, with the support of the community or with the help of commercial experts.

Compatibility

Locking out competitors through proprietary data formats and systems runs counter to the open source idea, so much emphasis is placed on interoperability (the ability of a system to cooperate with other systems). As a result, there are far fewer compatibility problems with open source than with closed source.

Security

Because the code is reviewed many times by many developers, some of them highly skilled, errors and security gaps are noticed quickly. With closed source, this usually takes longer.

Disadvantages of Open Source

Dependence on an active community

There is no entitlement to a guarantee or manufacturer support, as there is with closed source applications. Therefore, with open source there is a certain dependence on an active community for support and further development.

High training and knowledge expenditure

OSS is usually not as well-known and beginner-friendly for laypersons as the widely used commercial products. Thus the use of OSS often requires more familiarisation, training and expertise.

What is popular open source software as a business solution?

For ETL, Reporting, OLAP/Analysis and Data Mining

Pentaho by Hitachi Vantara offers a collection of business intelligence software that is free of charge in its basic version. Solutions are provided for ETL, reporting, OLAP/analysis and data mining.

As an ETL tool, Pentaho Data Integration (PDI for short) offers connections to various databases. Further plug-ins make connections to other systems possible, for example to SAP with the help of ProERPconn and to Navision with the NaviX Table plug-in. Big data processing also counts among Pentaho Data Integration's strengths.
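
To illustrate what such an ETL step does in principle (independently of PDI), the following Python sketch extracts records from a CSV file, transforms them and loads them into a database; the file, table and column names are illustrative assumptions.

    # Minimal generic ETL sketch (not PDI-specific).
    import csv
    import sqlite3

    # Extract: read raw records from a CSV source
    with open("orders.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: keep only paid orders and normalise the amount to a float
    cleaned = [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows
        if r.get("status") == "paid"
    ]

    # Load: write the result into a target database table
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO orders (order_id, amount) VALUES (:order_id, :amount)", cleaned
    )
    con.commit()
    con.close()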

Pentaho BI Suite offers one of the few business intelligence solutions in the open source sector.

For Data Virtualization

Data Virtualisation can be seen as the opposite of the ETL process, as the data remains in its original systems and the virtualisation component accesses it directly and makes it available for use.
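
The following Python sketch illustrates this idea conceptually: two sources stay where they are and are only read on demand through a unified view. The source files and field names are invented for the example and do not correspond to any particular product.

    # Conceptual sketch of data virtualisation: data stays in its source
    # systems and is combined only when it is read.
    import csv
    import json

    def customers_from_crm(path="crm_export.json"):
        # Data remains in the CRM export; read lazily on access
        with open(path) as f:
            for record in json.load(f):
                yield {"id": record["id"], "name": record["name"], "source": "crm"}

    def customers_from_shop(path="shop_customers.csv"):
        # Data remains in the shop system's CSV; read lazily on access
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield {"id": row["customer_id"], "name": row["full_name"], "source": "shop"}

    def unified_customer_view():
        # The "virtual" layer combines both sources without copying them into a warehouse
        yield from customers_from_crm()
        yield from customers_from_shop()

    for customer in unified_customer_view():
        print(customer)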

Denodo Express from Denodo Technologies Inc. offers an open source solution for data virtualisation. It connects and integrates local and cloud-based data sources, as well as Big Data, with each other. This data is made available to end users, enterprise applications, dashboards, portals, intranet, search and other tools.

OpenLooKeng from Huawei has also been available as open source since mid-2020 and offers uniform SQL interfaces for accessing different data sources.

For data labelling

Data labelling is essential for machine learning: it provides the existing data with the required annotations, for example whether a picture shows a person or not.

There are several data labelling tools available as open source. Some are specialised in certain file formats and others can process all of them.
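
To make the idea concrete, the following sketch shows what a handful of labelled records might look like once exported by such a tool; the field names and label values are illustrative assumptions rather than the output format of any specific tool.

    # Minimal sketch of labelled training records.
    import json

    labelled_examples = [
        {"image": "img_0001.jpg", "label": "person"},
        {"image": "img_0002.jpg", "label": "no_person"},
    ]

    # Labelling tools typically export such annotations, e.g. as JSON
    with open("labels.json", "w") as f:
        json.dump(labelled_examples, f, indent=2)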

Examples of data labelling tools for images only:

  • bbox-visualizer
  • CVAT
  • hover
  • Labelme
  • Yolo-mark

Examples of data labelling tools for text only:

  • dataqa
  • doccano

Examples of data labelling tools for audio, images and text:

  • awesome-data-labelling
  • Label studio

OpenGPT-X

What is OpenGPT-X?

OpenGPT-X is a European project in which a large language model is to be developed. Language models are used, for example, for chatbots, but also for writing texts, understanding complex texts or conducting conversations. GPT stands for "Generative Pretrained Transformer"; the "X" is a placeholder for the version.

A consortium of well-known European companies, institutes and universities is participating in the project under the leadership of the Fraunhofer Institute. It was created, among other things, to build European sovereignty in the field of large language models and to minimise dependence on the USA and China. With GPT-3 (Generative Pretrained Transformer 3), developed by the company OpenAI and presented in May 2020, the large third-generation language model was introduced in the USA. In June 2021, China responded to the US pioneer with the second version of Wu Dao ("Understanding the Laws of Nature"), Wu Dao 2.0.

What are the goals of the European joint project?

The primary aim of the project is to preserve European digital sovereignty and independence with a home-grown AI language model. European characteristics in the areas of data protection, values and linguistic diversity are to be taken into account in the model.

The OpenGPT-X project is designed to enable data-based business solutions in the GAIA-X ecosystem. GAIA-X is a project to create a networked and secure data infrastructure in Europe to use and share data in a decentralised way. The name Gaia is derived from Greek mythology and describes a deity who is regarded as the personified earth.

OpenGPT-X is responsible for building a node for large AI language models and innovative language application services in the GAIA-X project.

What differentiates OpenGPT-X from other language models such as GPT-3?

In OpenGPT-X, special attention is paid to the European context of the AI language model. This primarily concerns the integration of the many European languages as well as European ethical values and culture.

In addition, OpenGPT-X is intended to meet European standards of data protection. These points are often cited as criticisms of alternatives such as GPT-3 or Wu Dao 2.0; the European solution is meant to improve on them and thereby protect the economic interests of Europe as a business location.

Furthermore, this approach allows governmental and legal concerns to be taken into account, such as the observance of European values, the European cultural context and applicable regulations in the development of language applications. OpenGPT-X is also intended to run within the decentralised cloud solution GAIA-X and thus form a building block of the European data infrastructure.

Overfitting

What is overfitting?

Overfitting is a situation in the use of artificial intelligence in which a model is fitted too closely to a particular given data set. Statistically, too many explanatory variables are used to specify the model. Overfitting can thus be compared to a human hallucination, in which things are seen that are not actually there.

In machine learning, overfitting is undesirable because the algorithm recognises patterns in the data set that do not actually exist and bases its learned model on them. Machine learning and deep learning algorithms are supposed to derive rules that can be applied successfully to completely unknown inputs and provide accurate predictions.

An overfitted algorithm can also deliver incorrect results due to faulty inferences: the model is trained on the data so often that it practically learns the data by heart, yet it cannot deliver useful results for new inputs. Overfitting typically shows itself in a significant gap between training and testing error. Several factors favour overfitting; the number of observations and measurement points plays a major role in model building.

The selection of the data set determines whether inferences about reality can be drawn from the data. If certain rules or trends are to be derived from the available data, the data set must contain suitable data for this. Overfitting is also favoured by model misspecification and by bias in the sample selection, which may stem from biased data collection or evaluation. Training may also have been too intensive: an overtrained system handles the existing data very well, but not new, unknown data. The sketch below illustrates the typical gap between training and test error.
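
The following scikit-learn sketch illustrates this gap on synthetic data; the deliberately high polynomial degree is chosen only to provoke overfitting, and all numbers are illustrative.

    # Minimal sketch of overfitting with polynomial regression (scikit-learn).
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Synthetic, noisy data
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Degree 15 is far too flexible for 45 training points
    model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
    model.fit(X_train, y_train)

    # A large gap between these two errors is the typical sign of overfitting
    print("training error:", mean_squared_error(y_train, model.predict(X_train)))
    print("test error:    ", mean_squared_error(y_test, model.predict(X_test)))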

How can overfitting be avoided?

There are several techniques in predictive data mining for avoiding overfitting (for example with neural networks or classification and regression trees). They are used to control the model's complexity (flexibility).

To avoid overfitting, one should plan a sufficiently large time window, because a truly unbiased and therefore representative sample takes time to collect. Sound preliminary considerations are important: it must be clarified which variables are relevant. The data set should be divided into training and test data sets, as in the sketch below.
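
The following scikit-learn sketch illustrates these countermeasures on synthetic data: a train/test split, limited model complexity and regularisation tuned by cross-validation. All parameter values are illustrative.

    # Minimal sketch of reducing overfitting: hold out test data, limit model
    # complexity and add regularisation, selected by cross-validation.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Synthetic, noisy data
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

    # Hold back test data that the model never sees during training
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Search over polynomial degree and regularisation strength; cross-validation
    # favours the combination that generalises best, not the one that memorises
    pipeline = make_pipeline(PolynomialFeatures(), Ridge())
    param_grid = {
        "polynomialfeatures__degree": [1, 3, 5, 9],
        "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, param_grid, cv=5)
    search.fit(X_train, y_train)

    print("best parameters:", search.best_params_)
    print("test R^2 score: ", search.score(X_test, y_test))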