Big data offers a great opportunity to increase a company's turnover and has therefore become a cornerstone of success across sectors. To deal with large amounts of data, a company needs a data collection that is suitable for analysis. This is where data warehousing comes into play. Alongside data warehousing there are now alternative methods, such as data lakes or data mesh, which can meet some user needs better. So which form of data collection is most efficient in which situation?
The types of data collection and the ways they are evaluated have changed considerably over the last decades. Because of new approaches that work with cloud-based raw data, the long-established data warehouse is in danger of falling behind. However, the cloud also brings developments from which data warehousing benefits. To evaluate this scenario and make an informed decision, it is important to understand how data warehousing and its possible competitors work.
Emergence of the Data Warehouse
The invention of the data warehouse in the 1980s laid the foundation for information management in large companies. At the beginning of digitalisation, the desire of companies to centrally collect and analyse data in a larger context grew. In this way, internal as well as external decisions can increasingly be made on the basis of facts. This leads to various advantages in different areas of the company, from which consumers can also benefit.
How does a data warehouse system work?
The term generally stands for a database system based on the data warehouse concept. It forms the basis for analysis-oriented information management. For this purpose, data is fed into data management software according to the waterfall model: data is first extracted from local storage into a central database and then transferred into a relational or multidimensional data model. On this basis, the data set can be evaluated centrally, and the individual operational systems are relieved of analytical work and supplemented. However, high costs for hardware and software licences, together with the time and effort involved, have long limited the success of data warehousing.
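The extract, transform and load steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two source systems (`shop_orders`, `pos_sales`), their field names and the `fact_sales` table are hypothetical, and an in-memory SQLite database stands in for the central warehouse.

```python
import sqlite3

# Hypothetical records as they might sit in two separate operational systems.
shop_orders = [
    {"order_id": 1, "customer": "Alice", "amount_eur": "19.90", "date": "2023-01-05"},
    {"order_id": 2, "customer": "Bob", "amount_eur": "5.00", "date": "2023-01-06"},
]
pos_sales = [
    {"receipt": "A-7", "buyer": "alice", "total_cents": 1250, "day": "2023-01-05"},
]


def extract():
    """Pull rows from each source system (here: plain in-memory lists)."""
    return shop_orders, pos_sales


def transform(shop, pos):
    """Map both source formats onto one common, agreed schema."""
    rows = []
    for r in shop:
        cents = round(float(r["amount_eur"]) * 100)
        rows.append((r["customer"].lower(), r["date"], cents, "shop"))
    for r in pos:
        rows.append((r["buyer"].lower(), r["day"], r["total_cents"], "pos"))
    return rows


def load(conn, rows):
    """Write the harmonised rows into the central (assumed) fact table."""
    conn.execute(
        "CREATE TABLE fact_sales "
        "(customer TEXT, sale_date TEXT, amount_cents INTEGER, source TEXT)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(*extract()))

# Central evaluation: one query across data from both operational systems.
revenue_by_customer = dict(
    conn.execute("SELECT customer, SUM(amount_cents) FROM fact_sales GROUP BY customer")
)
```

The point of the sketch is that the classification work (common customer key, common currency unit) happens before loading, which is what makes the later query simple.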
Advantages of a data warehouse
Even though newer developments such as data mesh can already be observed, data warehousing still has a well-founded raison d'être. The American computer scientist and author Bill Inmon describes the following characteristics of a data warehouse:
- Subject orientation
- Time orientation
- Simplified characterisation
Meanwhile, however, other forms of data collection and analysis offer advantages in these areas. In addition, data warehousing also harbours potential problems that need to be discussed.
Despite various advantages, the waterfall approach suffers from limitations in use, because large storage capacities are needed in combination with software licences that have to be purchased. In the 1980s, data records were still stored locally, which cloud computing has largely changed today. At that time, however, large storage capacities caused considerable financial burdens, which at first ran counter to goals such as cost reduction and revenue increase.
A data warehouse also requires globally coordinated key figures so that analyses can be carried out effectively. This calls for an extensive and time-intensive coordination and specification phase. The implementation that follows often reveals errors and inconsistencies, which in turn can lead to poor usability or further costs.
How do Data Lakes work?
In 2010, the "data lake" construct opened up a new world of data collection and analysis apart from data warehousing. Here, as much internal and external data as possible is collected, and it is merged and classified only when it is actually used. Because the data is less complex, this consequently requires significantly less storage capacity.
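The idea of classifying data only at the moment of use can be illustrated with a small Python sketch. It is an assumption-laden toy example: the event records, their fields and the `read_orders` helper are hypothetical, and a list of JSON strings stands in for files in a data lake.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document each).
raw_lake = [
    '{"type": "order", "customer": "alice", "amount": 19.9}',
    '{"type": "click", "page": "/home"}',
    '{"type": "order", "customer": "bob", "amount": 5.0, "coupon": "X1"}',
]


def read_orders(lines):
    """Apply a schema at read time: scan every record, keep only the orders."""
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "order":
            # Only the fields this particular analysis needs are picked out.
            yield {"customer": record["customer"], "amount": record["amount"]}


orders = list(read_orders(raw_lake))
```

Nothing was classified when the data was written, so new record types or extra fields (such as the coupon) can be stored without any preparation. The price is that each analysis has to scan the raw records and decide for itself what is relevant.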
Advantages of Data Lakes
The stored data is unformatted raw data, which requires significantly less storage space and enables flexible and agile access. In this way, "big data" can be processed more efficiently. Various cloud storage providers now also offer evaluation and analysis of the data pools stored with them.
Effort & flexibility
Storing raw, unclassified data also makes it easier to include the latest data in a data collection. By comparison, the classifications that have to be carried out regularly in data warehouses slow down the analysis process.
Limitations of Data Lakes
If analyses are carried out with data from a data lake, it is difficult to exclude parts of the raw data because the data have not yet been classified. This means that the entire data set is always worked with from the ground up, even if data scientists can make targeted selections.
Since the data collection for data lakes is now mostly located in a cloud, it is essential that the security of this cloud can be guaranteed. However, well-known providers today operate at a high security standard.
Dealing with data lakes without BI tools or modelled access layers requires software specialists who form the interface between IT and business. Accessibility is therefore only possible to a limited extent without optimisations.
According to the latest developments, data pools are now used for the even newer "data mesh" approach. This is a development in which different data lakes are combined according to subject domains and used for analysis purposes. This targeted structuring of unclassified raw data, as found in data lakes, leads to better usability of the different data pools and is currently considered a very promising approach.
Advantages of Data Mesh
The decentralised data mesh approach offers improvements in areas of organisation and scalability, leading to more transparent responsibilities in implementation. To this end, efforts are made to keep the cooperation between data collection and data processing as close as possible in order to achieve more quality.
Limitations of Data Mesh
For the data mesh paradigm to work optimally, it requires improved organisational structures and clear lines of responsibility. In addition, there should be clear information about the ownership and origin of individual data so that no ambiguities arise.
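One way to picture the requirement for clear ownership and origin is a small catalog that refuses entries where either is missing. The following Python sketch is purely illustrative: the `DataProduct` class, its fields and the `register` function are hypothetical names and do not refer to any particular data mesh product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for one domain-owned data set."""
    name: str
    domain: str
    owner: str          # team responsible for the data
    source_system: str  # where the data originates


def register(catalog, product):
    """Add a product to the catalog, rejecting unclear ownership or origin."""
    if not product.owner.strip() or not product.source_system.strip():
        raise ValueError(f"{product.name}: owner and source_system are required")
    catalog[(product.domain, product.name)] = product


catalog = {}
register(catalog, DataProduct("orders", "sales", "sales-data-team", "webshop-db"))

try:
    # No owner given, so this entry should be refused.
    register(catalog, DataProduct("clicks", "marketing", "", "tracking-api"))
    rejected = False
except ValueError:
    rejected = True
```

Keying the catalog by domain and name mirrors the grouping of data pools by subject domain; in practice the organisational agreement behind such a rule matters more than the check itself.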
Data pools vs. data warehouse
With new storage options in the cloud, the importance of Apache Hadoop as the basis for many data lakes dwindled. The complementary architecture patterns also led to a very different classification of data lakes and data warehousing, not least due to the technical characteristics of the Apache Hadoop stack components.
Due to new technical developments, it is now possible to run a classic enterprise data warehouse (eDWH) on the same technological basis as data lakes. The complex issues to be solved in an eDWH remain the same regardless of the technology. Examples are:
- Interface connection
- Data modelling
- Key figure definition
- Metadata description
- Process and responsibility definition for governance tasks
This development has improved the user-friendliness of data warehousing and makes the combined use of data warehouses and data lakes easier.
The future of data warehousing and data lakes
Data warehousing has benefited from the technical development since the 1980s, which has definitely increased its competitiveness with data lakes. Today, data warehouses continue to make sense for certain companies or institutions due to new forms of storage such as the cloud and are mostly used in combination with data lakes.
What continues to make the data warehouse concept attractive is its accessibility to all employees of a company, whereas dealing with data lakes requires specialists. These big data software specialists are in high demand on the labour market and therefore difficult to find.
It can be assumed that the cloud and possible new developments in this area will be the decisive factor for the future of data warehousing. Should usability for various professional groups and "non-specialists" also improve in the future through BI tools or modelled access layers, data warehousing will perhaps be more relevant than ever.