On average, every person today generates 600-700 megabytes of data per day, at work or in private. Particularly large amounts of data are generated in the area of Industry 4.0. Sensors that report values about their environment and data that is recorded on the internet and on networked IoT devices are just two of the countless data sources that lead to a veritable flood of information in companies today. That's where the data lake comes in.
The crucial question in view of these data streams is: how can added value be drawn from the enormous amounts of data? The data lake plays a key role in solving this problem. A data lake makes it possible to store an extremely large amount and variety of data and, at the same time, to use this data effectively for data analysis (big data analytics).
What is a data lake?
A data lake can best be imagined as an oversized hard disk. Instead of storing data in folders distributed across different locations, a data lake gathers data from different sources: all data in one place. To stay with the metaphor, it is a reservoir that, like a lake, has many sources and inflows. The term itself goes back to James Dixon, founder and CTO of Pentaho. He defined the data lake as follows:
"If you think of a datamart as a store of bottled water - cleansed and packaged and structured for easy consumption - the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
In a company, this means that, for example, not every individual department creates and evaluates its own data collection, but that there is a common place where all data is stored. Data from external sources (market data, weather data, social media data, etc.) that is to be evaluated is stored there as well. However, a data lake is more than just a single large storage location for all data in a company.
Data Lake vs. Data Warehouse
The two terms data lake and data warehouse are often used together. It is often claimed that the data lake is just a new version of the data warehouse. However, the two forms of data storage basically have only one thing in common: both systems serve to store data.
Link tip: No matter where data is stored, a high level of data quality is always important for data analysis. Read here how this can be ensured.
In contrast to other forms of data storage such as relational databases or a data warehouse, data that is stored in a data lake is not specially prepared in advance. Rather, it ends up there as raw data or as unstructured data.
The essential difference becomes apparent in practice. A data lake is a centrally catalogued collection of distributed data sets. The decisive advantage is that large amounts of data can be stored in their original format, irrespective of their concrete use in individual cases. A data warehouse, by contrast, stores only prepared and structurally organised data sets for direct use in business information services.
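The difference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the lake and the warehouse are plain in-memory lists, and the names `WAREHOUSE_SCHEMA`, `load_into_warehouse`, `load_into_lake` and `read_from_lake` are hypothetical.

```python
import json

# Hypothetical warehouse schema: only these fields, with these types, are accepted.
WAREHOUSE_SCHEMA = {"sensor_id": str, "temperature_c": float}


def load_into_warehouse(table, record):
    """Warehouse style: validate and shape the record before it is stored."""
    row = {}
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"record does not match schema at field {field!r}")
        row[field] = record[field]
    table.append(row)  # fields outside the schema are dropped


def load_into_lake(lake, catalog, source, payload):
    """Lake style: store the payload unchanged and note its origin in a catalog."""
    lake.append(payload)
    catalog.append({"position": len(lake) - 1, "source": source})


def read_from_lake(lake, catalog, source):
    """Interpret raw JSON payloads of one source only when an analysis needs them."""
    return [json.loads(lake[e["position"]]) for e in catalog if e["source"] == source]


warehouse, lake, catalog = [], [], []
raw = '{"sensor_id": "s1", "temperature_c": 21.5, "firmware": "1.2"}'
load_into_warehouse(warehouse, json.loads(raw))   # keeps only the schema fields
load_into_lake(lake, catalog, "sensors", raw)     # keeps the original payload
load_into_lake(lake, catalog, "service_mail", "Subject: pump is leaking")
```

In this sketch the warehouse rejects or trims anything that does not fit its schema, while the lake accepts both the sensor reading and the free-text mail and defers interpretation to the moment of analysis.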
Structured vs. unstructured data
In contrast to structured data, unstructured data has no predefined format and no otherwise formalised structure. Examples of unstructured data that needs to be processed before it can be evaluated are text data (such as emails, customer reviews, forum posts, etc.) or image data that may be generated, for example, during manufacturing to ensure production quality.
A data lake is therefore far less restrictive when it comes to storing data and thus offers greater flexibility. All available data streams can flow into it continuously: click streams, log files, images, text data, sensor data, publicly available data such as social media posts, etc. Instead of only analysing predefined correlations, this wealth of data provides the prerequisite for advanced analytics.
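As a minimal sketch of what "processing before evaluation" can mean for text data, the following Python function derives a small structured record from a free-text customer review. The keyword list, the function name `structure_review` and the product number pattern `P-` followed by digits are assumptions made for this illustration; a real project would more likely use a trained text classifier.

```python
import re

# Hypothetical keyword list used to flag complaints.
COMPLAINT_WORDS = {"broken", "defective", "leaking", "refund"}


def structure_review(text):
    """Derive a small structured record from a free-text customer review."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    # Assumed convention: product numbers look like "P-" followed by digits.
    product = re.search(r"\bP-\d+\b", text)
    return {
        "product_id": product.group(0) if product else None,
        "word_count": len(words),
        "is_complaint": any(word in COMPLAINT_WORDS for word in words),
    }


record = structure_review("The pump P-1042 arrived broken, I want a refund.")
```

The resulting record has fixed fields and could be stored in a table, whereas the original review text stays in the data lake unchanged.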
The basic structure of a data lake
In many cases, a data lake is based on a "Hadoop cluster" or a "Hadoop Distributed File System", or HDFS for short. An HDFS cluster usually runs on commercially available hardware. This makes it particularly cost-effective, since
- commercially available hardware is inexpensive and
- the software used on it and its extensions are open source.
Another advantage of a Hadoop-based framework is that it can accommodate any number of data formats and very large volumes. However, a data lake also includes numerous other components. For the users of data lakes, easy-to-understand user interfaces are particularly important. Tools such as dashboards or interactive data visualisations provide the right overview. They are the prerequisite for ensuring that the results of data analyses are actually translated into actions.
Reading tip: Read more in this article about data visualisations and the power of the visual.
What is the benefit of a data lake?
In general, a data lake serves as a large data store (repository) and is thus at the same time a data management platform. The creation of a data lake is therefore also an ideal way to dissolve or avoid "data silos", "data graveyards" or "data swamps".
A shared repository also brings another key advantage. By making a wide variety of data from different origins easily and quickly accessible, it becomes possible to uncover latent connections that might otherwise remain hidden. For example, if customer service sees an accumulation of complaints about a certain product or function, this can become visible in an analysis in quality assurance or directly in production.
Furthermore, a data lake plays a central role in the context of an agile data strategy. Companies that want to access certain data very quickly will find an architecture in the data lake that meets their needs. In addition to speed, a data lake is characterised by the fact that highly specialised and complex questions in particular can be answered quickly. Because of these possibilities that a data lake offers, it is possible to turn data into an important production factor in companies.
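The complaint example can be sketched as a simple join between two data sets that would otherwise live in separate departments. This is an illustration only: the record layout (a `serial` field in both data sets and a `line` field in the production records) and the function name are assumptions, not part of any specific system.

```python
from collections import Counter


def complaints_per_production_line(complaints, production_records):
    """Count complaints per production line by joining on the serial number."""
    line_by_serial = {r["serial"]: r["line"] for r in production_records}
    counts = Counter()
    for complaint in complaints:
        line = line_by_serial.get(complaint["serial"])
        if line is not None:  # skip complaints without a production record
            counts[line] += 1
    return counts


# Hypothetical sample data from customer service and from production.
complaints = [{"serial": "A1"}, {"serial": "A2"}, {"serial": "B1"}, {"serial": "X9"}]
production = [
    {"serial": "A1", "line": "line-1"},
    {"serial": "A2", "line": "line-1"},
    {"serial": "B1", "line": "line-2"},
]
per_line = complaints_per_production_line(complaints, production)
```

Because both data sets are reachable from one place, a concentration of complaints on a single production line shows up in one short analysis instead of requiring an export from each department.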
In the course of over 500 data science and big data projects that we have successfully carried out, we have been able to gather a great deal of experience with data lakes. Based on this extensive experience, we offer customer-oriented, strategic consultations on the advantages of a data lake compared to data warehouses. We also offer support in the selection of suitable software frameworks and project management for the technical implementation of a data lake.
To this end, we offer our customers individual data science workshops for the development of individual data storage strategies. In addition, we accompany our clients in the implementation of a data lake upon request. The data volumes collected in a data lake enable our clients not only to improve their current data projects, but also to be optimally prepared for future developments.