From prototype to data product: Go live with these 3 best practices

By [at] EDITORIAL | 21 March 2019 | Basics

From the idea - for example, increasing online sales of products - to the finished data product - for example, an operational product recommendation system on the website - a data science project must pass through numerous steps. Many data projects currently fail during the deployment phase, in which the result of the prototype phase is transferred into an operational data product and integrated into the respective business processes. This is one of the most important phases of the data science life cycle, because it is here that it is decided whether a project can actually generate added value from data. Several challenges can cause a project to fail in this critical phase.

These challenges include data-related issues such as poor data quality, data protection aspects or a lack of data availability. In addition, a lack of skills, hurdles in the cooperation between the business department, data science and IT, as well as a complex technology landscape make the decisive step from prototype to data product difficult. Based on our experience from over 500 data projects, we have identified three best practices that contribute to success in the deployment phase.

1) Take a data engineer on board - from the start

In recent years, the data scientist has been celebrated as the "Sexiest Job of the 21st Century". While data scientists are indispensable to the success of a data science project, the role of the data engineer is equally important - and even more so in the deployment phase.

Nevertheless, data engineering does not receive as much attention as data science. A data scientist is typically involved in the development and prototyping phase, for example to develop machine learning algorithms and statistical models. The real bottleneck of many data science projects, however, is transferring these models into a stable and scalable data product - one of the main tasks of a data or machine learning engineer.

Transfer of a prototype into a scalable data product

This transition is anything but trivial: during the development and prototyping phase of a data science project, the focus is on selecting the right learning method and on rapid experimentation on the way to a proof of concept. In the deployment phase, by contrast, the data science project becomes a software development project. While in a data science project many small changes, such as correcting errors in the data, can be made manually, this is impossible in a scalable data product.
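Such manual fixes can instead be expressed as explicit, repeatable pipeline steps. The following Python sketch illustrates the idea - the record fields and cleaning rules are assumptions for the sake of the example, not part of any particular project:

```python
def clean_orders(records):
    """Apply fixes that were once done by hand as explicit pipeline steps.

    Returns (cleaned, rejected): records that pass the rules, and records
    that a human previously had to inspect and delete case by case.
    """
    cleaned, rejected = [], []
    for rec in records:
        rec = dict(rec)  # never mutate the caller's data
        # Normalise a typical manual fix: stray whitespace in identifiers.
        rec["product_id"] = rec.get("product_id", "").strip()
        # Reject rows that would previously have been removed manually.
        if not rec["product_id"] or rec.get("price", 0) < 0:
            rejected.append(rec)
        else:
            cleaned.append(rec)
    return cleaned, rejected
```

Because the rules are code rather than ad-hoc interventions, they run identically on every batch, can be unit-tested, and leave an auditable trail of rejected records.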

These challenges in the deployment phase can be addressed by involving data and machine learning engineers early on, ideally ones with experience in software development and deployment. Close collaboration between data scientists and engineers in the early stages of a project avoids disproportionate technical debt and facilitates the go-live.

2) Rely on the cloud

The use of the public cloud is becoming mainstream, and not only in the field of data science. A prominent example is Netflix, which completed its switch to Amazon Web Services (AWS) back in 2016. We consider this decision groundbreaking, because the trend towards cloud solutions is a significant development for the productive use of data science and machine learning. The reason is simple: Infrastructure as a Service (IaaS) solutions offer the flexibility that data science projects inherently need, since requirements and framework conditions naturally change enormously over the course of a project's life cycle.

In the early phases, data projects require an exploratory approach characterised by rapid iterations and frequent experimentation. This usually results in load peaks for training machine learning models and/or a need for specialised hardware such as GPUs (Graphics Processing Units). Depending on the use case, productive operation has very diverse requirements. Covering both phases (development and operation) via a homogeneous and inflexible hardware landscape, as found in many on-premises infrastructures, often leads to a discrepancy between requirements and the technologies used.

The cloud for fluctuating storage and computing capacity requirements

The cloud, on the other hand, offers scalable, flexible storage and computing solutions that adapt seamlessly to the fluctuating requirements of a typical data science life cycle. In addition, cloud providers have for some time been offering specialised infrastructure for machine learning methods such as deep learning. Some hardware requirements can therefore only be met via the cloud, which in some cases significantly increases development speed.

Find out why you should rely on the cloud for your projects in our article on the topic "4 reasons why companies should rely on cloud technologies".

Another key challenge in the delivery of data science projects is to ensure that the development environment corresponds as closely as possible to the production environment. Cloud computing facilitates this through modern paradigms such as Infrastructure as Code. In addition, flexible on-demand provisioning of infrastructure enables hardware and software to be fine-tuned to the requirements of a project or even a single task. Such cost-efficient "right-sizing" of infrastructure is difficult to realise outside the public cloud.
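One simple way to make drift between development and production environments visible is to check installed package versions against a pinned specification at start-up. This Python sketch uses only the standard library; the helper name and the pinned packages are illustrative assumptions:

```python
from importlib import metadata


def environment_drift(pinned):
    """Compare installed package versions against a pinned specification.

    `pinned` maps package name to the expected version string. Returns a
    dict mapping each mismatched package to (pinned, installed), where
    `installed` is None when the package is missing entirely.
    """
    drift = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            drift[name] = (wanted, installed)
    return drift

# A pipeline can then refuse to start when the environments diverge, e.g.:
#   assert not environment_drift(pinned_spec), "dev/prod drift detected"
```

Infrastructure as Code tools take the same idea further by pinning the entire machine image and its provisioning, rather than individual packages.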

3) Embed your data product in an overarching data strategy

Lastly, it is helpful to set aside for a moment all the technical intricacies and details of the delivery process. The central concern of a data-driven organisation is a clear vision and strategy for creating value from data. Managers must therefore focus on obtaining the right data to achieve their strategic goals. To benefit from data science projects and artificial intelligence in the long term, it is important to invest strategically in data collection. Every data product must therefore be viewed from the perspective of the data strategy and embedded in it.

For example, a leading manufacturer of household appliances has made it a priority to develop a real-time data pipeline to collect the data produced by its more than 1.5 million connected devices worldwide. In addition, the company invested heavily in modern data infrastructure, both on-premises and in the cloud. These investments can now be leveraged across a variety of advanced use cases.

The data product in the overall context

In our experience, many data science projects fail because the existing data basis does not support them. In addition, the increased reporting on the potential of AI creates false expectations. Of course, we are firmly convinced that companies must generate value from their data. But this only succeeds if the digitisation of companies precedes the data science phase.

It makes little sense to implement advanced use cases from the start if the data basis is lacking (loosely based on the motto: "you have to learn how to make fire before you can shoot a rocket to the moon"). Especially at the beginning, it is therefore important to focus on simple use cases that produce results quickly, while working on a long-term data strategy.

As data-driven companies also need to evolve as organisations - be it in terms of corporate culture or the development of data skills and know-how - it is important to always keep employee motivation in mind. This gradually creates the basis for advanced AI applications.

<a href="https://www.alexanderthamm.com/en/blog/author/at-redaktion/" target="_self">[at] EDITORIAL</a>


Our AT editorial team consists of various employees who prepare the corresponding blog articles with the greatest care and to the best of their knowledge and belief. Our experts from the respective fields regularly provide you with current contributions from the data science and AI sector. We hope you enjoy reading.
