Machine learning applications have been making waves for several years now as this technology conquers one industry after another. By now, both the algorithms and the quantity and quality of data available for machine learning models are mature. However, the professionalization of operations is lagging behind. Meanwhile, a new approach has been established to facilitate the productive use of machine learning: MLOps, a portmanteau of machine learning (ML) and operations (Ops). MLOps is a collection of methods and software tools that enable the operational use of AI applications. We clarify the following key aspects of MLOps:
- What does MLOps mean in detail?
- How does MLOps differ from DevOps?
- What are the challenges of operationalizing ML products that are solved by MLOps?
What is MLOps exactly?
The professional adoption of machine learning is continuously growing: Over 50 percent of companies already use machine learning in production. In parallel, companies and their data scientists, engineers, and product managers have been observing that productive operation is proving to be challenging, usually more challenging than expected. Gartner has determined that by 2022 only 54% of ML projects will have made it into production successfully. And even when productive operation succeeds, the time-to-market until this point is long: in almost half of all cases, deploying a finished model into production takes teams more than 30 days.
The answer to ineffective ML product development is MLOps. MLOps is a collection of methods and technologies to enhance the efficiency of machine learning model development as well as of operational usage of products based on those ML models. When developing an ML product, a distinction is made between 3 steps: Design, Build and Run.
Through mature frameworks for data processing (Pandas, Spark, and co.), machine learning (PyTorch, Scikit-Learn, etc.), and cloud offerings, the first two steps are manageable nowadays if the complexity of an ML product is low to medium. It is beyond that where the real challenges arise. The deployed model needs to be monitored, maintained, and optimized. And once the development of an initial ML product has been "completed", plans for the next ML product ideas are often already on the horizon. Thus, the effort beyond the development of data pipelines and machine learning is constantly increasing. Coordinating growing teams and maintaining compliance also stand in the way of successfully scaling machine learning operations.
“Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.” — D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems", 2015
MLOps methods and software are therefore solutions to reduce the technical debt caused by ML products. The most important thing to remember: MLOps methods and effective collaboration are the focus, and the choice of tech stack is secondary. The key to success is to have a precise idea of how the model should be operationalized from the very start of development (the design phase). In other words, the MLOps methodology is ideally applied from the beginning of the project.
Do you need support with your MLOps projects? From training and maturity analysis to the complete development and maintenance of MLOps products, Alexander Thamm GmbH offers its customers a wide range of services in the field of MLOps. Find out more on our services page and contact us at any time for a no-obligation consultation:
MLOps platforms help to increase team collaboration, meet regulatory and compliance requirements and reduce time-to-market. Learn more in our Deep Dive on the topic:
What challenges does MLOps solve?
If MLOps is meant to reduce technical debt, what exactly does that mean? In our understanding, there are organizational and technological obstacles that have a negative impact on productivity and costs but can be avoided. These are the recurring problems when operationalizing machine learning:
Excessive manual effort for training and maintenance
An ML model is usually based on real-world data, which is constantly changing. Therefore, ML models become outdated rapidly, and regular retraining is necessary. In a non-optimized environment, the resulting effort can tie up a significant amount of time. Let us consider a fictional example: a team of Data Scientists and Engineers spends about 20h per month maintaining a single ML model in operation. This effort is due to a low level of automation of data processing, model tuning, and deployment of new models. The ML product itself is a success, and a second ML product, including its model, is later deployed. Due to the lack of automation and synergies, the monthly effort is now at 40h. By the way, this only refers to the maintenance of those ML models in a narrow sense. These 20h/40h monthly are not spent on any improvements of those products – they only address maintaining the status quo. Thinking beyond that, merely operating a few further models would keep an entire workforce busy at some point, up to a state where no capacity is available for the development of further ML products.
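The automation gap described above can be sketched as a minimal retraining pipeline with a quality gate. Everything in this sketch is a hypothetical stand-in for the real components: the data source, the trivial majority-class "model", and the accuracy threshold are illustrative assumptions, not a specific tool's API.

```python
from datetime import datetime, timezone


def load_data():
    # Placeholder for the real data source (DB, API, file drop).
    return [{"feature": i, "label": 1 if i % 4 == 0 else 0} for i in range(100)]


def train(data):
    # Placeholder training step: a trivial majority-class "model".
    labels = [row["label"] for row in data]
    majority = max(set(labels), key=labels.count)
    return {"majority_class": majority}


def evaluate(model, data):
    # Fraction of rows the majority-class model predicts correctly.
    hits = sum(1 for row in data if row["label"] == model["majority_class"])
    return hits / len(data)


def run_pipeline(min_accuracy=0.4):
    """One automated retraining cycle: load, train, evaluate, gate.

    The quality gate prevents silently deploying a degraded model.
    """
    data = load_data()
    model = train(data)
    accuracy = evaluate(model, data)
    if accuracy < min_accuracy:
        raise RuntimeError(f"Quality gate failed: accuracy={accuracy:.2f}")
    return {
        "model": model,
        "accuracy": accuracy,
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }


result = run_pipeline()
```

Once each step is a callable unit like this, a scheduler (cron, a workflow orchestrator, or a CI pipeline) can run the whole cycle unattended, and the monthly maintenance effort no longer grows linearly with the number of models.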
It is also important to keep in mind that automation in machine learning development provides documentation, just as it does with any DevOps tooling. Even if a group of ML products can be managed by a handful of team members, will this continue to function once these members are absent or leave the project (think of the "bus factor")?
Lack of cost efficiency in ML training and operations
Machine learning can be expensive, especially when it is GPU-based. Monthly sums in the four to five figures for computing capacity used for model training and operations can be expected in larger organizations. It is therefore especially irritating when hardware resources are wasted due to an ineffective or even faulty setup, and potential savings are lost as a result. Typical root causes are GPU instances that run idle due to ineffective scaling, or redundant training of models even though neither the input data nor the parameters have changed. This problem can also occur on on-premise infrastructure, resulting in wasted computing time that could be spent better elsewhere.
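One of the root causes named above, redundant training despite unchanged inputs, can be avoided with a simple content-addressed cache: fingerprint the training data and hyperparameters, and skip training when the fingerprint has been seen before. The scheme below is an illustrative sketch, not a specific tool's API; the stand-in "training" step is an assumption for demonstration.

```python
import hashlib
import json

_training_cache = {}


def fingerprint(data, params):
    """Stable hash over training data and hyperparameters."""
    payload = json.dumps({"data": data, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def train_cached(data, params):
    """Skip training (and the GPU bill) if nothing changed since the last run."""
    key = fingerprint(data, params)
    if key in _training_cache:
        return _training_cache[key], True  # cache hit: no compute spent
    model = {"weights": sum(data), "params": params}  # stand-in for real training
    _training_cache[key] = model
    return model, False


data = [1.0, 2.0, 3.0]
params = {"lr": 0.01, "epochs": 10}
_, hit_first = train_cached(data, params)   # first call actually trains
_, hit_second = train_cached(data, params)  # identical inputs: training skipped
```

Experiment-tracking and pipeline tools apply the same idea at scale, hashing dataset versions and configurations so that unchanged pipeline stages are never re-executed.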
Lack of knowledge about how successfully an ML model performs
ML models are stochastic, not deterministic. The underlying data quickly becomes outdated (data drift), and it must be continuously monitored whether the predictions behave as expected, especially between model updates (model drift). In any case, KPIs must be defined to quantify the performance of an ML product. These can be classic statements about accuracy, such as specificity or sensitivity, if obtainable. Alternatively, click numbers, engagement rate, number of incidents, etc., could also qualify as measures.
As long as only a few models are in operation and their performance can be tracked manually, automation is not necessarily required until this point. However, as soon as the number of models in production or the deployment frequency increases, automated monitoring of performance becomes a necessity – otherwise, we circle back to problem no. 1, and the manual effort reaches an unacceptable level.
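As a minimal illustration of such automated monitoring, a drift check can compare live feature statistics against the training-time reference and raise an alert when they diverge. The z-score formulation and the threshold of 3 standard errors below are assumed example values; production setups typically use richer statistical tests per feature.

```python
import statistics


def drift_alert(reference, live, threshold=3.0):
    """Flag data drift when the live feature mean moves more than
    `threshold` standard errors away from the training-time mean."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    live_mean = statistics.mean(live)
    stderr = ref_std / (len(live) ** 0.5)
    z = abs(live_mean - ref_mean) / stderr
    return z > threshold


# Training-time distribution of one feature vs. two live batches.
reference = [float(i % 10) for i in range(1000)]     # values 0..9, uniform
stable = [float(i % 10) for i in range(100)]         # same distribution
shifted = [float(i % 10) + 5.0 for i in range(100)]  # shifted by +5
```

Running such a check on every batch of incoming data, and wiring the alert into the team's incident channel, turns drift detection from a manual chore into a background process.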
Undefined responsibilities and lack of efficient cooperation
This problem is organizational in nature. A team around a fresh ML product may start with a few people to prepare data and train ML models. For these and all subsequent processes, responsibilities must be clearly defined, or numerous questions arise, such as:
- Who takes care of operations when running the model?
- Who ensures that a model meets the requirements, both at launch and 6 months into production – assuming any requirements were clearly defined at all?
- Who verifies that the data supplied for training from a DB or API is complete and correct?
Even in the age of DevOps it happens all too often that Data Scientists dump their trained models over the fence and are hardly involved in operational aspects. In cross-functional teams, it is crucial that responsibilities are clearly defined. In addition, uniform tech stacks should be used across all ML products as much as possible, i.e., identical tools for versioning, artifact storage, delivery of the model, etc. Otherwise, each team must manage such tools, which is time-consuming.
Uncertainty concerning compliance measures and GDPR
Every company must comply with external regulations such as the European General Data Protection Regulation (GDPR) as well as internal compliance guidelines. Unfortunately, compliance and data protection officers often only become involved in the product development process at a late stage or simply never. As soon as uncertainty arises as to whether an ML product complies with all relevant regulations, usage must be paused in the worst case, even if it turns out that no violation has occurred. If a product must be taken out of service until compliance with legal regulations has been verified, losses in sales and productivity can be enormous. It must be clearly stated here that technology by itself cannot overcome this challenge. Only a human can assess whether an ML product meets the requirements of the GDPR or industry-specific regulations. However, a unified tech stack enables a faster evaluation of compliance compared to 10 individual solutions for 10 ML products. If a uniform ML infrastructure is in place and officers for compliance and data protection are involved at an early stage, decision-making processes are shortened, and compliance risks are minimized.
Organizations that intend to run multiple ML products should apply the MLOps philosophy as early as possible, preferably from the start. Implementing MLOps after the fact, while feasible, is more complex. But whether from the beginning or later: 97 percent of organizations that apply MLOps achieve significant improvements. Meanwhile, regulatory hurdles are likely to increase sharply soon. In 2023 or 2024, the EU-wide "AI Regulation" is expected to come into force. By then, compliance will no longer be just an abstract requirement but a concrete legal basis that every commercially used ML product must adhere to.
But how is MLOps implemented in practice? How teams implement MLOps in a meaningful way, which tools are needed for this, and which are not – you will find out soon in the second part of our MLOps special.
Terms: DevOps vs. DataOps vs. MLOps
- DevOps: Describes the convergence of development (Dev) and operations (Ops) using practices that reduce the effort to apply changes to the software to production. In practice, DevOps uses techniques such as containers, infrastructure-as-code, and continuous integration (CI). DevOps entails general methods for efficient operations and is not specifically tailored to machine learning, AI, or big data processing.
- DataOps: DataOps includes all methods of DevOps, extended by methods addressing the efficient processing and integration of (big) amounts of data. This typically includes the implementation of data pipelines, monitoring of those pipelines, and incident handling. The term "DataOps" is used less frequently, since by now "MLOps" is established to describe the operational aspects of machine learning, including the data processing it requires, as a whole.
- MLOps: A collection of methods and tools for the efficient creation and operation of machine learning models and products based on them. MLOps thus includes all aspects of DevOps and DataOps and is therefore the most complex of these activities.
In addition, terms such as "ModelOps" or "AIOps" can be found on the web and occasionally in the literature. They can be regarded as synonyms for MLOps.
https://cdn2.hubspot.net/hubfs/2631050/0284%20CDAO%20FS/Algorithmia_2020_State_of_Enterprise_ML.pdf, p. 12
https://pages.barc.de/hubfs/Marketing/Reports/Report_Driving-Innovation-with-AI.pdf, p. 10
 Sculley, D. et al. (2015). Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems