In the previous post, we saw how important it is to engage with MLOps early on. MLOps platforms help to reduce manual steps, improve team collaboration, meet regulatory and compliance requirements and shorten time-to-market.
Given the complexity of the topic and the large number of tools, the question arises: where to start? Not every tool is equally important, and some only become relevant once a team reaches a certain size. The following process model is intended to provide orientation. It offers four maturity levels: Prerequisites, Foundation, Scale and Mature. The basics of software engineering and data engineering are given the highest priority, which gives teams with little ML experience an easy start. With each level, increasingly complex ML-based products can be built and, above all, operated.
Challenges of local ML models
The simplest way to run an ML model is to train it locally and store it in an object storage. A (recurring) job picks up the model and applies it to a data set. The results are stored in a database or as a file. This approach may be fast, but it carries many risks and does not scale. Running the training and data preparation locally not only risks a data protection breach in many cases, it also makes teamwork much more difficult. The training runs are not visible to team members, who therefore cannot access the details of a training run or continue the development. The result is time-consuming redevelopment. Faulty predictions cannot be traced either, because without versioning it is impossible to determine which model version was used and when. Audits[1] of models with regard to the data used or to legal and ethical issues are thus impossible. There is also a high risk of using faulty models, as tests cannot be applied automatically. In addition, training is limited by the computing power of the local system, which is not sufficient for some use cases.
The basics of ML operationalisation
- Python scripts instead of notebooks
In the first expansion stage, the basics of modern software engineering and data engineering are implemented: versioning, tests and deployment pipelines. Versioning code makes changes traceable and lets several developers work in parallel. Models are versioned in the same way as code. Model training is no longer run interactively in a Jupyter Notebook or as an ad hoc script on a developer's machine, but as a dedicated training job. The code for the training job is versioned, and CI/CD pipelines create and trigger the job. The CI/CD pipelines also make it possible to run automated tests. These tests catch errors before the model training starts, so that the model and the training job can be developed further more quickly. The trained model is stored and versioned in a model registry. The ML service for the prediction is also deployed via a CI/CD pipeline together with the model. Releasing a new model is still a manual step, i.e. a developer checks the quality of the new model using performance metrics. The ML service is usually either an API that delivers a prediction on request or a batch job that calculates predictions at regular intervals. Here, too, the model can be tested, for example with plausibility checks.
The focus on using established tools such as CI/CD pipelines and tests makes it possible to quickly achieve a higher level of automation and reliability.
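As an illustration of a test that a CI/CD pipeline could run before the training job starts, here is a minimal sketch in plain Python. The function name `validate_training_rows`, the column names and the rules themselves are assumptions made for this example, not part of any specific tool.

```python
from typing import Any


def validate_training_rows(
    rows: list[dict[str, Any]],
    required_columns: list[str],
    label_column: str,
) -> list[str]:
    """Return a list of problems; an empty list means the data passed."""
    if not rows:
        return ["training set is empty"]
    errors = []
    for i, row in enumerate(rows):
        # A column counts as missing if it is absent or explicitly None.
        missing = [c for c in required_columns if row.get(c) is None]
        if missing:
            errors.append(f"row {i}: missing values for {missing}")
    # A classifier cannot learn anything from a single class.
    labels = {row.get(label_column) for row in rows} - {None}
    if len(labels) < 2:
        errors.append("label column contains fewer than two classes")
    return errors


def test_training_data_is_valid():
    # Hypothetical sample; a real test would load a slice of the training data.
    rows = [
        {"age": 34, "income": 52000, "label": 1},
        {"age": 51, "income": 61000, "label": 0},
    ]
    assert validate_training_rows(rows, ["age", "income"], "label") == []
```

A test runner such as pytest would pick up `test_training_data_is_valid` automatically, so the pipeline can stop before any compute is spent on a training run that is bound to fail.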
Basis of an MLOps platform
- Data Catalog
- ML Pipelines
- Experiment Tracking
In the second stage of expansion, the model training should be made reusable and further automated through the use of pipelines. In addition, data management should be set up at an early stage in order to maintain an overview of different data sources.
Pipelines orchestrate all steps relevant for model training and decouple these steps from one another. Previously complex ML jobs can be divided into smaller steps and combined into pipelines. The steps are thus easier to understand and maintain, and they can be exchanged. As a result, they can be reused across different pipelines or easily shared via templates. Each step can also be executed with dedicated computing resources such as GPUs or additional RAM. In this way, only the resources that are actually needed are provisioned, which reduces costs. Conversely, training can be scaled by making more resources available. The steps are also built via a CI/CD pipeline, where they can be tested automatically. The tests can also cover regulatory requirements. For example, a test could detect personal data in the training data set.
In addition, metadata about the model training, such as model configuration and performance, is now captured. This is known as experiment tracking. The metadata is recorded during training and stored in a metadata store. This allows team members to see at any time which models have already been trained and how they were evaluated.
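To make the idea of a metadata store concrete, here is a minimal sketch that appends one JSON record per training run to a file. The functions `log_run`, `load_runs` and `best_run`, as well as the file layout, are assumptions for illustration; dedicated experiment tracking tools offer far more than this.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional


def log_run(store: Path, params: dict[str, Any], metrics: dict[str, float]) -> dict:
    """Append one training run (configuration and metrics) to the store."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with store.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(store: Path) -> list[dict]:
    """Read all recorded runs; an absent store simply means no runs yet."""
    if not store.exists():
        return []
    with store.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(store: Path, metric: str) -> Optional[dict]:
    """Return the run with the highest value for the given metric, if any."""
    runs = [r for r in load_runs(store) if metric in r["metrics"]]
    return max(runs, key=lambda r: r["metrics"][metric]) if runs else None
```

A training job would call `log_run(Path("runs.jsonl"), {"max_depth": 5}, {"accuracy": 0.91})` at the end of each run, and any team member could then use `best_run` to see which configuration has performed best so far.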
Data management is done via a data catalogue. No dedicated tool needs to be used for this. An overview in a spreadsheet or in Confluence is often sufficient. It is crucial to create this at an early stage, as this also strengthens reusability. With regard to compliance requirements, it may also be necessary to document which data is used by ML products. By describing data sources, including their business context in particular, developers can quickly assess whether the data is suitable for a model. This prevents the need to re-evaluate data over and over again.
Focus on scaling
- Automated Deployment
The third expansion stage is intended to make it easier for teams to operate several products. To achieve this, manual interventions in the system must be reduced. For this purpose, model deployment now takes place automatically. Each trained model is compared with the model currently in production, the so-called champion model. If a newly trained model is better than the champion model, it automatically becomes the new champion model and is deployed. It is important to further expand the automated tests. Tests for fairness, such as evaluating model performance for specific user groups, should also be carried out. This can prevent models from reaching production that beat the previous champion model overall but have a negative impact on customer groups that need special consideration. In view of the EU AI Act, such tests could become mandatory.
In addition, the expansion of monitoring is crucial. The performance of a model should be monitored at all times. For this purpose, all predictions are recorded together with the input data. The quality of the predictions can then be determined by comparing them with the actual values. The effort required to obtain the actual values depends strongly on the use case. For time series forecasts, the actual values become available automatically as time passes, whereas for image classification they have to be labelled manually. Depending on the number of predictions, it may only be feasible to determine the actual values for a sample. Either way, this makes it possible to monitor the performance of the model over time. In addition, it is advisable to perform simple plausibility checks. If a probability is predicted, it must lie between 0 and 1. In case of anomalies, an alert is triggered to notify developers. The better the tests and alerts, the faster developers can correct errors.
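The comparison with the champion model, including a simple fairness guard, could look like the following sketch. Accuracy as the metric, the function names and the `max_group_drop` tolerance are assumptions chosen for illustration; the appropriate metric and fairness criterion depend on the use case.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_by_group(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> dict[Hashable, float]:
    """Compute the accuracy separately for each user group."""
    result = {}
    for group in set(groups):
        idx = [i for i, g in enumerate(groups) if g == group]
        result[group] = accuracy([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return result


def should_promote(
    y_true: Sequence[int],
    champion_pred: Sequence[int],
    challenger_pred: Sequence[int],
    groups: Sequence[Hashable],
    max_group_drop: float = 0.0,
) -> bool:
    """Promote only if the challenger wins overall and no group gets worse."""
    if accuracy(y_true, challenger_pred) <= accuracy(y_true, champion_pred):
        return False
    champion = accuracy_by_group(y_true, champion_pred, groups)
    challenger = accuracy_by_group(y_true, challenger_pred, groups)
    return all(challenger[g] >= champion[g] - max_group_drop for g in champion)


# Hypothetical evaluation set with two user groups.
y_true = [1, 1, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b"]
champion_pred = [1, 1, 0, 0, 0, 0]
unfair_challenger = [1, 1, 1, 1, 1, 0]  # better overall, worse for group "b"
fair_challenger = [1, 1, 1, 0, 0, 0]    # better overall, no group loses
```

With this data, `should_promote` rejects `unfair_challenger` even though its overall accuracy is higher than the champion's, and accepts `fair_challenger`.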
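The monitoring described here can be sketched as follows: predictions are recorded with their inputs, actual values are attached later where available, and both a plausibility check and a performance check produce alerts. The record structure, the 0.5 decision threshold and the `min_accuracy` limit are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictionRecord:
    features: dict
    probability: float
    # Filled in later, possibly only for a sample of the predictions.
    actual: Optional[int] = None


def plausibility_alerts(records: list[PredictionRecord]) -> list[str]:
    """A predicted probability must lie between 0 and 1."""
    return [
        f"record {i}: probability {r.probability} outside [0, 1]"
        for i, r in enumerate(records)
        if not 0.0 <= r.probability <= 1.0
    ]


def observed_accuracy(
    records: list[PredictionRecord], threshold: float = 0.5
) -> Optional[float]:
    """Accuracy over the records for which an actual value is known."""
    labelled = [r for r in records if r.actual is not None]
    if not labelled:
        return None
    hits = sum((r.probability >= threshold) == bool(r.actual) for r in labelled)
    return hits / len(labelled)


def check_model(records: list[PredictionRecord], min_accuracy: float = 0.8) -> list[str]:
    """Collect all alerts that should be sent to the developers."""
    alerts = plausibility_alerts(records)
    acc = observed_accuracy(records)
    if acc is not None and acc < min_accuracy:
        alerts.append(f"accuracy {acc:.2f} below threshold {min_accuracy:.2f}")
    return alerts
```

Running `check_model` on a schedule and forwarding its return value to a notification channel would be one simple way to connect these checks to an alerting system.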
The modular principle as Best Practice
- Easier reusability
- Greater automation
- More tests
The last expansion stage further strengthens the automation and reusability of the platform. Packages are used for the pipelines instead of templates. Each step is represented by a package that can be installed via pip, for example. This makes it easy to carry out updates and keep pipelines up to date over a long period of time. A feature store can also help to further increase reusability. In this way, features for model training are shared across teams.
In addition to further tests for the control of the model in production, the handling of alerts is more automated. Instead of manually triggering a new model training after an alert, the training is now started automatically.
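Automatically starting a training run after an alert can be sketched as a small handler. The class `RetrainingTrigger` and its parameters are hypothetical, and the minimum interval between two runs is an assumption added here so that a burst of alerts does not start many training jobs at once.

```python
import time
from typing import Callable, Optional


class RetrainingTrigger:
    """Start a training run on an alert, at most once per interval."""

    def __init__(
        self,
        start_training: Callable[[str], None],
        min_interval_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._start_training = start_training
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._last_started: Optional[float] = None

    def on_alert(self, alert: str) -> bool:
        """Return True if a training run was started for this alert."""
        now = self._clock()
        if (
            self._last_started is not None
            and now - self._last_started < self._min_interval
        ):
            return False  # a run was started recently; ignore this alert
        self._last_started = now
        self._start_training(alert)
        return True
```

In practice, `start_training` would be a function that triggers the training pipeline, and the retrained model would still have to pass the automated tests and the champion comparison before going into production.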
For the success of ML products, it is crucial to think about operationalisation from the beginning. This process model is intended to provide an easy entry point to an advanced ML infrastructure. In view of the upcoming EU AI Act, such an infrastructure is necessary. Only with it can the automated checks and tests be carried out that are likely to become mandatory for high-risk systems[2]. Without these capabilities to monitor models and control their results, the use of machine learning will no longer be permitted in critical areas.
If you want to make your machine learning use cases fit for regulatory requirements or bring them into production faster, we will be happy to support you. We conduct gap analyses with you, develop a plan based on your needs to close any gaps and support you during implementation and maintenance.
[1] Auditing models gives an organisation the possibility to systematically check the models in use for risks. The CRISP-DM framework is suitable as a guide. Legal and ethical risks in particular should be carefully examined.
[2] The EU AI Act divides AI use cases into four risk levels: unacceptable risk, high risk, limited risk and minimal risk. High-risk systems are subject to strict requirements before they can be brought to market. The fully automated checking of credit applications, among other things, falls under high risk.