Robotics in care and industry

  • Author: [at] Editorial Team
  • Category: Deep Dive

    What do a machine hall and a nursing home have in common? The answer: both areas benefit from a solution that detects puddles on the floor and can thus prevent accidents.

    In this article, we present a research project in which we are working on this task. You will gain insight into our role as a data science consultancy, the challenges we faced during the project, important decisions, and ultimately the result.

    Corporate goals and the role of research projects at [at]

    Consulting is tailored to the specific needs of customers. Regular projects include applying existing state-of-the-art technologies to use cases or introducing customers to the mindset of a digital company. But is that all there is to consulting? Fortunately, at [at], we have the opportunity to support AI research in several innovative projects. One of our current projects is called S³.

    S³ – short for “Safety Sensors for Service Robots in Production Logistics and Stationary Care” – is a collaborative project bringing together partners from the fields of logistics, healthcare, and robotics, as well as universities, to take a step forward in applied research. The project is funded by the Federal Ministry of Education and Research, and we are working with the following partners:

    • Institute for Conveying Technology and Logistics (IFT), University of Stuttgart
    • Fraunhofer Institute for Manufacturing Engineering and Automation (IPA), Department of Robot and Assistant Systems
    • Pilz GmbH
    • Bruderhaus Diakonie

    The long-term goal of S³ is to advance robot support in industrial contexts and in healthcare facilities. Given this broad field of research, what role does [at] play as a consultant in this collaboration? In our experience, collaboration works smoothly when each participant has their own area of responsibility. For us, this means implementing machine learning use cases:

    1. Automatic detection of spilled liquids on the floor (for industry and healthcare)
    2. Fill level detection in glasses and bottles (for healthcare)
    3. Detection of anomalies in objects and people (for industry and healthcare)

    Advantages and challenges of detecting spilled liquids

    At a high level, the goal of the use case “Spillage detection” is to design a model that automatically detects industrial liquids on the floor and spilled liquids in healthcare facilities (e.g., nursing homes). In the future, robots will be equipped with this technology, enabling them to locate puddles on the floor and prevent accidents by alerting maintenance personnel. In addition, the robot itself should not be hindered by the liquid.

    Spilled liquids on the floor are a potential hazard for autonomous systems and humans. Electronics can fail and people can slip and injure themselves. In addition, the liquid on the floor may contain chemical, biological, or other hazardous components that should not be spread further. Robots currently used in industry and healthcare typically only detect an anomaly on the floor, stop, and wait for a human to deal with the situation. Therefore, a model that can accurately detect the puddle, send an alarm, and safely navigate around it can save a lot of time, reduce potential incidents, and make operations run more smoothly. Nevertheless, the requirement to detect spilled liquids in different indoor environments poses the challenge of developing a model that has a wide range of applications.

    The field of object detection in images using ML techniques is very well developed. What makes our use case so challenging? Why can't pre-trained models solve this problem out of the box? Let's start with the general difficulty. Spilled liquids vary in shape and color, and their texture depends heavily on the environment. In addition, the reflection of the puddle surface makes it difficult for out-of-the-box object recognition algorithms to learn correctly. An even more serious (and widespread in ML research) challenge is the fact that there are currently no labeled datasets for indoor liquid detection.

    Data: A starting point for ML projects

    The investigation of state-of-the-art approaches for our use case points directly to datasets for autonomous driving, as self-driving cars also need to detect liquids on the ground.

    The Puddle-1000 dataset and the model developed by Australian researchers (https://github.com/Cow911/SingleImageWaterHazardDetectionWithRAU) have proven that image segmentation approaches are promising for detecting puddles outdoors. The model used in this project was based on Reflection Attention Units with a TensorFlow setup in the background. As a first approach, we implemented a dynamic UNET from fast.ai on the Puddle-100 dataset. In addition to fast results, a valuable outcome of this approach was a better feel and understanding of the nature of puddles: The environment (e.g., through reflection), the perspective from which the images were taken, and, above all, the effect of light have a major influence on puddles. Since all these characteristics vary enormously in outdoor and indoor scenarios, we decided that a new, self-produced dataset adapted to our indoor use case was needed.

    Once we had made this decision, many questions arose: What data do we need? How do we collect it? What metadata is valuable? As a starting point, we created a list of important influencing factors that should be taken into account when recording video:

    • Light (electric vs. natural, bright vs. dark, shadows, position of the light source)
    • Background of the indoor scene (color and texture of the floor, natural reflections on dry floors, number and movement of objects in the scene)
    • Size/amount of spilled water (no spillage, small spot vs. entire floor, one spot vs. many spots)
    • Camera orientation (horizontal, looking up vs. looking down)
    • Movement in the scene (standing vs. moving, change of direction)
    • Type of puddle (water, coffee, colored water to imitate hazardous liquids)

    We therefore decided to create videos that mimic the movements of robots and cover medical and industrial backgrounds in equal measure. The result was 51 videos, each 30-60 seconds long, together with an Excel spreadsheet that summarizes the metadata of each video.
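    To turn these recordings into training material, the videos had to be broken down into individual frames and linked back to the metadata sheet. The snippet below is a minimal sketch of how this can be done with OpenCV and pandas; the folder names, the Excel file name, the sampling rate, and the assumption that the metadata sheet contains a 'video' column are all illustrative and not the project's actual setup.

        import cv2                               # OpenCV for reading video frames
        import pandas as pd
        from pathlib import Path

        VIDEO_DIR = Path("videos")               # hypothetical folder with the 51 recordings
        META_XLSX = Path("video_metadata.xlsx")  # hypothetical name of the metadata spreadsheet
        FRAME_DIR = Path("frames")
        FRAME_DIR.mkdir(exist_ok=True)

        meta = pd.read_excel(META_XLSX)          # assumed: one row of metadata per video, keyed by 'video'

        def extract_frames(video_path, every_n=15):
            """Save every n-th frame of a video and return the written file paths."""
            cap = cv2.VideoCapture(str(video_path))
            written, idx = [], 0
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                if idx % every_n == 0:
                    out = FRAME_DIR / f"{video_path.stem}_{idx:05d}.png"
                    cv2.imwrite(str(out), frame)
                    written.append(out)
                idx += 1
            cap.release()
            return written

        # Build a frame-level table that keeps the link to the per-video metadata.
        rows = []
        for video in sorted(VIDEO_DIR.glob("*.mp4")):
            for frame_path in extract_frames(video):
                rows.append({"video": video.stem, "frame": str(frame_path)})

        frames = pd.DataFrame(rows).merge(meta, on="video", how="left")
        frames.to_csv("frames_with_metadata.csv", index=False)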

    Labeling the data: Time-consuming but essential

    Now that we had our own dataset, the next big task for supervised learning was labeling the data! For segmentation tasks, this is the most important and time-consuming part of making a dataset valuable.

    Since the requirement in our segmentation use case is to detect precise boundaries of the puddle, we had to draw highly complex polygons in each image. Due to the estimated high effort involved in labeling, we saw the need for a suitable platform where many people (not necessarily with deep domain knowledge, e.g., working students) could work together.

    The result was a largely automated labeling setup built on CVAT (Computer Vision Annotation Tool), connected to a cloud server and cloud storage. One advantage of this solution is that it provides an easy-to-use web interface that runs in the browser for all users. In addition, it enables easy automation of tasks through its REST API. Learn more about this implementation in one of our future blog posts on best practices in labeling!
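    To give an idea of what this automation can look like, here is a minimal sketch that talks to a CVAT instance over its REST API using the requests library. The server address and credentials are placeholders, and the exact endpoint paths vary between CVAT versions (older releases prefix them with /api/v1/), so this should be checked against the API documentation of the instance in use.

        import requests

        CVAT_URL = "https://cvat.example.com"     # placeholder address of the CVAT instance

        # Authenticate and obtain an API token (endpoint paths vary between CVAT versions).
        auth = requests.post(
            f"{CVAT_URL}/api/auth/login",
            json={"username": "annotator", "password": "secret"},
        )
        auth.raise_for_status()
        headers = {"Authorization": f"Token {auth.json()['key']}"}

        # List all annotation tasks, e.g. to check the labeling progress per annotator.
        tasks = requests.get(f"{CVAT_URL}/api/tasks", headers=headers)
        tasks.raise_for_status()
        for task in tasks.json()["results"]:
            print(task["id"], task["name"], task["status"])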

    How to find the right model

    The first step before we start ML coding is to set up a development environment that enables collaborative work in rapid iterations and lays the foundation for a scalable and reproducible solution. For simplicity, we decided to develop in Jupyter Notebooks hosted on an on-premise GPU cluster with four GPU nodes, each with ~12 GB of memory. The on-premise solution had two major advantages for us: lower costs compared to cloud pricing and time savings when onboarding new employees to the project. To ensure smooth teamwork, we introduced coding guidelines. These stipulate that every production-ready function is stored in a Python module with Git tracking and that a shared environment file is kept up to date to resolve conflicting package versions.

    Moving on to model development, we will focus on two main aspects: model selection and model tuning (there are many more steps to consider, but they are beyond the scope of this blog post).

    Since the fast.ai approach gave us fast and good results on the Puddle-100 dataset, we decided to start from it on our own dataset and implement some further refinements. We would like to present some important challenges and our results here:

    • Thinking about a meaningful train-test split is important in order to be able to trust the results of model training. We decided to distribute the backgrounds evenly across the train and test splits and also applied a function (similar to the blocking factor in the R package mlr) that distributes the images extracted from a video evenly.
    • For loss functions for segmentation tasks, we recommend focal loss and dice loss. Interestingly, the two best-known loss functions – Cross-Entropy Loss and Binary Cross-Entropy Loss – gave us a scattered result (i.e., the puddle was detected, but not as a coherent shape). A minimal sketch of these losses and of a dice/IoU-style metric follows this list.
    • The right choice of metrics is crucial when it comes to evaluation. We defined various metric functions, such as dice_iou, accuracy, recall, negative_predicted_value, specificity, and f1_score. All metrics are implemented on a pixel and image basis, e.g., a proportion of detected puddle pixels that were actually puddles vs. a proportion of predicted puddle images that were actually puddle images. For us, it made the most sense to take a closer look at the dice_iou metric.
    • In addition to numerical measurement, we found it helpful to have a less automated but more customized validation function. Here, we found that comparing ground truth and predicted masks, grouped by metadata (e.g., background, industrial or non-industrial, spilled or not spilled), can be particularly useful for classifying the performance of our training.
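    To make the loss and metric bullets above more concrete, here is a minimal PyTorch sketch of a soft dice loss, a focal loss, and a pixel-based dice/IoU-style metric for the binary background-vs-puddle case. The tensor shapes (logits of shape (batch, 2, height, width), integer masks of shape (batch, height, width)) and all names are assumptions for illustration; the implementations used in the project may differ in detail.

        import torch
        import torch.nn.functional as F

        def dice_loss(logits, target, eps=1e-6):
            """Soft dice loss for binary segmentation (class 1 = puddle)."""
            probs = F.softmax(logits, dim=1)[:, 1]              # puddle probability per pixel
            target = target.float()
            inter = (probs * target).sum(dim=(1, 2))
            union = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
            return 1 - ((2 * inter + eps) / (union + eps)).mean()

        def focal_loss(logits, target, gamma=2.0):
            """Focal loss: cross entropy down-weighted for easy, well-classified pixels."""
            ce = F.cross_entropy(logits, target.long(), reduction="none")  # shape (B, H, W)
            pt = torch.exp(-ce)                                            # probability of the true class
            return ((1 - pt) ** gamma * ce).mean()

        def dice_iou(logits, target, eps=1e-6):
            """Pixel-based overlap of the predicted puddle mask with the ground-truth mask."""
            pred = logits.argmax(dim=1)
            inter = ((pred == 1) & (target == 1)).sum().float()
            union = ((pred == 1) | (target == 1)).sum().float()
            return (inter + eps) / (union + eps)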

    One of our biggest takeaways from working with the fast.ai package was that it is an easy-to-use out-of-the-box approach, but when it comes to customization and modification, it can get complicated pretty quickly.

    When we were ready to train the model, important decisions had to be made about the model architecture. We played around with different settings but quickly settled on a U-Net architecture built on a pre-trained resnet18 encoder. U-Nets have become popular for image segmentation tasks in recent years. The idea comes from the 2015 paper “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Olaf Ronneberger, Philipp Fischer, and Thomas Brox. In a nutshell, the architecture consists of a contracting (downsampling) path that captures context and an expansive (upsampling) path that restores the original input size; skip connections between the two paths enable precise localization of objects in the image.
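    In fastai, such a dynamic U-Net on top of a pre-trained resnet18 encoder can be set up in a few lines. The following sketch assumes a recent fastai version and a hypothetical folder layout with matching image and mask file names; it is meant to illustrate the idea rather than reproduce our exact training code.

        from fastai.vision.all import *

        path = Path('data/spill')                    # hypothetical dataset location
        codes = ['background', 'puddle']

        def label_func(fn):
            # hypothetical convention: the mask has the same file name in the masks/ folder
            return path/'masks'/fn.name

        dls = SegmentationDataLoaders.from_label_func(
            path, bs=8,
            fnames=get_image_files(path/'images'),
            label_func=label_func,
            codes=codes,
            item_tfms=Resize(256),
        )

        # Dynamic U-Net built on a pre-trained resnet18 encoder
        learn = unet_learner(dls, resnet18, metrics=Dice())
        learn.fine_tune(10)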

    Model tuning

    Our next step was to tune the hyperparameters. This task consists of two parts: testing the data augmentation and the model parameters. The biggest challenge here was to find an automated way to set the learning rate parameter correctly. In some samples of learning curves, we discovered many different shapes and a large influence of the choice of learning rate on prediction accuracy. Therefore, we implemented three options to find an “optimal” learning rate and included it as a parameter in the tuning:

    1. Minimal Gradient: Select the learning rate that has the minimum gradient. This value can be taken directly from the lr_find function from fast.ai.
    2. Minimal Loss Shifted: The learning rate is determined as follows: Find the minimum value and shift it one tenth to the left. This approach is based on a rule of thumb that goes back to Jeremy Howard (co-founder of fast.ai). This value can be extracted (with some additional effort) from the lr_find function from fast.ai.
    3. Appropriate Learning Rate: Since the direct methods for determining the learning rate mentioned above do not work correctly in some cases, we introduced a third approach based on a discussion in a fast.ai forum. The idea is to slide a window leftwards from the right edge of the learning curve until a termination condition is met, in order to obtain a learning rate with a minimal loss gradient before the loss increases sharply. Note that with this approach, a separate threshold must be defined in advance for each loss function used. A sketch of all three strategies follows this list.
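    The sketch below illustrates how these three strategies can be implemented on top of the learning-rate/loss curve recorded by fast.ai's lr_find (e.g., learn.recorder.lrs and learn.recorder.losses; attribute names differ between fastai versions). The option names other than 'min_grad', the window size, and the threshold are placeholders, and the 'appropriate' variant is only an approximation of the forum heuristic described above.

        import numpy as np

        def pick_learning_rate(lrs, losses, kind="min_grad", threshold=0.05):
            """Pick a learning rate from an lr_find curve (lrs and losses as 1-D arrays)."""
            lrs, losses = np.asarray(lrs), np.asarray(losses)
            if kind == "min_grad":
                # 1. Minimal Gradient: learning rate at the steepest loss decrease
                grads = np.gradient(losses, np.log10(lrs))
                return lrs[np.argmin(grads)]
            if kind == "min_shift":
                # 2. Minimal Loss Shifted: loss minimum, shifted one decade to the left
                return lrs[np.argmin(losses)] / 10.0
            if kind == "appropriate":
                # 3. Appropriate Learning Rate: slide a window left from the right edge of
                #    the curve until the loss is no longer rising sharply, then take the
                #    steepest descent within the remaining part of the curve.
                window = max(len(lrs) // 10, 2)
                for end in range(len(lrs), window, -1):
                    seg = losses[end - window:end]
                    if seg[-1] - seg[0] < threshold:
                        grads = np.gradient(losses[:end], np.log10(lrs[:end]))
                        return lrs[np.argmin(grads)]
                return lrs[np.argmin(losses)] / 10.0   # fall back to the rule of thumb
            raise ValueError(f"unknown kind: {kind}")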

    Once everything was set up, we performed a typical grid search in the parameter space shown in the figure below. For runtime reasons, we performed this tuning on images that were downsampled to 25%.
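    As an illustration, a grid search of this kind can be enumerated with itertools.product over the tuning dimensions reported further below (max_rotate, max_zoom, kind_of_lr, loss, cycle_length). The value ranges and the train_and_evaluate placeholder are assumptions, not the grid actually used in the project.

        from itertools import product

        # Hypothetical grid – the actual value ranges used in the project are not shown here.
        param_grid = {
            "max_rotate":   [0, 10, 20],
            "max_zoom":     [1.0, 1.2, 1.4],
            "kind_of_lr":   ["min_grad", "min_shift", "appropriate"],
            "loss":         ["dice_loss", "focal_loss"],
            "cycle_length": [24, 48],
        }

        for values in product(*param_grid.values()):
            params = dict(zip(param_grid.keys(), values))
            # train_and_evaluate is a placeholder for the training routine, run on
            # images downsampled to 25% to keep the search tractable.
            # result = train_and_evaluate(**params)
            print(params)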

    At each step, we saved the learning rate plot, the loss plot for test and validation data, the metric plots (most important for us is the dice_iou measure), and the validation images that we defined in the inspect results function described above. After running the hyperparameter training over a weekend, we can conclude with the following results:

    • There are no clear “best values” for the parameter tuning.
    • The value for ‘dice_iou’ could be increased by approximately 5% compared to the standard parameters.
    • At the same time, the ‘f1_score’ could be increased by 2% compared to the standard parameter setting.
    • It is difficult to compare the results from different loss runs by looking at the validation metrics alone. We therefore took a random sample of the plots/metrics saved for the parameters with the highest accuracies. We decided on the following set of hyperparameters (all other parameters are set to their default values):

    max_rotate=0,
    max_zoom=1.4,
    kind_of_lr='min_grad',
    loss='dice_loss',
    cycle_length=48

    In summary, this time-consuming hyperparameter tuning did not give us the desired boost in model performance, but it did help us to better understand the model and lay the foundation for automated training in a pipeline. As a final experiment for our training process, we implemented progressive resizing: iterative training while gradually increasing the image size. The idea comes from Jeremy Howard (co-founder of fast.ai) and the fast.ai course, which can be taken online for free (https://course.fast.ai/). The intuition behind this approach is that in the first runs (training on small images), the model learns the overall shape of the segmentation masks, and in each subsequent loop it can focus more and more on the exact boundaries of the objects to be recognized. In our experiment, we started with images reduced to 25% of their original size, scaled up to 50% and then 75%, and finished training at the original image size. In each iteration, we manually set the learning rate twice (each time after inspecting the learning rate curve). In order to transfer this training process into a final pipeline, we also implemented an automatic learning rate finder, where the user can choose from the three options described above (Minimal Gradient, Minimal Loss Shifted, and Appropriate Learning Rate); a minimal sketch of such a training loop follows the results table below. The results for the dice_iou measure are as follows:

    Image size                               0.25     0.5      0.75     1.0
    Evaluated on reduced images              53.9%    67.9%    78.8%    76.8%
    Evaluated on images at original size     22.6%    55.6%    73.2%    76.8%
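    The progressive-resizing loop referenced above can be sketched in fastai roughly as follows. The dataset path, mask lookup, batch size, and image sizes are placeholders, and the lr_find suggestion attribute (here .valley) differs between fastai versions; in a pipeline it could be replaced by one of the three learning-rate strategies described earlier.

        from fastai.vision.all import *

        path = Path('data/spill')                            # hypothetical dataset location
        def label_func(fn): return path/'masks'/fn.name      # hypothetical mask lookup

        def make_dls(frac):
            """Build segmentation dataloaders at a fraction of the original resolution."""
            return SegmentationDataLoaders.from_label_func(
                path, bs=8,
                fnames=get_image_files(path/'images'),
                label_func=label_func,
                codes=['background', 'puddle'],
                item_tfms=Resize(int(720 * frac)),           # 720 px assumed as the original size
            )

        learn = unet_learner(make_dls(0.25), resnet18, metrics=Dice())
        for frac in [0.25, 0.5, 0.75, 1.0]:
            learn.dls = make_dls(frac)                       # keep the weights, swap in larger images
            lr = learn.lr_find().valley                      # suggestion attribute varies by version
            learn.fit_one_cycle(10, lr)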
         

    Summary and outlook

    As data scientists, we are always faced with the same question: How can we measure whether a model is good? What does the accuracy value really tell us? Are we looking at the right things? Of course, an accuracy of 77% on our validation set shows us that the model is learning something. But is that enough? At this point, it is good to return to our initial motivation and the real-world application for which our model is intended: detecting puddles on the floor using support robots in healthcare and industrial contexts. The puddle must be detected and the boundaries of the overall shape should be clear, but it is not necessary for the shape to be detected 100% correctly, as would be the case with a task such as a robot grasping an object. As an intermediate implementation, the robot could stop and send an alarm when it detects a puddle. It is clear that in our application, a false positive (detecting a puddle when there is none) is less critical than not detecting the puddle and spreading the unwanted liquid even further.

    When evaluating whether our model works “well,” one answer for us was to select a few images in a targeted and clever way. On the one hand, we grouped the images according to our collected metadata, and on the other hand, we found it helpful to merge the images back with the original videos to see the prediction behavior when playing the video. We would like to leave it up to the reader to judge such an output:

    In summary, we can say that the initial results for our use case look very promising. Puddles are detected in most cases and a clear trend can be seen in the videos. Of course, as always, there is room for improvement, which can be addressed in the future. Our next steps are to test the model on additional data (which is currently being recorded) and to test how well our model can be generalized, followed by implementation of the model in the robot system and real-world testing.

    Author

    [at] Editorial Team

    With extensive expertise in technology and science, our team of authors presents complex topics in a clear and understandable way. In their free time, they devote themselves to creative projects, explore new fields of knowledge and draw inspiration from research and culture.

