How to deal with missing values

26 October 2022 | Tech Deep Dive

Imputation methods and when to use them

Why we should care about missing values

In our work as data scientists, we often deal with time series data and forecasting applications. As in all "classic" machine learning projects, the data used for forecasting often contains missing values that we have to handle.

For forecasting, missing values are particularly problematic because the models usually contain time-dependent features. Time-dependent features are usually lagged or seasonal features. If the patterns of these features are broken, the models have difficulty learning these patterns. Moreover, most classical forecasting methods such as ARIMA models cannot automatically deal with missing values.

The first important step in dealing with missing values in time series is - as is so often the case - to analyse the data. This analysis helps determine whether the data are missing by chance or not. When values are missing non-randomly and the cause depends on the specific forecasting context, automated methods usually cannot handle the effects, and domain expertise is needed to deal with these cases.

An example of non-random missing data is the prediction of shop sales, where data is always missing on public holidays. On these days, sales are zero, and on the days before and after the holiday, sales usually increase. One method to overcome this problem could be to include features such as "holiday" or "day after holiday" in the model.
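Such calendar features could be built with pandas roughly as follows. This is only a minimal sketch: the function name `add_holiday_features`, the column names and the holiday date in the usage example are illustrative assumptions, and the data is assumed to be daily with a `DatetimeIndex`.

```python
import pandas as pd


def add_holiday_features(df: pd.DataFrame, holidays) -> pd.DataFrame:
    """Add 0/1 flags for holidays and the days directly before and after them.

    Assumes `df` has a daily DatetimeIndex; `holidays` is any list of dates.
    """
    out = df.copy()
    holiday_index = pd.DatetimeIndex(holidays)
    one_day = pd.Timedelta(days=1)
    out["holiday"] = out.index.isin(holiday_index).astype(int)
    out["day_before_holiday"] = out.index.isin(holiday_index - one_day).astype(int)
    out["day_after_holiday"] = out.index.isin(holiday_index + one_day).astype(int)
    return out


# Usage with made-up sales figures and a single hypothetical holiday.
days = pd.date_range("2022-10-01", periods=5, freq="D")
sales = pd.DataFrame({"sales": [120, 150, 0, 140, 110]}, index=days)
features = add_holiday_features(sales, ["2022-10-03"])
```

The model can then learn the zero sales on the holiday and the uplift around it from these flags instead of treating the days as unexplained gaps.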

An example of random missing data could be accidents or errors that occurred during the transfer of data from the operational system to a data warehouse. These accidents did not depend on the data and happened randomly at a certain time. If, on the other hand, it is an error that occurs regularly at certain times of the day or week, it would again be data that is not missing by chance.

Basically, it is important to pay attention to the following details when carrying out the analysis:

  • What is the percentage of missing data in relation to the total?
  • Do the gaps occur at regular intervals?
  • What is the length of the gaps?
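The three questions above can be answered with a few lines of pandas. The following is a sketch under the assumption that the time series is a `pd.Series` on a regular time index in which missing values are stored as `NaN`; the helper name `describe_gaps` is hypothetical.

```python
import numpy as np
import pandas as pd


def describe_gaps(series: pd.Series) -> dict:
    """Summarise the share of missing values and the position and length of gaps."""
    is_missing = series.isna()
    # A gap starts where a value is missing and the previous value is not.
    starts = is_missing & ~is_missing.shift(fill_value=False)
    # Number the gaps 1, 2, 3, ... and count the missing values in each of them.
    gap_id = starts.cumsum()[is_missing]
    gap_lengths = gap_id.value_counts().sort_index()
    return {
        "pct_missing": float(100 * is_missing.mean()),
        "n_gaps": int(starts.sum()),
        "max_gap_length": int(gap_lengths.max()) if len(gap_lengths) else 0,
        # Inspecting the start dates shows whether gaps recur at regular intervals.
        "gap_starts": list(series.index[starts]),
    }


# Usage on a small made-up monthly series with two gaps.
index = pd.date_range("2022-01-01", periods=12, freq="MS")
values = [1.0, 2.0, np.nan, np.nan, 5.0, 6.0, 7.0, np.nan, 9.0, 10.0, 11.0, 12.0]
summary = describe_gaps(pd.Series(values, index=index))
```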

The most important methods of imputation and what to consider when using them

Most of the methods we list here are well known and have already been described in detail in scientific papers and other online sources. In this article, we mainly want to compare the different methods and give advice on when to use which method.

The first and by far the simplest method for dealing with missing data is to delete it. This solution can be helpful for some classic machine learning problems, but usually not for time series. As mentioned earlier, deleting values breaks the patterns of the time series, and the model then learns from wrong patterns. Nevertheless, it is possible to select only the "healthy" part of the dataset if this makes sense with regard to the intended goals. For example, a feasible solution might be to discard the first half of the data series if the quality is better from then on (e.g., because the data collection process changed at a certain point in time). This is only viable if enough data remains to train the models.

When deleting data is not an option, we need to use imputation methods, i.e. methods to fill in the missing values. One family of imputation methods uses the mean. Here the idea is to fill in the missing values with the global, local or seasonal mean to get an appropriate replacement value. Each of these methods is useful in different contexts, depending on the trend and seasonality of the data. Another option is to fill in the missing values with the previous or the next available value, which is referred to as forward fill or backward fill, respectively.
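The mean-based and fill-based methods can be sketched in pandas as follows. The function names are our own illustrative choices, the series is assumed to be monthly with a `DatetimeIndex`, and "local" is interpreted here as a centred rolling window, which is one of several possible definitions.

```python
import numpy as np
import pandas as pd


def impute_global_mean(s: pd.Series) -> pd.Series:
    """Replace every gap with the mean of all observed values."""
    return s.fillna(s.mean())


def impute_local_mean(s: pd.Series, window: int = 3) -> pd.Series:
    """Replace gaps with the mean of observed values in a centred window.

    Gaps wider than the window stay missing, because no neighbour is in range.
    """
    local = s.rolling(window=window, center=True, min_periods=1).mean()
    return s.fillna(local)


def impute_seasonal_mean(s: pd.Series) -> pd.Series:
    """Replace gaps with the mean of the same calendar month in other years."""
    seasonal = s.groupby(s.index.month).transform("mean")
    return s.fillna(seasonal)


def impute_forward_fill(s: pd.Series) -> pd.Series:
    """Carry the previous available value forward."""
    return s.ffill()


def impute_backward_fill(s: pd.Series) -> pd.Series:
    """Carry the next available value backward."""
    return s.bfill()


# Usage on a made-up two-year monthly series with one missing value.
index = pd.date_range("2020-01-01", periods=24, freq="MS")
series = pd.Series(np.arange(1.0, 25.0), index=index)
series.iloc[14] = np.nan  # March 2021
filled = impute_seasonal_mean(series)
```

On a series with a strong trend like this one, the seasonal mean reaches back to the value of the previous March, while the local mean stays close to the neighbouring values; this is why the right choice depends on trend and seasonality.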

The last family of imputation methods we want to mention here is interpolation, i.e. estimating missing values within the observed range of a time series from the surrounding observations. This interpolation can be linear, polynomial or spline-based, which defines the type of curve used to model the data. A linear interpolation can also be adjusted for seasonality (this is called linear adjusted in the following). Another interpolation method is to use Holt-Winters, i.e. triple exponential smoothing, which captures the level (first order), trend (second order) and seasonality (third order) of the series to fill the gaps.
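To make the difference between plain and seasonally adjusted linear interpolation concrete, here is a deliberately simple sketch. It is not necessarily the adjustment used for the results in this article: the seasonal component is estimated crudely as the mean deviation per calendar month, which assumes monthly data without a strong trend. The function names are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_linear(s: pd.Series) -> pd.Series:
    """Connect the observations on both sides of a gap with a straight line."""
    # pandas also offers method="polynomial" and method="spline" (both take
    # an `order` argument), but those require SciPy to be installed.
    return s.interpolate(method="linear")


def impute_linear_seasonally_adjusted(s: pd.Series) -> pd.Series:
    """Linear interpolation on the deseasonalised series, then re-seasonalise."""
    # Crude seasonal component: mean deviation from the overall mean per month.
    seasonal = (s - s.mean()).groupby(s.index.month).transform("mean")
    deseasonalised = s - seasonal
    filled = deseasonalised.interpolate(method="linear") + seasonal
    return s.fillna(filled)


# Usage: three years of a made-up, purely seasonal pattern with a two-month gap.
pattern = [10, 12, 15, 20, 26, 30, 32, 30, 25, 18, 13, 10]
index = pd.date_range("2019-01-01", periods=36, freq="MS")
complete = pd.Series(pattern * 3, index=index, dtype=float)
gappy = complete.copy()
gappy.iloc[17:19] = np.nan  # June and July of the second year
plain = impute_linear(gappy)
adjusted = impute_linear_seasonally_adjusted(gappy)
```

In this toy series the gap covers the seasonal peak, so the straight line cuts it off, whereas the adjusted variant restores it from the same months of the other years. For data with a pronounced trend, a proper decomposition would be the more robust basis for the adjustment.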

The decision tree below summarises our decisions and consolidates the rule of thumb we use most often. Although it is a good guide, we need to check with more precise metrics and the eye test, i.e. we plot the imputed values of different methods to check which best fits the pattern of the time series and helps the models to make more accurate predictions.

Assessing the quality of imputation methods

For a better understanding of the steps we normally follow when imputing missing values, let's consider a concrete example. Here we start with a complete dataset and artificially create missing values in it to compare the imputed values with the real ones. We use the dataset from the Store Item Demand Forecasting Challenge on Kaggle, which contains five years of sales data for 50 different items in 10 different shops. As mentioned earlier, the dataset is clean and contains no missing values. To reduce complexity, we aggregated the data to a monthly frequency.
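The monthly aggregation for a single shop and item could look like the sketch below. We assume a long-format table with the columns `date`, `store`, `item` and `sales`; these column names and the helper `to_monthly` are assumptions for illustration, and the usage example runs on made-up data rather than on the Kaggle file.

```python
import pandas as pd


def to_monthly(df: pd.DataFrame, store: int, item: int) -> pd.Series:
    """Select one store/item combination and sum its daily sales per month."""
    one = df[(df["store"] == store) & (df["item"] == item)]
    daily = pd.Series(one["sales"].to_numpy(), index=pd.DatetimeIndex(one["date"]))
    # "MS" labels each month by its first day.
    return daily.resample("MS").sum()


# Usage with made-up daily data: one unit sold per day in January and February.
days = pd.date_range("2013-01-01", "2013-02-28", freq="D")
daily_sales = pd.DataFrame({"date": days, "store": 1, "item": 7, "sales": 1})
monthly = to_monthly(daily_sales, store=1, item=7)
```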

We have developed a function that randomly inserts a predefined percentage of missing values into the time series, with the option to parameterise the gaps (number and maximum length). With this function we can try different percentages of missing data and gap lengths. In this example, there are six gaps, the longest spanning three months, and 10 % of the time series is missing.
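Our own function is not reproduced here, but a comparable helper could be written as in the following sketch. The name `insert_random_gaps`, its parameters and the placement strategy are assumptions made for illustration, and the parameter values in the usage example are arbitrary.

```python
import numpy as np


def insert_random_gaps(values, pct_missing, n_gaps, max_gap_length, seed=None):
    """Return a copy of `values` with `n_gaps` randomly placed runs of NaN.

    The gaps never touch each other or the first and last value.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n_missing = int(round(len(values) * pct_missing / 100))
    n_observed = len(values) - n_missing
    if not n_gaps <= n_missing <= n_gaps * max_gap_length:
        raise ValueError("pct_missing does not fit n_gaps and max_gap_length")
    if n_observed - 1 < n_gaps:
        raise ValueError("not enough observed values to separate the gaps")

    # Every gap starts with length 1; the rest is handed out one unit at a time
    # to randomly chosen gaps that are still below the maximum length.
    lengths = np.ones(n_gaps, dtype=int)
    for _ in range(n_missing - n_gaps):
        candidates = np.flatnonzero(lengths < max_gap_length)
        lengths[rng.choice(candidates)] += 1

    # Distinct cut points inside the run of observed values guarantee at least
    # one observed value before, between and after the gaps.
    cuts = np.sort(rng.choice(np.arange(1, n_observed), size=n_gaps, replace=False))

    out = values.copy()
    for k, (cut, length) in enumerate(zip(cuts, lengths)):
        start = cut + lengths[:k].sum()  # shift by the gaps inserted so far
        out[start:start + length] = np.nan
    return out


# Usage: 60 made-up monthly values, 20 % missing in six gaps of at most three.
complete = np.linspace(100.0, 200.0, 60)
gappy = insert_random_gaps(complete, pct_missing=20, n_gaps=6, max_gap_length=3, seed=42)
```

Fixing the seed makes the experiment reproducible, so that all imputation methods are compared on exactly the same gaps.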

Since the dataset now contains missing values, we apply some of the methods listed above and contrast them with the real data by visualising them and comparing them with metrics. In the following example, we have only applied this to one of the time series from the dataset, whereas in a real project we would probably be dealing with many more time series. 

In the graph below you can see the sales data for item 7 of shop 1. The orange line describes the complete data we have. The blue line shows the data we removed from the set for our experiments. The small crosses in different colours show the imputed values calculated with different methods. You can see that the linear adjusted and the Holt-Winters imputation methods are better able to capture the variations in our time series. Because the data contains a trend and seasonality, the two methods that take trend and seasonality into account perform best. This confirms the rule of thumb we have shown in the figure above.

Since our goal is to obtain the best forecast result for future periods, we want to measure the impact of the different imputation methods on a forecast model. Therefore, we fitted the same forecasting model for each method and compared the errors for the test set (the last year of the series) using the following classical regression evaluation metrics: 

  • The symmetric mean absolute percentage error (sMAPE), which expresses accuracy as a ratio. It is very intuitive, but can become very large even for small absolute errors when the denominators are small.
  • The mean absolute error (MAE), which measures the average absolute gap between the individual forecasts and the actual values and is easier to interpret.
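Both metrics are easy to compute with NumPy. Note that several sMAPE variants are in use; the sketch below assumes the common definition with the mean of the absolute actual and forecast values in the denominator, and it treats a term as zero when both values are zero.

```python
import numpy as np


def mae(actual, forecast) -> float:
    """Mean absolute error, in the units of the data."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(forecast - actual)))


def smape(actual, forecast) -> float:
    """Symmetric mean absolute percentage error in percent (0 to 200)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denominator = (np.abs(actual) + np.abs(forecast)) / 2
    # Assumption: a term counts as 0 where actual and forecast are both 0.
    ratio = np.divide(
        np.abs(forecast - actual),
        denominator,
        out=np.zeros_like(denominator),
        where=denominator != 0,
    )
    return float(100 * np.mean(ratio))


# Usage: the same absolute error of 10 on a large and on a small actual value.
error_large_scale = smape([1000.0], [1010.0])
error_small_scale = smape([5.0], [15.0])
```

In the usage example the MAE is 10 in both cases, while the sMAPE is far larger for the small values, which illustrates the sensitivity to small denominators mentioned above.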

Below are the results of the forecasting model for the different imputation methods. As already indicated in the above graph, it can be seen that the Holt-Winters method and the linear adjusted imputation provide better forecasting results than the mean method. However, it also shows that the results of the seasonal mean method are comparable to those of the linear adjusted method. This confirms that one should try and compare several methods instead of just relying on the rule of thumb. Furthermore, we see that both metrics show the same result, which reinforces our conclusion.

Of course, this was only a model for one time series, and in a realistic example we would normally try different forecasting models for different time series. In this article we wanted to focus on the different methods of imputation and their evaluation.

Conclusion & tips for use

In summary, one should always start with an analysis of the time series data and investigate whether the data are missing by chance or whether there are certain patterns. In the latter case, one should try to find explanations with the help of one's own expertise or by consulting experts in the field. For example, when analysing data, one should already ask whether gaps occur at regular intervals, how long the gaps are and what the percentage of missing data is in relation to the total. Answering these questions will often provide information and help to find either specific solutions or suitable imputation methods according to our rule of thumb.

Even though the rule presented is a good guide and starting point for choosing the appropriate imputation method, it is also necessary to always evaluate it with care. It is enormously important and helpful to combine a visual assessment with an analytical one - as with any forecasting model.   

In the visual evaluation, check which method best fits the pattern of the time series. For the analytical evaluation, the imputed values from different imputation methods can be used to train forecast models and evaluate them on a test set of the data. This approach is useful because the goal of imputation is to achieve good forecasting results. Common forecasting metrics such as MAE or sMAPE can be useful for evaluation - but the appropriate metric always depends on the specific business objectives.
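The analytical evaluation can be organised as a small loop over imputation methods. The sketch below uses a very simple stand-in model (the forecast for each month is the training mean of that calendar month) purely for illustration; it is not the forecasting model used for the results in this article, and all function names are hypothetical.

```python
import numpy as np
import pandas as pd


def seasonal_mean_forecast(train: pd.Series, horizon: int) -> pd.Series:
    """Stand-in model: forecast each month as its calendar-month training mean."""
    monthly_mean = train.groupby(train.index.month).mean()
    future = pd.date_range(train.index[-1], periods=horizon + 1, freq="MS")[1:]
    return pd.Series(monthly_mean.reindex(future.month).to_numpy(), index=future)


def compare_imputations(gappy_train: pd.Series, test: pd.Series, imputers: dict) -> pd.Series:
    """Impute, fit the same model, and report the test MAE for each method."""
    scores = {}
    for name, impute in imputers.items():
        forecast = seasonal_mean_forecast(impute(gappy_train), len(test))
        scores[name] = float(np.mean(np.abs(forecast.to_numpy() - test.to_numpy())))
    return pd.Series(scores, name="MAE").sort_values()


# Usage on made-up monthly data: three years for training, one year for testing.
pattern = [10, 12, 15, 20, 26, 30, 32, 30, 25, 18, 13, 10]
index = pd.date_range("2018-01-01", periods=48, freq="MS")
complete = pd.Series(pattern * 4, index=index, dtype=float)
train, test = complete.iloc[:36], complete.iloc[36:]
gappy_train = train.copy()
gappy_train.iloc[17:19] = np.nan
scores = compare_imputations(
    gappy_train,
    test,
    {
        "forward fill": lambda s: s.ffill(),
        "global mean": lambda s: s.fillna(s.mean()),
        "linear": lambda s: s.interpolate(method="linear"),
    },
)
```

Because every method is scored with the same model on the same test set, differences in the error can be attributed to the imputation alone.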

Author

Arnaud Frering

Arnaud joined [at] in 2021 as a Data Science Trainee after completing his Master's degree in France. Now he works as a Data Engineer along our entire Data Journey: from Data Strategy and Data Catalogues to Data Science. He is passionate about football and basketball.
