A recent study from MIT, Harvard, the University of Monterrey, and Cambridge showed that 91% of ML models degrade over time. The study is one of the first of its kind: the researchers focus on how machine learning models behave after deployment and how their performance evolves as they encounter unseen data.
“While much research has been done on various types and markers of temporal data drifts, there is no comprehensive study of how the models themselves can respond to these drifts.”
Since we at NannyML are on a mission to babysit ML models and prevent degradation issues, this paper caught our eye. This blog post reviews the most critical parts of the research, highlights its results, and stresses why they matter, especially for the ML industry.
If you have previously been exposed to concepts like covariate shift or concept drift, you may already know that changes in the distribution of the production data can affect a model’s performance. This phenomenon is one of the challenges of maintaining an ML model in production.
By definition, ML models depend on the data they were trained on, meaning that if the distribution of the production data starts to change, the model may no longer perform as well as before. As time passes, its performance may degrade more and more. The authors like to refer to this phenomenon as “AI aging.” At NannyML, we call it model performance deterioration, and depending on how significant the drop in performance is, we consider it an ML model failure.
To better understand this phenomenon, the authors developed a testing framework for identifying temporal model degradation. They then applied the framework to 32 datasets from four industries, using four standard ML models, to investigate how temporal model degradation can develop even under minimal drifts in the data.
To avoid any model bias, the authors chose four different standard ML methods (Linear Regression, Random Forest Regressor, XGBoost, and a Multilayer Perceptron Neural Network). Each of these methods represents a different mathematical approach to learning from data. By choosing different model types, they were able to investigate similarities and differences in the way diverse models can age on the same data.
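As a rough illustration (not the authors’ code), these four model families could be instantiated in Python with scikit-learn and XGBoost; the hyperparameters below are placeholder defaults, not the settings used in the study.

```python
# Hypothetical setup of the four model families compared in the study.
# Hyperparameters are illustrative defaults, not the authors' settings.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
    "xgboost": XGBRegressor(random_state=42),
    "mlp": MLPRegressor(hidden_layer_sizes=(64, 64), random_state=42),
}
```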
Similarly, to avoid domain bias, they chose 32 datasets from four industries (Healthcare, Weather, Airport Traffic, and Financial).
Another critical decision is that they investigated only model-dataset pairs with good initial performance. This decision is crucial since it is not worthwhile to investigate the degradation of a model with a poor initial fit.
To identify temporal model performance degradation, the authors designed a framework that emulates a typical production ML model and ran multiple dataset-model experiments following it.
For each experiment, they did four things (sketched in code after the list):
- Randomly select one year of historical data as training data
- Select an ML model
- Randomly pick a future datetime point where they will test the model
- Calculate the model’s performance change
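Below is a minimal sketch of how one such “history-future” experiment might look in Python. This is our own illustration, not the authors’ code: the DataFrame `df`, the `features` and `target` columns, the `score` function, and using the gap between test and training error as the “performance change” are all assumptions.

```python
import pandas as pd

def run_experiment(df, features, target, model, score, rng):
    """One hypothetical 'history-future' simulation on a time-indexed DataFrame."""
    # 1. Randomly select one year of historical data as training data,
    #    leaving later data available as the "future" to test on.
    eligible = df.index[df.index <= df.index.max() - pd.DateOffset(years=2)]
    start = eligible[rng.integers(len(eligible))]
    train = df.loc[start : start + pd.DateOffset(years=1)]
    t0 = train.index.max()  # most recent point in the training data

    # 2. Fit the selected ML model on that year of history.
    model.fit(train[features], train[target])

    # 3. Randomly pick a future datetime point at which to test the model.
    future = df.loc[df.index > t0]
    t_test = future.index[rng.integers(len(future))]
    test = future.loc[[t_test]]

    # 4. Calculate the model's performance change (here: test error minus
    #    training error; the exact metric is an assumption on our side).
    train_error = score(train[target], model.predict(train[features]))
    test_error = score(test[target], model.predict(test[features]))
    dT = (t_test - t0).days  # the model's age in days
    return dT, test_error - train_error
```

Repeating this loop across many dataset-model pairs, with, say, `rng = numpy.random.default_rng(seed)` and `score = sklearn.metrics.mean_squared_error`, yields performance change as a function of the model’s age.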
To better understand the framework, we need a couple of definitions. The most recent point in the training data is defined as $t_0$. The number of days between $t_0$ and the future point at which the model is tested is defined as $dT$, which represents the model’s age.
For example, suppose a weather forecasting model was trained with data from January 1st to December 31st of 2022, and on February 1st, 2023, we ask it to make a weather forecast.
In this case:
- $t_0$ = December 31st, 2022, since it is the most recent point in the training data.
- $dT$ = 32 days (the number of days between December 31st and February 1st). This is the age of the model.
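The same arithmetic in Python, using the standard library’s datetime module:

```python
from datetime import date

t0 = date(2022, 12, 31)    # most recent point in the training data
t_test = date(2023, 2, 1)  # the day we ask for a forecast
dT = (t_test - t0).days    # the model's age in days
print(dT)  # 32
```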
The diagram below summarizes how they performed every “history-future” simulation.