Eventually, every software system encounters failures, and AI systems are no exception. Failures in traditional software can stem from infrastructure issues such as server downtime, overheating CPUs, or network outages. Reliance on external components, such as open source libraries, can also cause failures when incompatible updates are introduced. And human errors, such as a faulty merge that breaks a production deployment, contribute their share of failures as well.
As software systems, AI systems are susceptible not only to the typical failures inherent in software but also to failures specific to machine learning (ML). Among the most prevalent ML-specific issues are those arising from shifts in data distribution. These shifts take several forms: changes in the distribution of the input data (covariate shift), changes in the distribution of the predicted target (label shift), or changes in the relationship between input and output data (concept drift).
Software-specific system failures and their monitoring are an extensively studied area of DevOps and lie beyond the scope of this article. Instead, the focus here is on ML-specific failures: which attributes to monitor and how to implement monitoring on Databricks.
In supervised learning, machine learning models are trained to predict a target variable from input variables, also known as features. Once deployed in production, a model will keep predicting the target accurately as long as three conditions hold: first, the distribution of the input inference data resembles that of the training data; second, the relationship between the input and output variables remains unchanged; and third, the distribution of the target in the inference data resembles that of the target in the training data.
In practice, however, data evolves over time, and so does the relationship between input and output variables. Consider a model that serves as a recommendation system for rental properties. If you expand its usage to a new region whose demographics differ from the original training data, or if regional dynamics shift suddenly (like the surge in popularity of rural properties during the Covid-19 pandemic), the model may struggle to provide accurate recommendations. This is why continuous monitoring and adaptation are needed to address shifts in the data distribution and in the relationship between variables.
When it comes to maintaining model performance, your focus should be on monitoring both input and output data. Detecting covariate shift, label shift, or concept drift involves monitoring data throughout your AI system. This includes observing raw data, features, labels, and model metrics to ensure system integrity. However, some of these data sources are harder to monitor than others. For instance, access to true labels for calculating model performance metrics may be delayed in systems with long feedback loops. Cost is another consideration, especially if you run many models with thousands of features in production. The key takeaway is to monitor as much data as feasible within your budget constraints so that you can identify and address potential failures.
In industry, two common methods are employed to detect distribution shifts. The first method involves examining the statistical attributes of a dataset, such as minimum, maximum, mean, median, and so forth. Significant shifts in any of these attributes over time may suggest a change in the underlying data distribution.
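To make the first method concrete, here is a minimal sketch in Python that compares basic statistics of a numeric feature between the training set and recent inference data. The file paths, the TotalPrice column, and the 20% threshold are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical extracts of one numeric feature from training data and recent inference requests
train = pd.read_parquet("training_features.parquet")["TotalPrice"]
current = pd.read_parquet("inference_features.parquet")["TotalPrice"]

def summarize(series: pd.Series) -> pd.Series:
    """Return a few basic statistical attributes of a numeric column."""
    return series.describe()[["min", "max", "mean", "50%", "std"]]

baseline_stats = summarize(train)
current_stats = summarize(current)

# Flag attributes that moved by more than 20% relative to the training baseline (arbitrary threshold)
relative_change = (current_stats - baseline_stats).abs() / baseline_stats.abs()
print(relative_change[relative_change > 0.2])
```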
The second method is conducting a two-sample hypothesis test. This test evaluates whether the difference between two distributions (e.g., data used during training and inference data) is statistically significant. If a statistically significant difference is detected, it could indicate a shift in data distribution.
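The sketch below illustrates the second method with a two-sample Kolmogorov-Smirnov test from SciPy, run on the same hypothetical feature extracts as above; the 0.01 significance level is an arbitrary choice.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical extracts of one numeric feature from training data and a recent inference window
train = pd.read_parquet("training_features.parquet")["TotalPrice"]
current = pd.read_parquet("inference_features.parquet")["TotalPrice"]

statistic, p_value = ks_2samp(train, current)

# A small p-value suggests the two samples are unlikely to come from the same distribution
if p_value < 0.01:
    print(f"Possible covariate shift (KS statistic={statistic:.3f}, p-value={p_value:.4f})")
```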
Databricks provides a feature known as Lakehouse Monitoring, designed to oversee the statistical properties and data quality of all tables within your account. This feature goes beyond mere monitoring, extending its capabilities to track the performance of machine learning models and endpoints. It achieves this by monitoring inference tables that store model inputs and predictions. Essentially, Lakehouse Monitoring is tailored to meet the needs of both data engineering quality control and machine learning monitoring.
The high-level process to set up monitoring is as follows: (1) create a monitor on a Delta table, choosing the type of analysis and a refresh schedule; (2) inspect the profile and drift metrics tables the monitor produces; (3) review the generated AI/BI dashboard; and (4) define alerts on the metrics tables. Each step is described below.
Databricks Lakehouse Monitoring offers three distinct types of analysis: time series, snapshot, and inference. For detailed information on each type, refer to the official documentation. Together, these three types of monitoring let you oversee data throughout your machine learning pipeline: with snapshot or time series monitoring you can track raw data and features, while inference monitoring lets you track labels and model metrics (provided you supply the monitor with true labels).
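As an illustration of step 1, the sketch below creates an inference monitor with the Databricks SDK for Python. The catalog, schema, table, and column names are hypothetical, and class and parameter names may vary between SDK versions, so treat this as a starting point and check the official documentation for the exact API.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import (
    MonitorCronSchedule,
    MonitorInferenceLog,
    MonitorInferenceLogProblemType,
)

w = WorkspaceClient()

# Monitor a hypothetical inference table that stores model inputs, predictions, and (optionally) labels
w.quality_monitors.create(
    table_name="main.prod.model_inference_logs",
    assets_dir="/Workspace/Shared/lakehouse_monitoring",
    output_schema_name="main.monitoring",
    inference_log=MonitorInferenceLog(
        granularities=["1 day"],
        timestamp_col="ts",
        model_id_col="model_version",
        prediction_col="prediction",
        label_col="label",  # omit if true labels are not available
        problem_type=MonitorInferenceLogProblemType.PROBLEM_TYPE_REGRESSION,
    ),
    # Refresh the metrics tables every day at 06:00 UTC
    schedule=MonitorCronSchedule(quartz_cron_expression="0 0 6 * * ?", timezone_id="UTC"),
)
```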
When a monitor is executed on a Delta table, it generates or modifies two metric tables: a profile metrics table and a drift metrics table.
The profile metrics table includes summary statistics for each column and for each combination of time window, slice, and grouping columns. Inference analysis also encompasses model accuracy metrics.
On the other hand, the drift metrics table contains statistics that track changes in a metric's distribution over time (including two-sample tests for drift detection). Drift tables are intended for visualizing or alerting on changes in the data rather than on specific values.
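For example, you could inspect the profile metrics table for the share of null and zero values per time window, as sketched below for a Databricks notebook. The metrics table follows the naming convention of appending _profile_metrics to the monitored table name, but the schema, table, and column names here are indicative; verify them against your monitor's output.

```python
# `spark` is the SparkSession that Databricks notebooks provide out of the box.
# Table and column names are indicative; check your monitor's profile metrics table schema.
profile = spark.sql("""
    SELECT window.start AS window_start,
           column_name,
           percent_null,
           percent_zeros
    FROM main.monitoring.model_inference_logs_profile_metrics
    WHERE column_name = 'PreferredPaymentMethod'
    ORDER BY window_start DESC
""")
display(profile)
```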
When a monitor is executed, it generates a Databricks AI/BI dashboard showcasing essential metrics computed by the monitor. The visualizations featured in the default dashboard configuration vary based on the profile type. The dashboard is organized into sections that present different information, such as null and zero-value profiling and numerical and categorical distribution changes.
Figure 1: Section of the generated AI/BI dashboard presenting the percentage of null and zero values. We see a steep increase in the percentage of nulls for the PreferredPaymentMethod column on July 4th, 2024 in the top chart.
Figure 2: Section of the generated AI/BI dashboard showing numerical distribution changes detected for two columns using the Kolmogorov-Smirnov test. The line chart drills down to the TotalPrice column, with a marked increase in its daily maximum value on July 4th, 2024.
Figure 3: Section of the generated AI/BI dashboard showing categorical distribution changes detected. We drill down to the PreferredPaymentMethod column in the heat map. It shows a high percentage of null values on July 4th, 2024, indicated by the dark blue cell at the bottom right.
It's important to note that the dashboard refresh must be scheduled separately from the monitor refresh that updates the underlying metrics tables. The refresh schedule you define when creating the monitor (step 1) only refreshes the metrics tables, not the dashboard. The dashboard needs its own schedule to pick up new values from the monitor refresh.
To receive notifications when issues are detected in your ML pipeline, you can write queries against the two metrics tables and set alerts based on the results. Detailed instructions on creating these alerts can be found in the official documentation.
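As a sketch, the query below pulls columns whose consecutive-window Kolmogorov-Smirnov p-value falls below a threshold from the drift metrics table; a Databricks SQL alert could then fire when the result is non-empty. The table, column, and struct field names are assumptions to adapt to your own drift metrics schema.

```python
# `spark` is the SparkSession available in a Databricks notebook.
# Table, column, and struct field names are assumptions; align them with your drift metrics table.
drifted = spark.sql("""
    SELECT window.start AS window_start,
           column_name,
           ks_test.pvalue AS ks_pvalue
    FROM main.monitoring.model_inference_logs_drift_metrics
    WHERE drift_type = 'CONSECUTIVE'
      AND ks_test.pvalue < 0.01
""")
# A non-empty result can back a Databricks SQL alert or a notification sent from a scheduled job.
display(drifted)
```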
It's crucial to establish a strategy for handling failures within your ML system. When issues arise, such as input feature drift or declining model accuracy, a thorough investigation into the root cause is necessary. This investigation may lead to retraining the model using either fresher data or additional inputs. In some cases, it may even require reevaluating the way the ML problem is framed if significant changes in assumptions have occurred over time. In most scenarios, we recommend a human-in-the-loop retraining process, where human analysis is crucial for identifying the root cause and making the final decision.
However, once a retraining strategy is approved by humans, it's essential to automate as much of the retraining process as possible. This automation might include streamlining the process for adding new features, accessing older versions of the model and their corresponding features, implementing a roll-back strategy in case of sudden issues with newer model versions, and more. By automating these processes, you can ensure efficiency and reliability in maintaining and improving your ML system.
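One way to keep a roll-back path simple is to serve models through an alias in the MLflow Model Registry, so reverting means re-pointing the alias to a previously validated version. The sketch below assumes a hypothetical Unity Catalog model name, alias, and version number.

```python
from mlflow import MlflowClient

client = MlflowClient()

# Point the serving alias back to a previously validated model version (hypothetical name and version)
client.set_registered_model_alias(
    name="main.models.rental_recommender",
    alias="champion",
    version="3",
)
```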
All software systems, including AI, are prone to failures from infrastructure issues, external components, and human errors. AI systems also face unique challenges like data distribution shifts, which can affect performance.
The article highlights the importance of continuous monitoring to detect and address these shifts. Databricks' Lakehouse Monitoring tracks data quality and ML model performance by monitoring statistical properties and data changes. Effective monitoring involves setting up monitors, checking metrics, visualizing data in dashboards, and setting alerts.
When issues are detected, a human-in-the-loop approach to retraining models is recommended.
For more information on this topic, we recommend "Designing Machine Learning Systems," which also assisted us in writing this article.
Next blog in this series: MLOps Gym - Getting started with Version Control in Databricks