Background
I'm implementing a production MLOps pipeline for part classification using Databricks AutoML. My pipeline automatically retrains models when new data arrives and compares performance with existing production models.
The Challenge
I've encountered a data drift issue that affects fair model comparison. Here's my scenario:
- Historical parts data: door lengths range 1-4 meters
- StandardScaler fitted on this range: scaler_v1.fit([1m, 2m, 3m, 4m])
- Model v1 trained on this scaled data (minimal sketch below)
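For concreteness, here's a minimal sketch of that fitting step (the hard-coded array and the scaler_v1 name just mirror the example above; the real data comes from the pipeline):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Historical door lengths in meters (the 1-4 m range)
historical_lengths_m = np.array([[1.0], [2.0], [3.0], [4.0]])

# scaler_v1 learns mean/std from the historical distribution only
scaler_v1 = StandardScaler().fit(historical_lengths_m)

# Model v1 is then trained on this scaled representation
X_train_v1 = scaler_v1.transform(historical_lengths_m)
```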
New Data Arrives:
- New parts: door lengths now range 2-6 meters
- Different distribution: new_data = [2m, 3m, 5m, 6m]
- Question: How should I handle scaling for fair model comparison?
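To make the problem concrete, this is what happens when the new lengths go through the old scaler (self-contained toy numbers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# scaler_v1 as fitted on the historical 1-4 m data
scaler_v1 = StandardScaler().fit(np.array([[1.0], [2.0], [3.0], [4.0]]))

# New door lengths in meters (2-6 m range)
new_lengths_m = np.array([[2.0], [3.0], [5.0], [6.0]])

# The old mean/std push the new parts far outside the scaled range
# model v1 ever saw (the historical data scales to roughly -1.34 .. 1.34)
print(scaler_v1.transform(new_lengths_m).ravel())
# -> approx. [-0.45  0.45  2.24  3.13]
```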
Current Pipeline:
- New data arrives and is loaded into new_data_df
- Train a new model on the new data
- Compare it with the existing model - but this is where the scaling problem hits: in my opinion it makes no sense to compare the old and new models when their features were scaled differently (toy sketch below)
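Stripped of the Databricks/MLflow plumbing, the comparison step amounts to something like this toy stand-in (LogisticRegression and the synthetic labels are placeholders for the real AutoML model and part classes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-ins: door length in meters, binary part class (label rule is made up)
X_hist = rng.uniform(1, 4, size=(200, 1))
X_new = rng.uniform(2, 6, size=(200, 1))
y_hist = (X_hist[:, 0] > 3.0).astype(int)
y_new = (X_new[:, 0] > 3.0).astype(int)

# Existing production model: scaler + classifier fitted on historical data only
model_v1 = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_hist, y_hist)

# Candidate model: its own scaler + classifier fitted on the new data only
model_v2 = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_new, y_new)

# Comparison step: both models are scored on the same new-data slice,
# but each applies a scaler fitted on a different distribution
print("v1 F1 on new data:", f1_score(y_new, model_v1.predict(X_new)))
print("v2 F1 on new data:", f1_score(y_new, model_v2.predict(X_new)))
```

Model v1 applies statistics learned from a distribution the new parts don't follow, which is exactly why I don't trust this comparison.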
Specific Questions:
Scaling Strategy: Should I:
- Retrain the new model on combined historical + new data for consistent scaling?
- Use the original scaler on the new data (which pushes the new parts outside the scaled range the old model was trained on)?
- Fit a new scaler on the combined dataset and retrain both models? (rough sketch of these options below)
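In code, the three options I'm weighing look roughly like this (toy arrays again, just to show where each scaler gets fitted):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

hist = np.array([[1.0], [2.0], [3.0], [4.0]])  # historical lengths (m)
new = np.array([[2.0], [3.0], [5.0], [6.0]])   # new lengths (m)

# Option 1: retrain the new model on combined data with one shared scaler
combined = np.vstack([hist, new])
scaler_combined = StandardScaler().fit(combined)
X_option1 = scaler_combined.transform(combined)

# Option 2: keep scaler_v1 and push only the new data through it
scaler_v1 = StandardScaler().fit(hist)
X_option2 = scaler_v1.transform(new)  # new parts fall outside v1's fitted range

# Option 3: refit the scaler on combined data, then retrain BOTH models on it
X_option3 = scaler_combined.transform(combined)  # shared representation for both retrains
```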
Data Drift Detection
What's the best approach to detect when scaling changes require full retraining vs. incremental updates? In MLOps, should I:
- Always retrain on cumulative data (historical + new)?
- Use statistical tests to detect a drift threshold (example of what I mean below)?
- Maintain separate scalers and transform data appropriately for each model?
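By "statistical tests" I mean something along the lines of a per-feature two-sample Kolmogorov-Smirnov test (scipy's ks_2samp; the 0.05 cutoff is just a placeholder threshold):

```python
import numpy as np
from scipy.stats import ks_2samp

# In production these would be the full historical and incoming batches,
# not four-point toy arrays
hist_lengths = np.array([1.0, 2.0, 3.0, 4.0])
new_lengths = np.array([2.0, 3.0, 5.0, 6.0])

# Two-sample KS test on the raw (unscaled) feature; a small p-value
# indicates the distributions differ enough to consider refitting
stat, p_value = ks_2samp(hist_lengths, new_lengths)
drift_detected = p_value < 0.05  # placeholder threshold, tune per feature
```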
Environment:
- Databricks AutoML for model training
- MLflow Model Registry for deployment
- StandardScaler for feature scaling
- Engineering data with naturally evolving distributions