
Data Drift & Model Comparison in Production MLOps: Handling Scale Changes with AutoML

spearitchmeta
New Contributor III

Background

I'm implementing a production MLOps pipeline for part classification using Databricks AutoML. My pipeline automatically retrains models when new data arrives and compares performance with existing production models.

The Challenge

I've encountered a data drift issue that affects fair model comparison. Here's my scenario:

  1. Initial parts data: door lengths range 1-4 meters
  2. StandardScaler fitted on this range: scaler_v1.fit([1m, 2m, 3m, 4m])
  3. Model v1 trained on this scaled data

New Data Arrives:

  1. New parts: door lengths now range 2-6 meters
  2. Different distribution: new_data = [2m, 3m, 5m, 6m]
  3. Question: How should I handle scaling for fair model comparison?
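To make the scenario concrete, here is a minimal sketch in plain Python. The helpers `fit_standardizer` and `transform` are hypothetical stand-ins for StandardScaler (they use the mean and the population standard deviation, as StandardScaler does), and the numbers are the illustrative lengths from above, not real data. It shows that the same physical length gets a different scaled value under each scaler, and that a standard-score transform does not clip values outside the fitted range; they simply land further from zero.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return (mean, std) using the population std, as StandardScaler does."""
    return fmean(values), pstdev(values)


def transform(values, params):
    """Apply z = (x - mean) / std with previously fitted parameters."""
    mean, std = params
    return [(v - mean) / std for v in values]


old_lengths = [1.0, 2.0, 3.0, 4.0]  # meters, original training range
new_lengths = [2.0, 3.0, 5.0, 6.0]  # meters, newly arrived range

scaler_v1 = fit_standardizer(old_lengths)
scaler_v2 = fit_standardizer(new_lengths)

# New data pushed through the ORIGINAL scaler: nothing is clipped,
# the 5 m and 6 m doors just get larger z-scores than anything seen before.
new_scaled_by_v1 = transform(new_lengths, scaler_v1)
old_scaled_by_v1 = transform(old_lengths, scaler_v1)

# The same 3 m door has a different scaled value under each scaler.
three_m_v1 = transform([3.0], scaler_v1)[0]
three_m_v2 = transform([3.0], scaler_v2)[0]

print("new data via scaler_v1:", [round(z, 3) for z in new_scaled_by_v1])
print("3 m via scaler_v1:", round(three_m_v1, 3))
print("3 m via scaler_v2:", round(three_m_v2, 3))
```

So the scaled features of the two models are not interchangeable, which is why feeding one model's scaled inputs to the other would be meaningless.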

Current Pipeline:

  • New data arrives
  • new_data_df
  • Train new model
  • Compare with the existing model. This is where the scaling problem appears: in my opinion, it makes no sense to compare the old model with the new one when each was trained on differently scaled data

Specific Questions:

Scaling Strategy: Should I:

  • Retrain new model on combined historical + new data for consistent scaling?
  • Use original scaler on new data (might clip values outside original range)?
  • Fit new scaler on combined dataset and retrain both models?
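One way to think about these options is sketched below, under loud assumptions: `ScaledCentroidModel` is a hypothetical toy (a standardizer bundled with a nearest-centroid classifier), not what AutoML produces, and the labels and hold-out set are invented for illustration. The idea is that each model version keeps its own fitted scaler, and both versions are scored on the same raw (unscaled) hold-out data, so the comparison does not depend on which scaler was fitted when.

```python
from statistics import fmean, pstdev


class ScaledCentroidModel:
    """Toy model version: its own standardizer plus a nearest-centroid classifier."""

    def fit(self, lengths, labels):
        # The scaler parameters are fitted once and stored with the model.
        self.mean = fmean(lengths)
        self.std = pstdev(lengths)
        scaled = self._scale(lengths)
        self.centroids = {
            label: fmean([s for s, y in zip(scaled, labels) if y == label])
            for label in set(labels)
        }
        return self

    def _scale(self, lengths):
        return [(x - self.mean) / self.std for x in lengths]

    def predict(self, raw_lengths):
        # Raw inputs go in; the model applies its OWN scaler internally.
        return [
            min(self.centroids, key=lambda label: abs(s - self.centroids[label]))
            for s in self._scale(raw_lengths)
        ]


def accuracy(model, raw_lengths, labels):
    preds = model.predict(raw_lengths)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical labels: doors up to 3.5 m are "small", longer ones "large".
old_x, old_y = [1.0, 2.0, 3.0, 4.0], ["small", "small", "small", "large"]
new_x, new_y = [2.0, 3.0, 5.0, 6.0], ["small", "small", "large", "large"]

model_v1 = ScaledCentroidModel().fit(old_x, old_y)                  # historical data only
model_v2 = ScaledCentroidModel().fit(old_x + new_x, old_y + new_y)  # historical + new

# One shared hold-out set, kept in raw meters.
holdout_x = [1.5, 2.5, 3.3, 4.5, 5.5]
holdout_y = ["small", "small", "small", "large", "large"]

acc_v1 = accuracy(model_v1, holdout_x, holdout_y)
acc_v2 = accuracy(model_v2, holdout_x, holdout_y)
print("v1 accuracy:", acc_v1, "| v2 accuracy:", acc_v2)
```

With this layout the existing model never needs to be retrained or rescaled just to be compared; only the evaluation data has to be shared and unscaled.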

Data Drift Detection

What's the best approach to detect when scaling changes require full retraining vs. incremental updates? In MLOps, should I:

  • Always retrain on cumulative data (historical + new)?

  • Use statistical tests to detect drift threshold?

  • Maintain separate scalers and transform data appropriately for each model?
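As an illustration of the statistical-test option, the sketch below computes a two-sample Kolmogorov-Smirnov statistic on a single feature in plain Python and compares it with the usual large-sample critical value. The function names are hypothetical, the 0.05 significance level is an assumption, and the large-sample approximation is unreliable for tiny samples such as the four-value lists above, so the second check uses synthetic evenly spaced values over the same ranges.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


def ks_critical_value(n, m, c_alpha=1.358):
    """Large-sample critical value; c_alpha=1.358 corresponds to alpha=0.05."""
    return c_alpha * sqrt((n + m) / (n * m))


def drift_detected(reference, current):
    """Return (flag, statistic); flag is True when the statistic exceeds the threshold."""
    d = ks_statistic(reference, current)
    return d > ks_critical_value(len(reference), len(current)), d


# The four-point example: too few samples for the test to flag anything.
small_flag, small_d = drift_detected([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 6.0])

# Synthetic, evenly spaced lengths over 1-4 m and 2-6 m (200 values each).
reference = [1 + 3 * i / 199 for i in range(200)]
current = [2 + 4 * i / 199 for i in range(200)]
large_flag, large_d = drift_detected(reference, current)

print("small sample:", small_flag, round(small_d, 3))
print("large sample:", large_flag, round(large_d, 3))
```

A flag like this could gate the decision between keeping the current model and triggering a full retrain on cumulative data; the threshold and the choice of test are judgment calls for the specific pipeline.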

Environment:

  • Databricks AutoML for model training
  • MLflow Model Registry for deployment
  • StandardScaler for feature scaling
  • Engineering data with naturally evolving distributions
2 REPLIES

BigRoux
Databricks Employee

Have you explored Lakehouse Monitoring?  It provides a comprehensive solution for drift detection.  You can read more here: https://docs.databricks.com/aws/en/lakehouse-monitoring/

 

Hope this helps, Louis.

Hey Louis,

Thanks for sharing. Yes, I stumbled on this article a few days ago, and I will investigate its potential in the coming days and maybe come back to you with some questions. Otherwise, I will close the topic and mark your answer as the solution 🙂
If you are interested, I also got some interesting answers here, in case Lakehouse Monitoring does not work out.

https://datascience.stackexchange.com/questions/134303/data-drift-model-comparison-in-production-mlo...