Background
I'm implementing a production MLOps pipeline for part classification using Databricks AutoML. My pipeline automatically retrains models when new data arrives and compares performance with existing production models.
The Challenge
I've encountered a data drift issue that affects fair model comparison. Here's my scenario:
- Historical parts data: door lengths range 1-4 meters
- StandardScaler fitted on this range: scaler_v1.fit([1m, 2m, 3m, 4m])
- Model v1 trained on this scaled data (minimal sketch below)
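For concreteness, here's a minimal sketch of that fitting step (the hard-coded array and the scaler_v1 name just mirror the example above; the real data comes from the pipeline):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Historical door lengths in meters (the 1-4 m range)
historical_lengths_m = np.array([[1.0], [2.0], [3.0], [4.0]])

# scaler_v1 learns mean/std from the historical distribution only
scaler_v1 = StandardScaler().fit(historical_lengths_m)

# Model v1 is then trained on this scaled representation
X_train_v1 = scaler_v1.transform(historical_lengths_m)
```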
New Data Arrives:
- New parts: door lengths now range 2-6 meters
- Different distribution: new_data = [2m, 3m, 5m, 6m]
- Question: How should I handle scaling for fair model comparison?
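To make the problem concrete, this is what happens when the new lengths go through the old scaler (self-contained toy numbers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# scaler_v1 as fitted on the historical 1-4 m data
scaler_v1 = StandardScaler().fit(np.array([[1.0], [2.0], [3.0], [4.0]]))

# New door lengths in meters (2-6 m range)
new_lengths_m = np.array([[2.0], [3.0], [5.0], [6.0]])

# The old mean/std push the new parts far outside the scaled range
# model v1 ever saw (the historical data scales to roughly -1.34 .. 1.34)
print(scaler_v1.transform(new_lengths_m).ravel())
# -> approx. [-0.45  0.45  2.24  3.13]
```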
Current Pipeline:
- New data arrives and is loaded into new_data_df
- Train a new model on the new data
- Compare it with the existing model - but this is where the scaling problem hits: in my opinion it makes no sense to compare the old and new models when their features were scaled differently (toy sketch below)
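Stripped of the Databricks/MLflow plumbing, the comparison step amounts to something like this toy stand-in (LogisticRegression and the synthetic labels are placeholders for the real AutoML model and part classes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-ins: door length in meters, binary part class (label rule is made up)
X_hist = rng.uniform(1, 4, size=(200, 1))
X_new = rng.uniform(2, 6, size=(200, 1))
y_hist = (X_hist[:, 0] > 3.0).astype(int)
y_new = (X_new[:, 0] > 3.0).astype(int)

# Existing production model: scaler + classifier fitted on historical data only
model_v1 = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_hist, y_hist)

# Candidate model: its own scaler + classifier fitted on the new data only
model_v2 = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_new, y_new)

# Comparison step: both models are scored on the same new-data slice,
# but each applies a scaler fitted on a different distribution
print("v1 F1 on new data:", f1_score(y_new, model_v1.predict(X_new)))
print("v2 F1 on new data:", f1_score(y_new, model_v2.predict(X_new)))
```

Model v1 applies statistics learned from a distribution the new parts don't follow, which is exactly why I don't trust this comparison.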
Specific Questions:
Scaling Strategy: Should I:
- Retrain the new model on combined historical + new data for consistent scaling?
- Use the original scaler on the new data (which pushes the new parts outside the scaled range the old model was trained on)?
- Fit a new scaler on the combined dataset and retrain both models? (rough sketch of these options below)
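In code, the three options I'm weighing look roughly like this (toy arrays again, just to show where each scaler gets fitted):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

hist = np.array([[1.0], [2.0], [3.0], [4.0]])  # historical lengths (m)
new = np.array([[2.0], [3.0], [5.0], [6.0]])   # new lengths (m)

# Option 1: retrain the new model on combined data with one shared scaler
combined = np.vstack([hist, new])
scaler_combined = StandardScaler().fit(combined)
X_option1 = scaler_combined.transform(combined)

# Option 2: keep scaler_v1 and push only the new data through it
scaler_v1 = StandardScaler().fit(hist)
X_option2 = scaler_v1.transform(new)  # new parts fall outside v1's fitted range

# Option 3: refit the scaler on combined data, then retrain BOTH models on it
X_option3 = scaler_combined.transform(combined)  # shared representation for both retrains
```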
Data Drift Detection
What's the best approach to detect when scaling changes require full retraining vs. incremental updates? In MLOps, should I:
- Always retrain on cumulative data (historical + new)?
- Use statistical tests to detect a drift threshold (example of what I mean below)?
- Maintain separate scalers and transform data appropriately for each model?
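By "statistical tests" I mean something along the lines of a per-feature two-sample Kolmogorov-Smirnov test (scipy's ks_2samp; the 0.05 cutoff is just a placeholder threshold):

```python
import numpy as np
from scipy.stats import ks_2samp

# In production these would be the full historical and incoming batches,
# not four-point toy arrays
hist_lengths = np.array([1.0, 2.0, 3.0, 4.0])
new_lengths = np.array([2.0, 3.0, 5.0, 6.0])

# Two-sample KS test on the raw (unscaled) feature; a small p-value
# indicates the distributions differ enough to consider refitting
stat, p_value = ks_2samp(hist_lengths, new_lengths)
drift_detected = p_value < 0.05  # placeholder threshold, tune per feature
```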
Environment:
- Databricks AutoML for model training
- MLflow Model Registry for deployment
- StandardScaler for feature scaling
- Engineering data with naturally evolving distributions