Databricks Community

ddpotapov · 01-17-2025

Hi Databricks Team,

I am trying to understand the "model from code" approach. I am reading your Big Book of MLOps.

Is it correct that with this approach I need to train the model twice, once in development and once in production?
I am asking because training can be very expensive, so running it in both development and production would be costly.

Thank you.

Alberto_Umana · 01-17-2025

Hello @ddpotapov,

You do not necessarily have to train the model twice (in development and in production). There are two common patterns for moving ML artifacts through staging and into production:

Deploy Code Approach:

In this pattern, the code to train models is developed in the development environment. The same code moves to staging and then production.
The model is trained in each environment: initially in the development environment as part of model development, in staging (on a limited subset of data) as part of integration tests, and in the production environment (on the full production data) to produce the final model.
Because the final model is trained in the production environment on production data, this approach can be beneficial when access to production data is restricted in the development environment.
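To make the Deploy Code pattern concrete, here is a minimal sketch in plain Python. It is not a Databricks or MLflow API: `EnvConfig`, `ENVIRONMENTS`, `train`, and `run_training_job` are hypothetical names, the toy "model" is a one-parameter least-squares fit, and the 10% staging fraction is an assumed value standing in for "a limited subset of data". The point is only that one training function is promoted unchanged and re-run in each environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    """Hypothetical per-environment settings; the code itself never changes."""
    name: str
    data_fraction: float  # share of the environment's data used for training


# Assumed values for illustration only.
ENVIRONMENTS = {
    "dev": EnvConfig("dev", 1.0),          # model development
    "staging": EnvConfig("staging", 0.1),  # integration test on a small subset
    "prod": EnvConfig("prod", 1.0),        # final model on full production data
}


def train(xs, ys):
    """Toy training step: fit y = slope * x by least squares through the origin."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs)
    return {"slope": numerator / denominator}


def run_training_job(env_name, xs, ys):
    """Run the same training code against whatever data the environment exposes."""
    cfg = ENVIRONMENTS[env_name]
    rows = max(1, int(len(xs) * cfg.data_fraction))
    model = train(xs[:rows], ys[:rows])
    return {"env": cfg.name, "rows_used": rows, "model": model}


if __name__ == "__main__":
    xs = list(range(1, 21))
    ys = [2 * x for x in xs]
    for env in ("dev", "staging", "prod"):
        print(run_training_job(env, xs, ys))
```

In this sketch the artifact that reaches production is the code, so the training cost is paid once per environment.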

Deploy Models Approach:

In this pattern, the model artifact is generated by training code in the development environment. The artifact is then tested in the staging environment before being deployed into production.
This approach is suitable when model training is very expensive or hard to reproduce, and it only requires training the model once in the development environment.
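The Deploy Models pattern can be sketched the same way. Again this is an illustration under assumptions, not a Databricks or MLflow API: `package`, `fingerprint`, `staging_check`, and `promote` are hypothetical helpers, a plain dictionary stands in for a model registry, and the model is a toy one-parameter fit. What it shows is that training runs once and the same serialized artifact is tested and then promoted.

```python
import hashlib
import json


def train(xs, ys):
    """Toy training step: fit y = slope * x by least squares through the origin."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs)
    return {"slope": numerator / denominator}


def package(model):
    """Serialize the trained model into an artifact (bytes)."""
    return json.dumps(model, sort_keys=True).encode("utf-8")


def fingerprint(artifact):
    """Content hash used to confirm the promoted artifact is unchanged."""
    return hashlib.sha256(artifact).hexdigest()


def staging_check(artifact, xs, ys, tolerance=1e-6):
    """Hypothetical staging test: load the artifact and validate its predictions."""
    model = json.loads(artifact.decode("utf-8"))
    return all(abs(model["slope"] * x - y) <= tolerance for x, y in zip(xs, ys))


def promote(artifact, registry, stage):
    """Copy the artifact to a stage in a dict standing in for a model registry."""
    registry[stage] = artifact


if __name__ == "__main__":
    registry = {}
    # Training happens exactly once, in development.
    artifact = package(train([1, 2, 3], [3, 6, 9]))
    promote(artifact, registry, "dev")
    if staging_check(registry["dev"], [4, 5], [12, 15]):
        promote(registry["dev"], registry, "prod")
    print(fingerprint(registry["dev"]) == fingerprint(registry["prod"]))
```

Here the artifact that reaches production is the model itself, so no retraining happens after development.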

The choice between these patterns depends on factors such as the cost of model training, access to production data, and the complexity of the deployment process.
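As a rough illustration of how those factors pull in different directions, the sketch below encodes only the two rules of thumb stated above: expensive or hard-to-reproduce training favors deploying models, and restricted access to production data favors deploying code. The function `suggest_pattern` and its parameters are hypothetical, and a real decision would weigh more than these booleans, including deployment complexity.

```python
def suggest_pattern(training_is_expensive, hard_to_reproduce, dev_can_read_prod_data):
    """Return a rule-of-thumb suggestion; hypothetical helper, not an official rule."""
    favors_deploy_models = training_is_expensive or hard_to_reproduce
    favors_deploy_code = not dev_can_read_prod_data

    if favors_deploy_models and not favors_deploy_code:
        return "deploy models"
    if favors_deploy_code and not favors_deploy_models:
        return "deploy code"
    # Either both factors apply or neither does, so there is no simple answer.
    return "no clear winner: weigh the factors"


if __name__ == "__main__":
    # Expensive training, and development can read production data.
    print(suggest_pattern(True, False, True))
    # Cheap training, but production data is not readable from development.
    print(suggest_pattern(False, False, False))
```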

ddpotapov · 01-17-2025

Thank you for your answer.
You said:
initially in the development environment as part of model development

What does this mean?

Usually, I take a model and run many training experiments with different hyperparameters. Once I find the best parameters, I train the model one last time to get the final model. During these experiments, I use all the data I have for training.

In this case, I will already have a final trained model in the development environment, and after staging, the Deploy Code Approach requires me to train the model again in the production environment on the same data.

Can you clarify this point? Maybe I don't understand something. Thank you.


Model from code approach
