Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Deployment as code pattern with double training effort?

datastones
New Contributor II

Hi everybody, I have a question re: the deployment as code pattern on databricks. I found and watched a great demo here: https://www.youtube.com/watch?v=JApPzAnbfPI

My question is: in the case where dev has read access to prod data, the deployment as code pattern would basically require us to redo the training in prod on the same prod data we already trained on in dev. This double training effort seems quite redundant to me. Assuming the model is relatively time-consuming to train, how exactly can we address this with the deployment as code pattern?

Could anyone tell me what I'm missing here? Thank you very much!

1 ACCEPTED SOLUTION

Kaniz_Fatma
Community Manager

Hi @datastones! There are a couple of ways to address the redundant model retraining when using the deployment as code pattern on Databricks:

  1. Use the "deploy models" paradigm instead of "deploy code"
    • In this approach, you develop and train the model in the dev environment
    • Test the resulting model artifact in staging
    • Promote the pre-trained model to production
    • Avoids retraining the same model multiple times
    • However, you lose some of the lineage and traceability benefits of deploying the training code
  2. Use a central model registry accessible from all environments
    • Store the trained model in a central model registry
    • Reference that registry from the dev, staging and prod workspaces
    • Allows reusing the same trained model without retraining
    • But you may lose some lineage if the model registry is not in the same workspace as the one serving the model in production
  3. Use synthetic or obfuscated data in dev/staging if full prod data access is restricted
    • Provision a subset or synthetic version of production data for dev and staging
    • Allows testing the training code without retraining on the full prod data
    • But you lose the ability to fully test on realistic data
  4. Maintain a "scratch" dev data space separate from the production data mirror
    • Have a dev data space for exploratory work and testing pipelines
    • Keep this separate from the production data mirror used for final training
    • Allows iterating on feature engineering and model tuning without redundant training
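To make options 1 and 2 concrete, here is a minimal sketch of the "train once in dev, promote the same artifact" flow. The `ModelRegistry` class below is an in-memory stand-in for a central model registry, and all names in it are illustrative; in a real Databricks setup you would use MLflow's registry APIs (e.g. `mlflow.register_model` and model version aliases) rather than rolling your own.

```python
class ModelRegistry:
    """In-memory stand-in for a central model registry shared by
    dev, staging, and prod (illustrative only)."""

    def __init__(self):
        self._versions = {}  # (name, version) -> model artifact
        self._aliases = {}   # (name, alias) -> version

    def register(self, name, artifact):
        """Register a new version of a model and return its version number."""
        version = sum(1 for (n, _) in self._versions if n == name) + 1
        self._versions[(name, version)] = artifact
        return version

    def set_alias(self, name, alias, version):
        """Point an alias (e.g. 'candidate', 'champion') at a version."""
        self._aliases[(name, alias)] = version

    def load(self, name, alias):
        """Load the artifact the alias currently points to."""
        return self._versions[(name, self._aliases[(name, alias)])]


registry = ModelRegistry()

# Dev: the expensive training happens exactly once; register the artifact.
# (The dict is a placeholder for a real trained model.)
v = registry.register("churn_model", {"weights": [0.1, 0.2]})
registry.set_alias("churn_model", "candidate", v)

# Staging: validate the *same* artifact -- no retraining.
candidate = registry.load("churn_model", "candidate")
assert candidate["weights"]  # stand-in for real validation checks

# Prod: promote by moving an alias to the validated version, not by retraining.
registry.set_alias("churn_model", "champion", v)
prod_model = registry.load("churn_model", "champion")
assert prod_model is candidate  # identical artifact, trained exactly once
```

The key point the sketch tries to show: promotion is a metadata operation (moving an alias/stage pointer in the registry), so the training cost is paid only once, at the price of the lineage trade-offs noted above. Check the MLflow Model Registry docs for the exact API your workspace supports.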


2 REPLIES

datastones
New Contributor II

Thank you very much for your help, Fatma! I'll take those considerations into account. Cheers.
