Model storage requirements management

invalidargument
New Contributor II

Hi.

We have around 30 models in model storage that we use for batch scoring. These were created at different times, by different people, and on different cluster runtimes.

Now we have run into problems where we can't deserialize the models and use them for inference, because the versions of Spark and/or sklearn are mismatched.

What I've tried:

  • using the requirements.txt from the respective model together with pip install (a sketch of this approach follows the list below). Problems with this solution:
    1. it's not possible to change the Spark version on a cluster with pip install, and deserialization of the model depends on Spark
    2. sometimes the auto-generated requirements.txt from mlflow.log_model() contains incompatible package versions, and pip install exits with an error code
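For reference, this is roughly what that approach looks like (the model URI is just an example, and it assumes MLflow >= 1.25 so that mlflow.pyfunc.get_model_dependencies is available):

    # Fetch the requirements.txt that mlflow.log_model() stored next to the model,
    # then pip-install it before trying to deserialize the model.
    import subprocess
    import sys
    import mlflow

    model_uri = "models:/churn_model/3"  # example registered model and version

    # Local path to the model's pinned requirements.txt
    req_path = mlflow.pyfunc.get_model_dependencies(model_uri)

    # Installs the Python packages only; it cannot change the cluster's Spark version.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", req_path])

    model = mlflow.pyfunc.load_model(model_uri)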

My questions are:

  1. Is there a recommended way of handling (batch) scoring and keeping track of the combination of cluster runtime and requirements for each model? Is there any Databricks documentation I can read?
  2. Can I find in the Model Registry which cluster ID or runtime a model was created on?

Thanks.

1 REPLY

Anonymous
Not applicable

@Jonas Lindberg:

To address the issues you are facing with model serialization and versioning, I would recommend the following approach:

  1. Use MLflow to manage the lifecycle of your models, including versioning, deployment, and monitoring. MLflow provides tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models, so each model stays associated with the right set of dependencies and runtime environment.
  2. Define a consistent environment for your models. Use a requirements file to pin the packages and dependencies required by each model, and record which versions of Spark and scikit-learn it was trained with, so the model can be deserialized for inference (a sketch of logging a model this way follows this list). You can also build a Docker image with the required environment and deploy your models as containers to keep the runtime consistent.
  3. Use Databricks Jobs to run your batch scoring jobs, and specify the runtime environment and dependencies for each job. Databricks Jobs let you choose the cluster and runtime version for each job, as well as the input data and output location, and they can also schedule and monitor the runs (see the Jobs API sketch after this list).
  4. Use the MLflow Model Registry to manage the deployment of your models to different environments, such as staging or production. The Model Registry tracks model versions and their lifecycle stages, lets you promote models from staging to production, and links each version back to the run that created it.
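As a rough sketch of points 1, 2 and 4 (not a definitive recipe): log the model with explicitly pinned requirements, tag the run with the environment it was trained on, and register it. The model name, training data and tag names are assumptions for illustration; the DATABRICKS_RUNTIME_VERSION environment variable and the spark.databricks.clusterUsageTags.clusterId config are Databricks-specific and assume this runs in a Databricks notebook where spark is available:

    import os
    import sklearn
    import mlflow
    import mlflow.sklearn
    from mlflow.tracking import MlflowClient
    from sklearn.ensemble import RandomForestClassifier

    # Tiny stand-in training set; replace with your real features/labels.
    X_train = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.2, 0.9]]
    y_train = [0, 1, 0, 1]
    model = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

    with mlflow.start_run() as run:
        # Record the environment the model was created on as searchable run tags.
        mlflow.set_tag("databricks_runtime", os.environ.get("DATABRICKS_RUNTIME_VERSION", "unknown"))
        mlflow.set_tag("cluster_id", spark.conf.get("spark.databricks.clusterUsageTags.clusterId", "unknown"))
        mlflow.set_tag("sklearn_version", sklearn.__version__)

        # Pin the exact scikit-learn version instead of relying only on the
        # auto-generated requirements.txt, and register the model in one step.
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            pip_requirements=[f"scikit-learn=={sklearn.__version__}"],
            registered_model_name="churn_model",
        )

    # Promote the newly registered version (point 4).
    client = MlflowClient()
    latest = client.get_latest_versions("churn_model", stages=["None"])[0]
    client.transition_model_version_stage("churn_model", latest.version, stage="Staging")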
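And for point 3, a hedged sketch of creating a batch-scoring job whose job cluster is pinned to the Databricks Runtime the model needs, using the Jobs 2.1 REST API; the workspace URL, token, notebook path, node type and runtime string are all placeholders you would replace:

    import requests

    host = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    token = "<personal-access-token>"                        # placeholder

    job_spec = {
        "name": "churn-batch-scoring",
        "tasks": [
            {
                "task_key": "score",
                "notebook_task": {"notebook_path": "/Repos/ml/batch_score"},
                "new_cluster": {
                    # Pin the runtime so Spark matches what the model was trained on.
                    "spark_version": "11.3.x-cpu-ml-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 2,
                },
            }
        ],
    }

    resp = requests.post(
        f"{host}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {token}"},
        json=job_spec,
    )
    resp.raise_for_status()
    print("Created job:", resp.json()["job_id"])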

To answer your specific questions:

  1. Databricks provides documentation on best practices for managing ML models, including versioning, deployment, and monitoring; see the MLflow and Model Registry sections of the Databricks documentation.
  2. The Model Registry itself does not record the cluster ID, but every model version is linked to the MLflow run that created it. The run's logged requirements.txt/conda.yaml and any tags you set (for example the Databricks Runtime version or cluster ID) tell you which environment the model was built on, and you can look these up from the registry entry (see the lookup sketch below).
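A small lookup sketch along these lines, reusing the hypothetical model name and tag keys from the logging sketch above (if those tags were never set, the values simply come back as '?'):

    from mlflow.tracking import MlflowClient

    client = MlflowClient()

    # Each registered model version points at the run that created it;
    # the run's tags and logged requirements describe the training environment.
    for mv in client.search_model_versions("name='churn_model'"):
        tags = client.get_run(mv.run_id).data.tags
        print(
            f"version {mv.version}: "
            f"runtime={tags.get('databricks_runtime', '?')}, "
            f"cluster={tags.get('cluster_id', '?')}, "
            f"sklearn={tags.get('sklearn_version', '?')}"
        )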
