Databricks Community

invalidargument · ‎01-18-2023

Hi.

We have around 30 models in model storage that we use for batch scoring. These were created at different times, by different people, and on different cluster runtimes.

Now we have run into the problem that we can't deserialize the models and use them for inference, since there are mismatched versions of Spark and/or sklearn.

What I've tried:

Using the requirements.txt from the respective model together with pip install.

Problems with this solution:
1. It's not possible to change the Spark version on a cluster with pip install, and deserializing the model depends on Spark.
2. Sometimes the autogenerated requirements.txt from mlflow.log_model() contains incompatible package versions, and pip install exits with an error code.

My questions are:

1. Is there a recommended way of handling (batch) scoring and keeping track of the combination of cluster runtime and requirements for each model? Is there any Databricks documentation I can read?
2. Can I find in the Model Registry which cluster ID or runtime a model was created on?

Thanks.

Anonymous · ‎04-10-2023

@Jonas Lindberg:

To address the issues you are facing with model serialization and versioning, I would recommend the following approach:

1. Use MLflow to manage the lifecycle of your models, including versioning, deployment, and monitoring. MLflow is an open-source platform that provides tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models. By using MLflow to manage your models, you can ensure that each model is associated with the right set of dependencies and runtime environment.
2. Define a consistent environment for your models. Use a requirements file to specify the packages and dependencies required by your models, and ensure that each model is associated with the right version of Spark and scikit-learn. This will help ensure that the models can be deserialized and used for inference. You can also create a Docker image with the required environment and deploy your models as containers to ensure consistency across different runtime environments.
3. Use Databricks Jobs to run your batch scoring jobs, and specify the runtime environment and dependencies for each job. Databricks Jobs let you specify the cluster and runtime environment for your jobs, as well as the input data and output location. You can also use Databricks Jobs to schedule and monitor your jobs.
4. Use MLflow Model Registry to manage the deployment of your models to different environments, such as production or staging. Model Registry lets you promote models from staging to production and track the performance of your models over time.
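
For the requirements-file point above, it can help to compare a model's pinned packages with what is installed on the cluster before you try to load the model, so a mismatch shows up as a clear message rather than a deserialization error. The following is only a minimal sketch using the Python standard library. The function names parse_pins and find_mismatches are made up for illustration, and it only looks at exact "name==version" pins, which is an assumption about what your requirements.txt files contain.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form 'name==version'.

    Lines without an exact '==' pin (ranges, URLs, options) are skipped.
    """
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0].strip().lower()  # drop extras like pkg[extra]
        version = version.split(";", 1)[0].strip()    # drop environment markers
        pins[name] = version
    return pins


def find_mismatches(pins, installed=None):
    """Return {package: (wanted, installed_or_None)} for every pin that differs.

    If `installed` is None, versions are looked up in the current environment.
    """
    mismatches = {}
    for name, wanted in pins.items():
        if installed is not None:
            have = installed.get(name)
        else:
            try:
                have = metadata.version(name)
            except metadata.PackageNotFoundError:
                have = None
        if have != wanted:
            mismatches[name] = (wanted, have)
    return mismatches
```

You could run find_mismatches(parse_pins(text)) on the contents of each model's requirements.txt at the start of the scoring job and skip or reroute any model whose Spark or scikit-learn pin does not match the cluster. Because the Spark version cannot be changed with pip, a pyspark mismatch means the model needs a cluster on a different runtime, not a pip install.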

To answer your specific questions:

Databricks provides documentation on best practices for managing ML models, including versioning, deployment, and monitoring. You can refer to the following resources for more information:

MLflow documentation: https://mlflow.org/docs/latest/index.html
Databricks documentation on machine learning: https://docs.databricks.com/applications/machine-learning/index.html

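MLflow stores the dependencies of each logged model, including the scikit-learn version and the pyspark version where applicable, in the requirements.txt and conda.yaml files saved with the model, and each registered model version links back to the run that created it. As far as I know, the Model Registry does not let you search on these attributes directly, so if you want to look models up by cluster runtime, record the runtime yourself, for example as a tag on the model version, and filter on that tag.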
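
One way to keep track of the runtime per model is to write it onto the model version when the model is registered, and then group models by that value when planning the batch scoring jobs. Below is a rough sketch, not an official Databricks recipe. The tag name "dbr_version" and the helper names are my own invention. It assumes the cluster exposes its runtime in the DATABRICKS_RUNTIME_VERSION environment variable and that mlflow is installed wherever tag_model_version is called.

```python
import os
from collections import defaultdict

RUNTIME_TAG = "dbr_version"  # hypothetical tag name, pick your own convention


def current_runtime(env=None):
    """Read the runtime version from the environment, or 'unknown' if unset."""
    env = os.environ if env is None else env
    return env.get("DATABRICKS_RUNTIME_VERSION", "unknown")


def tag_model_version(name, version):
    """Attach the current runtime to a registered model version as a tag."""
    # Imported lazily so the grouping helper below works without mlflow.
    from mlflow.tracking import MlflowClient

    MlflowClient().set_model_version_tag(name, version, RUNTIME_TAG, current_runtime())


def group_by_runtime(model_tags):
    """Group model names by recorded runtime.

    `model_tags` maps model name -> dict of tags for the version to score.
    Models without the tag end up under 'unknown'.
    """
    groups = defaultdict(list)
    for model_name, tags in model_tags.items():
        groups[tags.get(RUNTIME_TAG, "unknown")].append(model_name)
    return {runtime: sorted(names) for runtime, names in groups.items()}
```

Each group can then be scored by a separate job whose cluster uses the matching runtime. For the models you already have, the tag would have to be backfilled by hand or from whatever record you have of the clusters that trained them.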


Model storage requirements management
