------------------------------------------------------------------------------------------------------------------------------------------------------------
Machine learning (ML) projects rely on numerous interconnected components, including data processing, feature engineering, machine/deep learning, distributed computing, tracking, and model deployment. Each component is commonly associated with its own set of libraries and dependencies, which can change quickly, leading to conflicts and pipeline failures.
Effective dependency management is crucial for the reproducibility, scalability, and maintainability of ML projects. It ensures reliable training, testing, deployment, and monitoring of ML models across various environments. The Databricks Data Intelligence Platform offers tools and features for managing ML project dependencies and implementing Machine Learning Operations (MLOps) principles throughout the ML lifecycle.
In the following sections, we will explore how Databricks incorporates dependency management through its Runtime, cluster types, and scopes. We will also discuss how MLflow improves dependency management by tracking dependencies, leveraging Custom PyFunc Models and Unity Catalog for model versioning, and providing efficient access to pre-installed libraries, an approach that extends to Model Serving for Custom PyFunc Models with packaged dependencies.
Databricks offers two compute options: "Classic" (All-Purpose) and Serverless compute, each influencing workload suitability, control, scalability, manageability, and dependency configuration (see Table 1 in the Appendix for a comparison). The primary distinction lies in the balance between user convenience and control over the environment. "Classic" clusters provide extensive customization, making them ideal for intricate machine learning tasks demanding specific configurations. However, they require more active cluster management. Conversely, Serverless clusters prioritize ease of use and automatic scaling, suiting simpler workloads and ad hoc analyses but with limitations on customization, such as the installation of custom software or libraries. Selecting the appropriate cluster type hinges on the specific workload and the team's expertise. Teams with the necessary skills to manage clusters may prefer "Classic" clusters for their greater customization capabilities. In contrast, Serverless clusters offer simplicity and automatic scaling, trading some customization for increased convenience.
These differences in cluster types have implications for dependency management within the MLOps framework, which generally favors code deployment over model deployment to ensure consistency and reliability. While code deployment might not be optimal for complex Deep Learning and GenAI use cases (which may necessitate alternative MLOps approaches like LLMOps or scenarios where models are promoted across environments), the focus here remains on dependency management within a code deployment-centric MLOps workflow.
The type of cluster you choose determines how libraries are installed and distributed, as well as which scenarios they are best suited for. Notebook-scoped libraries are ideal for experimentation, development, and testing of new library versions, and are often best suited for individual data scientists. Cluster-scoped libraries are suited for production workflows and shared development environments on dedicated clusters, and are also used for standardized team processes. Serverless notebook environments are best for production jobs with variable load and cost-optimized workloads, and help simplify DevOps and maintenance. (Table 2 in the Appendix provides an in-depth overview of the different aspects of Notebook, Cluster, and Serverless library scopes.)
Databricks Runtime (DBR) versions 9.x and later have shifted Databricks' dependency management recommendations away from Conda environments and Docker/Databricks Container Services (DCS) for general ML workflows. This change is driven by Databricks' focus on consistency with tools like %pip and curated runtime ML environments, simplifying MLOps while maintaining reproducibility and governance.
While Databricks offers alternatives to Conda and Docker/DCS, namely cluster-scoped libraries and notebook-scoped %pip install (a minimal example is shown below), the shift may require some adjustment for users accustomed to those tools.
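For instance, a notebook-scoped installation with %pip only affects the current notebook session. The sketch below is a minimal illustration; the version pin and the requirements path on a UC Volume are placeholders:
%pip install scikit-learn==1.3.2
# Or install everything pinned in a requirements file stored on a UC Volume
%pip install -r /Volumes/catalog_name/schema_name/volume_name/requirements.txt
# Restart the Python process so the newly installed versions are picked up
dbutils.library.restartPython()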
Below, we provide guidelines and example code snippets of commonly used approaches in managing dependencies when developing ML projects and deploying models on the Databricks Platform.
Figure 1. A decision flow for managing dependencies on Databricks, based on workload type, scenarios, requirements, and available options.
The main recommended approach for managing ML libraries is at the cluster level through the Databricks UI, API, or infrastructure as code. For example (the more common use cases are marked with a # [common] comment):
# Example: Installing cluster-scoped libraries via the Databricks SDK/API
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import (
    Library,
    PythonPyPiLibrary,
    MavenLibrary,
    RCranLibrary,
)

w = WorkspaceClient()
cluster_id = "0123-456789-abcdef"

# Install libraries programmatically; each Library entry specifies the repository type,
# the package (with version where required), and optionally the repository to pull from
w.libraries.install(
    cluster_id=cluster_id,
    libraries=[
        # PyPI package from a custom index
        Library(pypi=PythonPyPiLibrary(package="numpy==required.version.number",
                                       repo="http://my-pypi-repo.com")),
        Library(pypi=PythonPyPiLibrary(package="scikit-learn==1.3.2")),  # [common]
        # Maven coordinates with exclusions and a custom repository
        Library(maven=MavenLibrary(coordinates="com.databricks:spark-csv_2.11:1.5.0",
                                   exclusions=["org.slf4j:slf4j-log4j12"],
                                   repo="http://my-maven-repo.com")),
        # CRAN package
        Library(cran=RCranLibrary(package="ggplot2",
                                  repo="http://cran.us.r-project.org")),
        # Workspace and Volumes URI paths are supported for requirements files
        Library(requirements="/Volumes/path/to/requirements.txt"),  # [common]
    ],
)
Where users experience long cluster start-up times associated with cluster-scoped libraries, the next section offers some useful ways to speed up library installation during cluster start-up.
Cluster-scoped libraries install much faster from pre-compiled code (e.g. Posit binaries or Python wheel .whl files) than from source builds, because they do not need to be compiled during installation, a step that can be time-consuming for libraries with complex dependencies or native code.
Compiling and/or writing binaries or wheel files to Unity Catalog Volumes can further speed up the installation of cluster-scoped libraries. Unity Catalog volumes enable users to interact with files as though they were local, due to FUSE support. This eliminates the need to first download from the library source and compile during installation, as the pre-compiled files on the volumes path can be referenced for cluster library installation via the UI or programmatically.
# Install pre-compiled binary/wheel file(s) on a cluster programmatically
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library, RCranLibrary

w = WorkspaceClient()
cluster_id = "0123-456789-abcdef"

# Substitute the following placeholders:
catalog_name = "catalog_name"
schema_name = "schema_name"
volume_name = "volume_name"
wheel_filename = "library-compiled.1.0.0-py3-none-any.whl"

w.libraries.install(
    cluster_id=cluster_id,
    libraries=[
        # Python wheel stored in a Unity Catalog Volume
        Library(whl=f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/{wheel_filename}"),
        # CRAN package served as a pre-built Linux binary from Posit Package Manager
        Library(cran=RCranLibrary(
            package="ggplot2",
            repo="https://packagemanager.posit.co/cran/__linux__/jammy/latest",
        )),
    ],
)
Furthermore, the pre-compiled wheel file(s) can also be installed as notebook-scoped libraries, e.g.:
# Install custom package from wheel file in UC Volume as notebook-scoped libraries
# Replace {placeholders} with actual values
%pip install /Volumes/{catalog_name}/{schema_name}/{volume_name}/model-#.#.#-py3-none-any.whl
# Or programmatically
import subprocess
subprocess.check_call([
"pip", "install",
f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/model-#.#.#-py3-none-any.whl"
])
# Example with actual values:
# %pip install /Volumes/ml_catalog/models/wheels/model-1.0.0-py3-none-any.whl
Additionally, pre-trained models, along with their corresponding weights and dependencies, can be compiled and packaged as wheel .whl files, which can be used with load_context() within the MLflow Custom PyFunc Models approach (refer to section 4.4.2.).
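As a hedged sketch of that pattern (not the exact packaging used in this article), a wheel stored on a UC Volume can be logged as a model artifact and installed inside load_context() before the packaged code is imported; my_model_pkg, load_pretrained, and the wheel path below are hypothetical placeholders:
import subprocess
import sys

import mlflow

class WheelBackedModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # MLflow copies the logged wheel alongside the model; install it at load time
        subprocess.check_call([sys.executable, "-m", "pip", "install",
                               context.artifacts["model_whl"]])
        from my_model_pkg import load_pretrained  # hypothetical package inside the wheel
        self.model = load_pretrained()

    def predict(self, context, model_input):
        return self.model.predict(model_input)

# The wheel path is a placeholder for a file compiled to a UC Volume
mlflow.pyfunc.log_model(
    artifact_path="wheel_backed_model",
    python_model=WheelBackedModel(),
    artifacts={"model_whl": "/Volumes/catalog_name/schema_name/volume_name/model-1.0.0-py3-none-any.whl"},
)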
For clusters with a large set of libraries that suffer from slow cluster startup times, pre-installing libraries in Unity Catalog Volumes can allow these libraries to be directly imported when the installation path is appended as notebook-scoped libraries via sys.path.append(). For example:
## Step 1: Create a directory on an external UC Volumes path (and/or an external mount point)
# First check that the libraries to be installed are not already available in the DBR
## Step 2: Pre-install the libraries to the external UC Volumes path (or PYTHON_LIB_PATH_MOUNTED)
import os
os.environ["UCvols_PYTHON_LIB_PATH"] = "/Volumes/catalog_name/schema_name/preinstalled_libs"
%sh pip install --upgrade pip
# Libraries not natively included in e.g. the ML runtime can be installed here;
# the installation target is the Unity Catalog Volumes path defined above and exposed via os.environ
%sh pip install --upgrade easydict --target=$UCvols_PYTHON_LIB_PATH
%sh pip install --upgrade torch-scatter --target=$UCvols_PYTHON_LIB_PATH --verbose
%sh pip install --upgrade torch-sparse --target=$UCvols_PYTHON_LIB_PATH --verbose
%sh pip install --upgrade torch-spline-conv --target=$UCvols_PYTHON_LIB_PATH --verbose
# --verbose helps surface installation errors and runtime/GPU package-version compatibility issues
# e.g. check the CUDA version before installing torch_geometric
%sh ls -l /usr/local | grep cuda
%sh pip install torch_geometric --target=$UCvols_PYTHON_LIB_PATH --verbose
# Pre-installation can take a while, but it is usually done once (and updated only when needed)
At this point in the above code snippet, the packages are installed to the UC Volume and not yet available to the notebook session. (Note that the installation time is roughly the same as it would take if directly installed from the notebook without a target path). We will need to append the UC Volume path to sys.path so that users can access them from the notebook.
The speedup is observed after the pre-installation to UC Volumes and it is best tested in another notebook with the following steps:
## Step 3: Use within notebooks by adding to sys.path
import sys
sys.path.append("/Volumes/catalog_name/schema_name/preinstalled_libs")
# check appended path:
sys.path
## Step 4: Import as notebook-scoped libraries whenever a cluster is ready
from torch_geometric.data import Data, InMemoryDataset, DataLoader
from torch_geometric.nn import NNConv, BatchNorm, EdgePooling, TopKPooling, global_add_pool
from torch_geometric.utils import get_laplacian, to_dense_adj
It is worth noting that notebook-scoped libraries are session-based, so the pre-installed library path must be re-appended in subsequent sessions. To work around this, the path-appending step can be placed in an .ipython profile startup file that is sym-linked to the default path ~/.ipython/profile_default/startup using an init.sh script. This adds the pre-installed libraries to the default .ipython profile during cluster initialization, making them accessible as soon as the cluster is ready and avoiding dependency download and compilation during typical cluster-scoped library installation. An example of this solution can be found here, and a minimal sketch of the startup file is shown below.
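This is a minimal sketch of such a startup file (the file name and Volumes path are placeholders); an init script would copy or symlink it into ~/.ipython/profile_default/startup/ during cluster initialization:
# ~/.ipython/profile_default/startup/00-preinstalled-libs.py
import sys

PREINSTALLED_LIBS = "/Volumes/catalog_name/schema_name/preinstalled_libs"
if PREINSTALLED_LIBS not in sys.path:
    sys.path.append(PREINSTALLED_LIBS)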
In addition to providing the governance, lineage visibility, and assets associated with ML projects, Unity Catalog also helps streamline artifact and dependency management with integrated MLflow. We can indeed track, log, and version data, dependencies, models, and artifacts in Unity Catalog.
For example, we can write data and environment YAML to Unity Catalog and subsequently use these paths as references in the MLflow model logging process. (Example code reference)
import mlflow
from mlflow.models import infer_signature
import pandas as pd
import pyspark.pandas as ps
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
import os
# Get the current user name (used to build the MLflow experiment path)
user_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()
# Write data as a delta table to Unity Catalog
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.rename(
columns={col: col.replace(' (cm)', '').replace(' ', '_') for col in iris_df.columns},
inplace=True
)
iris_df['species'] = iris.target
ps.from_pandas(iris_df).to_table(f"{catalog_name}.{schema_name}.iris", mode="overwrite") # table version could be specified during model logging
# Define the conda environment
conda_env = """
name: mlflow-env
channels:
- defaults
dependencies:
- python=3.8.5
- scikit-learn=0.24.1
- mlflow=2.9.2
- pip
- pip:
- mlflow
- pandas
- pyspark
"""
# Write the conda_env to a UC Volume for subsequent reference in model logging
conda_env_volume_path = f"/Volumes/{catalog_name}/{schema_name}/iris_rfclassifier/conda_env.yaml"
os.makedirs(os.path.dirname(conda_env_volume_path), exist_ok=True)
with open(conda_env_volume_path, "w") as f:
f.write(conda_env)
# Load the Unity Catalog table
dataset = mlflow.data.load_delta(table_name=f"{catalog_name}.{schema_name}.iris", version="0")
pd_df = dataset.df.toPandas()
X = pd_df.drop("species", axis=1)
y = pd_df["species"]
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Data versioning, model dependencies previously defined in conda_env, model parameters, evaluation metrics, and model updates can all be tracked, logged, versioned, as well as registered to Unity Catalog:
# Set registry to Unity Catalog
mlflow.set_registry_uri("databricks-uc")
# Set the experiment explicitly
experiment_path = f"/Users/{user_path}/mlflow_experiments/dependencies/iris_data_rfclassifier"
mlflow.set_experiment(experiment_path)
# Define model hyperparameters
params = {
"n_estimators": 5,
"random_state": 432,
"max_depth": 3,
"min_samples_split": 10,
"min_samples_leaf": 5,
"max_features": "log2",
"bootstrap": True
}
# Train a model, log input table, parameters, metrics etc.
with mlflow.start_run() as run:
    # Define and fit the model
    rfc = RandomForestClassifier(**params).fit(X_train, y_train)

    # Specify the required model input and output schema
    signature = infer_signature(X_train, rfc.predict(X_train))
    # Take the first row of the training dataset as the model input example
    input_example = X_train.iloc[[0]]

    # Log the input dataset and reference it as the 'training' dataset
    mlflow.log_input(dataset, "training")
    # Log the model parameters
    mlflow.log_params(params)

    ## Track model metrics with the experiment run for subsequent comparisons
    # Calculate and log training metrics
    train_predictions = rfc.predict(X_train)
    train_accuracy = accuracy_score(y_train, train_predictions)
    train_precision = precision_score(y_train, train_predictions, average='weighted')
    train_recall = recall_score(y_train, train_predictions, average='weighted')
    mlflow.log_metric("train_accuracy", train_accuracy)
    mlflow.log_metric("train_precision", train_precision)
    mlflow.log_metric("train_recall", train_recall)

    # Calculate and log test metrics
    test_predictions = rfc.predict(X_test)
    test_accuracy = accuracy_score(y_test, test_predictions)
    test_precision = precision_score(y_test, test_predictions, average='weighted')
    test_recall = recall_score(y_test, test_predictions, average='weighted')
    mlflow.log_metric("test_accuracy", test_accuracy)
    mlflow.log_metric("test_precision", test_precision)
    mlflow.log_metric("test_recall", test_recall)

    # Log the model and register it as a new version in UC
    mlflow.sklearn.log_model(
        sk_model=rfc,
        artifact_path="sklearn-rfclassifier-model",
        signature=signature,
        input_example=input_example,
        conda_env=conda_env_volume_path,
        registered_model_name=f"{catalog_name}.{schema_name}.iris_rfclassifier",
    )

# [Alternatively] Register the logged model outside of model logging
model_uri = f"runs:/{run.info.run_id}/sklearn-rfclassifier-model"
mv = mlflow.register_model(model_uri, f"{catalog_name}.{schema_name}.iris_rfclassifier")
Centralizing all related model development information allows for easy tracking of relevant assets and data sources for experiment runs, which simplifies debugging and comparisons (e.g. different data transforms, model hyperparameters, model type or flavor) where needed.
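For example, the dependencies recorded with a registered model version can be retrieved later for inspection or environment recreation. The sketch below assumes the registry URI is set to Unity Catalog as shown earlier, and the model name and version are placeholders:
import mlflow

mlflow.set_registry_uri("databricks-uc")

# Retrieve the pip requirements recorded with a registered model version (placeholders)
model_uri = f"models:/{catalog_name}.{schema_name}.iris_rfclassifier/1"
requirements_file = mlflow.pyfunc.get_model_dependencies(model_uri)
with open(requirements_file) as f:
    print(f.read())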
Deep learning and GenAI workloads often require specialized dependency management due to their complexity and resource requirements. Common approaches are highlighted below:
# Creating a GPU cluster with specialized DL libraries
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library, PythonPyPiLibrary

w = WorkspaceClient()

cluster_id = w.clusters.create(
    cluster_name="gpu-dl_or_genai-cluster",
    spark_version="14.3.x-gpu-ml-scala2.12",  # GPU-enabled ML runtime
    node_type_id="Standard_NC24ads_A100_v4",  # GPU instance
    num_workers=0,                            # Single-node for DL
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode"
    },
    custom_tags={
        "ResourceClass": "SingleNode"
    }
).result().cluster_id

# Install specialized libraries
w.libraries.install(
    cluster_id=cluster_id,
    libraries=[
        Library(pypi=PythonPyPiLibrary(package="torch==2.1.0")),
        Library(pypi=PythonPyPiLibrary(package="transformers==4.34.0")),
        Library(pypi=PythonPyPiLibrary(package="accelerate==0.23.0")),
        # ...
        Library(requirements="/Volumes/path/to/dl_or_genai_project/requirements.txt"),
    ],
)
The Faster-Library-Loads approach (noted previously in section 4.2.2.) can also be applicable if a long list of deep learning or GPU-related dependencies is needed.
When developing Deep learning and/or GenAI applications that require training and/or fine-tuning on custom data, users can leverage large transformer models like those available on Hugging Face. Example transformer models used in the Life Sciences include Evolutionary Scale Modeling (ESM) and Geneformer.
To leverage such models on the Databricks Platform, we recommend users wrap these as MLflow Custom Model PyFunc along with required dependencies, which allows these to be logged and registered as models in Unity Catalog:
# Custom PyFunc wrapper for a large transformer model
import mlflow
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, logging
import shutil
import pandas as pd
from mlflow.models.signature import infer_signature
# Set the logging level to ERROR to disable verbose messages
# logging.set_verbosity_error()
# Define a custom wrapper class for the transformer model
class ESMWrapper(mlflow.pyfunc.PythonModel):

    def load_context(self, context):
        """Load the model and tokenizer from the logged artifacts and set up the device (CPU or GPU)."""
        # Determine the device to use (GPU if available, otherwise CPU)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Load the tokenizer & model from the provided artifacts
        self.tokenizer = AutoTokenizer.from_pretrained(context.artifacts["tokenizer"])
        self.model = AutoModelForMaskedLM.from_pretrained(context.artifacts["model"])
        # Move the model to the appropriate device (GPU or CPU) and set it to evaluation mode
        self.model.to(self.device)
        self.model.eval()
        # Ensure the beginning-of-sequence (bos_token) and separator (sep_token) tokens are set;
        # if not, assign them the value of the cls_token (classification token)
        if self.tokenizer.bos_token is None:
            self.tokenizer.bos_token = self.tokenizer.cls_token
        if self.tokenizer.sep_token is None:
            self.tokenizer.sep_token = self.tokenizer.cls_token

    def predict(self, context, model_input):
        """Tokenize the input sequences, run them through the model, and return the embeddings."""
        protein_sequences = model_input["sequences"]
        results = []
        # Process each sequence
        for seq in protein_sequences:
            inputs = self.tokenizer(seq, return_tensors="pt")
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = self.model(**inputs, output_hidden_states=True)
            # Process outputs as needed for your application
            embeddings = outputs.hidden_states[-1].mean(dim=1).cpu().numpy()
            results.append(embeddings)
        return results
MLflow model logging allows specifying the environment dependencies:
# Log the model with explicit dependencies
with mlflow.start_run():
    ## Download and save the model components
    model_name = "facebook/esm2_t33_650M_UR50D"  # https://huggingface.co/facebook/esm2_t33_650M_UR50D
    # Paths (on a UC Volume) where the model and tokenizer will be saved
    model_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/tmp_model"
    tokenizer_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/tmp_tokenizer"
    # Optional: path to a requirements.txt saved earlier on the same Volume (placeholder)
    requirements_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/requirements.txt"
    # Download the pre-trained model and tokenizer from Hugging Face
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Save the model and tokenizer to the specified paths
    model.save_pretrained(model_path, safe_serialization=False)
    tokenizer.save_pretrained(tokenizer_path)

    # Define the conda env with the necessary dependencies
    conda_env = {
        "channels": ["defaults", "conda-forge", "pytorch"],
        "dependencies": [
            "python=3.11",   # matches DBR 15.4 LTS ML
            "pip>=22.0.4",
            {"pip": [
                "torch==2.1.0",
                "transformers==4.34.0",
                "accelerate==0.23.0",
                "cloudpickle==3.1.1",   # matches DBR 15.4 LTS ML
            ]}
        ],
        "name": "esm_env"
    }

    # Create a sample input DataFrame to infer the input and output signature of the model
    sample_input = pd.DataFrame({"sequences": ["MKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFPDWQNYG"]})

    ## Initialize the wrapper and load the context manually for signature inference
    esm_wrapper = ESMWrapper()
    # Manually set the tokenizer and model for the wrapper
    esm_wrapper.tokenizer = tokenizer
    esm_wrapper.model = model
    # Determine the device (GPU or CPU) and move the model to it
    esm_wrapper.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    esm_wrapper.model.to(esm_wrapper.device)
    # Set the model to evaluation mode
    esm_wrapper.model.eval()
    # Use the wrapper to predict the output for the sample input
    sample_output = esm_wrapper.predict(None, sample_input)
    # Infer the input and output signature of the model using the sample input and output
    signature = infer_signature(sample_input, sample_output)

    # Log the model with MLflow, including the artifacts (model, tokenizer, requirements),
    # Conda environment, signature, and input example, and register it in UC
    mlflow.pyfunc.log_model(
        artifact_path="esm_model",
        python_model=ESMWrapper(),
        artifacts={
            "model": model_path,
            "tokenizer": tokenizer_path,
            "requirements": requirements_path
        },
        conda_env=conda_env,
        signature=signature,
        input_example=sample_input,
        registered_model_name=f"{catalog_name}.{schema_name}.esm_protein_model"
    )
When Custom PyFunc Models are UC-registered, you can further serve these models with their MLflow-packaged dependencies as endpoints for your applications. (Example code reference.)
# Deploying a large model to Model Serving with GPU
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Placeholders such as endpoint_name, general_model_name, registered_model_name,
# latest_model_version, workload_size, and workload_type are assumed to be defined earlier

# Define the full API request payload
endpoint_config = {
    "served_models": [
        {
            "name": general_model_name,            # name of this served model entry
            "model_name": registered_model_name,   # UC-registered model
            "model_version": latest_model_version,
            "workload_size": workload_size,        # defines concurrency: Small/Medium/Large
            "workload_type": workload_type,        # defines compute: GPU_SMALL/GPU_MEDIUM/GPU_LARGE
            "scale_to_zero_enabled": True
        }
    ],
    "traffic_config": {
        "routes": [
            {
                "served_model_name": general_model_name,
                "traffic_percentage": 100
            }
        ]
    },
    "auto_capture_config": {
        "catalog_name": catalog_name,
        "schema_name": schema_name,
        "enabled": True
    },
    "tags": [
        {"key": "project", "value": "esm_protein_model"}
    ]
}

# Create or update the endpoint
try:
    # Check if the endpoint already exists
    existing_endpoint = client.get_endpoint(endpoint_name)
    print(f"Endpoint {endpoint_name} exists, updating configuration...")
    client.update_endpoint(endpoint_name, endpoint_config)
except Exception as e:
    if "RESOURCE_DOES_NOT_EXIST" in str(e):
        print(f"Creating new endpoint {endpoint_name}...")
        client.create_endpoint(endpoint_name, endpoint_config)
    else:
        raise
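Once the endpoint is ready, it can be queried through the same deployments client. The sketch below assumes the endpoint_name placeholder above and the "sequences" input column expected by the wrapper:
# Query the serving endpoint (endpoint_name is assumed to be defined as above)
response = client.predict(
    endpoint=endpoint_name,
    inputs={"dataframe_records": [
        {"sequences": "MKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFPDWQNYG"}
    ]},
)
print(response)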
While not recommended for general use, there are specific scenarios where Conda and Docker still play important roles in the Databricks ecosystem.
As observed in Section 4.4.2., the Custom PyFunc Model's required dependencies are specified using conda_env. For MLflow applications that rely on Conda as the env_manager to capture environment dependencies, users should take note of the Anaconda licensing requirements.
# When logging models, MLflow captures the Conda environment
import sys

import mlflow

with mlflow.start_run():
    model = train_model()  # placeholder for your own training function

    # Define an explicit Conda environment for reproducibility
    conda_env = {
        "channels": ["defaults", "conda-forge"],
        "dependencies": [
            f"python={sys.version.split()[0]}",
            "scikit-learn=1.3.2",
            {"pip": ["xgboost==1.7.6"]}
        ]
    }

    # Log the model with the Conda environment and register it in UC
    mlflow.sklearn.log_model(
        model,
        "model",
        conda_env=conda_env,
        registered_model_name="catalog_name.schema_name.my_model"
    )
Docker containers remain valuable for specific scenarios:
# Example cluster config JSON: using a custom Docker container with specialized libraries on a standard cluster definition
{
  "cluster_name": "BioNemoDockerCluster",
  "spark_version": "14.3.x-scala2.12",
  "spark_conf": {"spark.databricks.unityCatalog.volumes.enabled": "true"},
  "aws_attributes": {"zone_id": "us-west-2c"},   // helps avoid capacity limits
  "node_type_id": "g5.12xlarge",                 // EC2 instance type (A10G GPU instance)
  "custom_tags": {"removeAfter": "yyyy-mm-dd"},
  "autotermination_minutes": 120,
  "enable_elastic_disk": true,                   // allow the cluster to dynamically increase disk space as needed
  "docker_image": {
    "url": "{docker_profile}/bionemo_dbx_v0_amd64:latest",   // image built with --platform amd64
    "basic_auth": {
      "username": "{{secrets/<scope>/docker_PAT_user}}",
      "password": "{{secrets/<scope>/docker_PAT_pw}}"
    }
  },
  "single_user_name": "{UUID_{groupname}_SP}",   // here we specify a group-level Service Principal
  "data_security_mode": "DATA_SECURITY_MODE_DEDICATED",
  "runtime_engine": "STANDARD",
  "kind": "CLASSIC_PREVIEW",
  "use_ml_runtime": false,
  "is_single_node": true,
  "num_workers": 0,
  "apply_policy_default_values": false
}
Note that the Docker image built and used on the Databricks cluster needs to include the relevant framework dependencies for the intended runtime; a reference is provided, e.g., for ubuntu-gpu-Docker. Because Docker images are not rolled out or maintained in step with each existing or new runtime, and backward compatibility is not guaranteed, developers are responsible for updating dependencies and troubleshooting locally before testing them on the platform.
Library conflicts can occasionally occur when custom Python libraries installed as workspace files (note the workspace file-size limits) are incompatible with versions included in the runtime. It helps to understand that Python library precedence is determined by the order in which paths are added to sys.path. When the import <library> command is run, libraries installed in the current Databricks Git folder take priority; notebooks outside Git folders add the current working directory after other libraries; and manually appended Workspace directories have the lowest priority. A quick way to check which copy of a library wins is sketched below.
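This is a minimal check (sklearn is just an example; substitute the library in conflict):
import sys

# The order of sys.path entries determines import precedence
for path in sys.path:
    print(path)

# Confirm which installation of the library was actually resolved
import sklearn
print(sklearn.__version__, sklearn.__file__)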
For automating the provisioning and configuration of Databricks environments, and ensuring consistency and repeatability, Databricks Terraform provider lets users manage Databricks infrastructure using Terraform, an Infrastructure as code (IaC) tool. For example:
# Terraform configuration for a cluster with dependencies
resource "databricks_cluster" "ml_training_cluster" {
  cluster_name            = "ml-training-cluster"
  spark_version           = "14.3.x-cpu-ml-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  autotermination_minutes = 20
  num_workers             = 2

  # Cluster libraries
  library {
    pypi {
      package = "scikit-learn==1.3.2"
    }
  }
  library {
    pypi {
      package = "mlflow==2.8.0"
    }
  }
  library {
    whl = "/Volumes/catalog_name/schema_name/wheels/custom_lib-1.0.0-py3-none-any.whl"
  }
}
Automation with Terraform also facilitates the management of dependencies across multiple environments, simplifies maintenance, and reduces the risk of configuration drift. (Additional examples are listed here.)
For complex Databricks ML projects, where multiple contributors and automation are essential and continuous integration and deployment (CI/CD) is required, an IaC approach using Databricks Asset Bundles (DABs) is a natural fit. For example:
example-repo/
|---- .github/
| |---- workflows/
| |---- deploy-dab.yml
|---- databricks.yml # This file specifies the complete DAB bundle definition
#-----------------------------------------------------------------------
# Note: Define your DAB bundle within the databricks.yml file with resources like:
# - jobs, pipelines, and workflows
# - notebooks and their locations
# - ML models and endpoints
# - cluster configurations
# - dependencies and libraries
# Without this definition, the CI/CD pipeline has nothing to deploy.
#-----------------------------------------------------------------------
# .github/workflows/deploy-dab.yml
name: Deploy DAB

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Deploy DAB
        run: databricks bundle deploy
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
Using DABs, Databricks resources like jobs, pipelines, notebooks, dependencies, etc. can be defined as source files. These files provide a complete project definition, including structure, testing, and deployment processes. This comprehensive approach simplifies collaboration throughout the project's development lifecycle. Example code snippet:
# databricks.yml for a DAB containing model training resources, e.g.
resources:
  jobs:
    train_model_job:
      name: ${bundle.target}-Training-Job
      job_clusters:
        - job_cluster_key: "training_cluster"
          new_cluster:
            spark_version: "14.3.x-cpu-ml-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4
            data_security_mode: "SINGLE_USER"
      tasks:
        - task_key: "train_model"
          job_cluster_key: "training_cluster"   # reference to the job cluster defined above
          notebook_task:
            notebook_path: "../training/notebooks/train.py"
            base_parameters:
              input_table_name: "${bundle.target}.project_schema_name.input_data"
          libraries:
            - whl: ./my-wheel-0.1.0.whl
            - whl: /Workspace/Shared/Libraries/my-wheel-0.0.1-py3-none-any.whl
            - whl: /Volumes/main/default/my-volume/my-wheel-0.1.0.whl
As discussed in the previous sections, effective dependency management is essential for reliable MLOps workflows. Figure 2 visually represents the dependency management process throughout the MLOps lifecycle, from development to production deployment. Based on the guidance discussed, the following summary outlines best practices for managing dependencies on Databricks.
Figure 2. Dependency management within the broader MLOps lifecycle, illustrating how approaches evolve from development through production, and highlighting optimization techniques that can be applied throughout the process.
Through platform-native integrations with Unity Catalog and MLflow, Databricks dependency management aims to enhance governance, performance, and operational ease. Approaches like Faster-Library-Loads with volume-stored pre-installed packages can significantly improve performance when optimizing cluster startup with many libraries.
As workloads grow in complexity, especially with deep learning and GenAI applications, dependency management requires additional considerations for GPU support, model size, and specialized libraries. Using infrastructure as code with Terraform and DABs can help ensure that these dependencies are consistently applied across environments in your MLOps workflows.
------------------------------------------------------------------------------------------------------------------------------------------------------------
Table 1. Comparison of “Classic” (All-Purpose) and Serverless compute.
Aspect | “Classic” / All-Purpose (Standard / ML Runtime) | Serverless
Cluster Management | Requires user setup and management, including configuring hardware and software, provisioning resources, and managing scaling. | Fully managed. Databricks handles provisioning, scaling, and infrastructure management automatically, freeing users from operational overhead.
Resource Customization | Allows fine-grained control over computing resources (e.g., node sizes, instance types, and configurations). Suitable for users who need specific hardware/software setups. | Limited customization of resources—users cannot configure individual node types or resource settings in detail. Focuses on simplicity and abstraction.
Scalability | Requires explicit configuration to scale nodes/processes, which may require manual intervention or automation via scripting. | Automatically scales resources up or down based on workload, with minimal user intervention. Enables pay-as-you-go scalability.
Cost Efficiency | May be less cost-efficient for intermittent or unpredictable workloads because resources remain reserved even when idle. | More cost-efficient for variable and unpredictable workloads due to automatic scaling and serverless architecture—pay for resources only while they’re in use.
Ease of Use | Requires more expertise to set up, configure, and optimize, particularly for advanced ML workloads or unusual hardware/software requirements. | Designed for ease of use—ideal for users/team members without extensive expertise in cluster management or infrastructure workflows.
Performance for Complex Workloads | Tailored for complex workloads, long-running jobs, or specialized environments that require specific library versions, GPU instances, or high-compute nodes. | Optimized for simpler, on-demand workloads such as short-running jobs, lightweight data processing, or proof-of-concept pipelines.
Typical Use Case / Scenario | Processing pipelines that require specific software versions or libraries. Machine/deep learning tasks that benefit from specialized hardware acceleration, e.g. CUDA. | Ad-hoc / on-demand workloads needing minimal configuration and short-lived runtimes, e.g. experimentation with data and code without long-term resource commitment.
Dependency Management | Provides full control over dependency installation and configuration via cluster-scoped libraries, with links to standard package repositories such as PyPI, CRAN/Posit, and Maven (including JAR/wheel files), as well as init scripts. | Automatically manages common dependencies; may require notebook-scoped libraries or additional configuration for custom dependencies. Given these limitations, Standard or ML Runtime clusters may be a better option for workloads requiring currently unsupported features (e.g., GPUs, ML Runtimes, or specific Spark functionalities).
Table 2. Comparison of Notebook-scoped, Cluster-scoped, and Serverless library scopes.
Aspect | Notebook-scoped Libraries | Cluster-scoped Libraries | Serverless Environment
Worker Distribution | Installed on-demand on workers executing the notebook. Each worker maintains its own copy. Distributed via Spark's internal mechanism. | Uniformly installed on all nodes during cluster startup. Guaranteed availability on all executors. Consistent across the entire cluster. | Pre-installed in the serverless compute environment. No user control over worker configuration. Managed by the Databricks platform.
Node Consistency | Potential inconsistencies if workers join/leave during execution. New workers need to install libraries when joining. | All nodes have identical library configurations. Auto-scaling nodes automatically receive the same libraries. Consistent environment across restarts. | Consistent across all serverless workers. Fully managed scaling with identical environments. No node-specific configurations possible.
Driver vs. Worker | Installed first on the driver node, then propagated to workers as needed. Potential for environment differences. | Identical setup on the driver and all workers. No discrepancies between nodes. Predictable behavior across the cluster. | Identical environment across all compute resources. Managed driver/worker configuration. No visibility into the underlying infrastructure.
User Isolation | Each user can have their own library versions. User A's libraries don't affect User B. This prevents "dependency hell" on shared clusters. | All users share the same libraries and versions. Changes affect all users. Potential conflicts between user requirements. | Each serverless job has an isolated environment. No cross-contamination between workloads. Fully isolated execution environments.
Collaboration | Dependencies must be explicitly documented. Different users may get different results. Requires explicit sharing of dependency info. | Consistent environment for all collaborators. Predictable behavior across users. Simplified sharing of notebooks. | Consistent behavior for all users accessing the same endpoint. Dependencies defined in job configuration. Simplified governance and standardization.
Permissions | Any user can install without admin privileges. Flexible for individual dependency management. No approval process needed. | Typically requires cluster admin privileges. Centralized control over available libraries. Can enforce organizational standards. | Requires permission to modify job definitions. Centralized management via workspace settings. Admin-controlled library whitelisting is possible.
Resource Usage | May duplicate libraries across notebooks. Higher memory usage with multiple versions. Installation impacts notebook startup time. | Single shared installation. More memory-efficient. No runtime installation overhead. | No resource overhead for library installation. An optimized environment with a minimal footprint. Pay only for actual compute time used.
Performance Impact | Installation during notebook execution. Can slow down initial notebook cells. May cause timeouts with complex dependencies. | Increases cluster startup time. Front-loads installation cost. Better runtime performance. | No installation overhead. Instant startup with pre-configured environment. Optimized for short-running jobs.
Library Management | Managed within notebook code. Simple pip commands. Easy to version control with the notebook. | Managed through the UI or API; requires a cluster restart to apply. Can be automated with infrastructure as code. | Defined in job configuration. Version-pinned in deployment specs. Supports wheel files, PyPI, Maven, etc.
Scaling Behavior | Libraries must be installed on new workers, which can slow down elastic scaling. Installation time increases with library complexity. | Libraries are pre-installed on all nodes. Consistent during autoscaling. Slower initial cluster startup. | Instant scaling with no library installation. Consistent performance during scale-out. Optimized for variable workloads.
Best For | Experimentation and development. Individual data scientists. Testing new library versions. | Production workflows on dedicated clusters. Shared development environments. Standardized team processes. | Production jobs with variable load. Cost-optimized workloads. Simplified DevOps and maintenance.