mmt
Databricks Employee

------------------------------------------------------------------------------------------------------------------------------------------------------------

When transitioning from developing ML locally to developing ML in the cloud on Databricks, one may find that dependency and environment management are handled quite differently. Tools like Conda and Docker, which are commonplace in local development, don't directly translate to Databricks' unified environment. This post aims to make that transition easier for those familiar with local development by sharing insights and guidelines.   

------------------------------------------------------------------------------------------------------------------------------------------------------------

1. The Role of Dependencies in MLOps

Machine learning (ML) projects rely on numerous interconnected components, including data processing, feature engineering, machine/deep learning, distributed computing, tracking, and model deployment. Each component is commonly associated with its own set of libraries and dependencies, which can change quickly, leading to conflicts and pipeline failures.

Effective dependency management is crucial for the reproducibility, scalability, and maintainability of ML projects. It ensures reliable training, testing, deployment, and monitoring of ML models across various environments. The Databricks Data Intelligence Platform offers tools and features for managing ML project dependencies and implementing Machine Learning Operations (MLOps) principles throughout the ML lifecycle. 

In the following sections, we will explore how Databricks incorporates dependency management through its Runtime, cluster types, and scopes. We will also discuss how MLflow improves dependency management by tracking dependencies, how Custom PyFunc Models and Unity Catalog support model versioning and efficient access to pre-installed libraries, and how this extends to Model Serving for Custom PyFunc Models with packaged dependencies.

 

2. Selecting the Optimal Cluster Type 

Databricks offers two compute options: "Classic" (All-Purpose) and Serverless compute, each influencing workload suitability, control, scalability, manageability, and dependency configuration (see Table 1 in the Appendix for a comparison). The primary distinction lies in the balance between user convenience and control over the environment. "Classic" clusters provide extensive customization, making them ideal for intricate machine learning tasks demanding specific configurations. However, they require more active cluster management. Conversely, Serverless clusters prioritize ease of use and automatic scaling, suiting simpler workloads and ad hoc analyses but with limitations on customization, such as the installation of custom software or libraries. Selecting the appropriate cluster type hinges on the specific workload and the team's expertise. Teams with the necessary skills to manage clusters may prefer "Classic" clusters for their greater customization capabilities. In contrast, Serverless clusters offer simplicity and automatic scaling, trading some customization for increased convenience.

These differences in cluster types have implications for dependency management within the MLOps framework, which generally favors code deployment over model deployment to ensure consistency and reliability. While code deployment might not be optimal for complex Deep Learning and GenAI use cases (which may necessitate alternative MLOps approaches like LLMOps or scenarios where models are promoted across environments), the focus here remains on dependency management within a code deployment-centric MLOps workflow.

The type of cluster you choose determines how libraries are installed and distributed, as well as which scenarios they are best suited for. Notebook-scoped Libraries are ideal for experimentation, development, and testing of new library versions, and are often best suited for individual data scientists. Cluster-scoped Libraries are suited for production workflows and shared development environments on dedicated clusters and are also used for standardized team processes. Serverless Notebook Environments are best for production jobs with variable load and cost-optimized workloads, and help to simplify DevOps and maintenance. (Table 2 in the Appendix provides an in-depth overview of the different aspects of Notebook, Cluster, and Serverless library scopes.)
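For example, a data scientist experimenting in a notebook can install a library for just that session with %pip; a minimal sketch (the package and version are illustrative):

# Notebook-scoped: the library is only available to this notebook's Python session
%pip install scikit-learn==1.3.2

# Restart the Python process so the newly installed version is picked up
dbutils.library.restartPython()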

[return to top of page]

 

3. Shift Toward Integrated Platform-Native MLOps

With Databricks Runtime (DBR) versions 9.x and later, Databricks' dependency management recommendations have shifted away from Conda environments and Docker/Databricks Container Services (DCS) for general ML workflows. This change is driven by Databricks' focus on consistency with tools like %pip and curated runtime ML environments, simplifying MLOps while maintaining reproducibility and governance.

Key factors influencing this shift include:

  • Complexity: Conda's dependency conflicts, installation times, and manual environment setup create operational overhead. Docker/DCS requires creating and maintaining custom container images, which may not align with Databricks' runtime optimizations.
  • Performance Impact: Docker and Conda can increase cluster startup times due to dependency resolution or large container downloads.
  • Native Databricks Solutions: Tools like Cluster Libraries, optimized Prebuilt ML Runtime, and Databricks Asset Bundles (DABs) offer faster, more scalable dependency management that integrates easily into production MLOps pipelines. They also provide better governance through integration with Unity Catalog and a consistent development experience across environments.

While Databricks offers alternatives to Conda and Docker/DCS with Cluster Libraries and %pip install, the shift may require adjustment for users familiar with these features.
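For instance, where a local workflow would recreate a Conda environment from an environment.yml, the Databricks-native equivalent is typically a pip requirements file installed per notebook or attached as a cluster library; a minimal sketch (the Volume path is a placeholder):

# Notebook-scoped install from a requirements file stored in a UC Volume
%pip install -r /Volumes/catalog_name/schema_name/volume_name/requirements.txt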

[return to top of page]

 

4. Platform-Native Approaches  

Below, we provide guidelines and example code snippets of commonly used approaches in managing dependencies when developing ML projects and deploying models on the Databricks Platform.


Figure 1. A decision flow for managing dependencies on Databricks, based on workload type, scenarios, requirements, and available options. 

[return to top of page]

4.1. Cluster-Scoped Library Management

The main recommended approach for managing ML libraries is at the cluster level, through the Databricks UI, API, or infrastructure as code. For example (the most common use cases are marked with #[common]# below):

# Example: Installing libraries via the Databricks API
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
cluster_id = "0123-456789-abcdef"

# Install libraries programmatically
w.libraries.install(
    cluster_id=cluster_id,
    # Different ways to install libraries/packages
    libraries=[
        # PyPI: package (with version) is required; "repo" points to a custom index if needed
        {"pypi": {"package": "numpy==required.version.number",
                  "repo": "http://my-pypi-repo.com"}},
        {"pypi": {"package": "scikit-learn==1.3.2"}},  #[common]#

        # Maven coordinates, optional exclusions, and a custom repository
        {"maven": {"coordinates": "com.databricks:spark-csv_2.11:1.5.0",
                   "exclusions": ["org.slf4j:slf4j-log4j12"],
                   "repo": "http://my-maven-repo.com"}},

        # CRAN package from a specified repository
        {"cran": {"package": "ggplot2",
                  "repo": "http://cran.us.r-project.org"}},

        # Workspace and Volumes URI paths are supported for requirements files
        {"requirements": "/Volumes/path/to/requirements.txt"},  #[common]#
    ]
)

If you experience long cluster start-up times with cluster-scoped libraries, the next section offers some useful ways to speed up library installation.

[return to top of page]

4.2. Efficient Library Management for Clusters

4.2.1. Use pre-compiled Binary Packages or Wheel .whl files where available

Cluster-scoped libraries install much faster from pre-compiled code (e.g. Posit binary packages or wheel .whl files) than from source builds, because nothing needs to be compiled during installation; compilation can be time-consuming for libraries with complex dependencies or native code.

Compiling and/or writing binaries or wheel files to Unity Catalog Volumes can further speed up the installation of cluster-scoped libraries. Unity Catalog volumes enable users to interact with files as though they were local, due to FUSE support. This eliminates the need to first download from the library source and compile during installation, as the pre-compiled files on the volumes path can be referenced for cluster library installation via the UI or programmatically.
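For example, wheels can be pre-built once into a Volume path and then referenced at install time; a sketch assuming a requirements file and a target Volume directory (both placeholders):

# Pre-build wheels for a set of dependencies directly into a UC Volume
%sh pip wheel -r /Volumes/catalog_name/schema_name/volume_name/requirements.txt \
    --wheel-dir /Volumes/catalog_name/schema_name/volume_name/wheels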

# Include Binary/Wheel file(s) programmatically to cluster 
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
cluster_id = "0123-456789-abcdef"

# Substitute the following placeholders:
catalog_name = "catalog_name"
schema_name = "schema_name"
volume_name = "volume_name"
wheel_filename = "library-compiled.1.0.0-py3-none-any.whl"

w.libraries.install(
    cluster_id=cluster_id,
    # Different ways to install binaries/wheel files
    libraries=[
        {"whl": f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/{wheel_filename}"},
        {"cran": {
            "package": "ggplot2",
            "repo": "https://packagemanager.posit.co/cran/__linux__/jammy/latest"
        }}
    ]
)

Furthermore, the pre-compiled wheel file(s) can also be installed as notebook-scoped libraries, e.g.: 

# Install custom package from wheel file in UC Volume as notebook-scoped libraries 
# Replace {placeholders} with actual values 
%pip install /Volumes/{catalog_name}/{schema_name}/{volume_name}/model-#.#.#-py3-none-any.whl  

# Or programmatically (installs into the current notebook's Python environment)
import sys
import subprocess
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/model-#.#.#-py3-none-any.whl"
])

# Example with actual values:
# %pip install /Volumes/ml_catalog/models/wheels/model-1.0.0-py3-none-any.whl

Additionally, pre-trained models, along with their corresponding weights and dependencies, can be compiled and packaged as wheel .whl files, which can be used with load_context() within the MLflow Custom PyFunc Models approach (refer to section 4.4.2.).
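As a sketch of that pattern, a model wheel stored in a Volume can be declared as a pip requirement when logging a Custom PyFunc model, so the logged model carries its own dependency (the wrapper class and paths below are hypothetical placeholders):

import mlflow

mlflow.pyfunc.log_model(
    artifact_path="custom_model",
    python_model=ModelWrapper(),  # hypothetical PyFunc wrapper; see section 4.4.2
    pip_requirements=[
        # a local wheel path is a valid pip requirement specifier
        "/Volumes/catalog_name/schema_name/volume_name/model-1.0.0-py3-none-any.whl",
    ],
)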

4.2.2. Faster-Library-Loads Approach

For clusters with a large set of libraries that suffer from slow startup times, pre-installing the libraries in Unity Catalog Volumes allows them to be imported directly once the installation path is appended to the notebook session via sys.path.append(). For example:

## Step 1: Create a directory on an external UC Volumes path, and/or as an external mount point
# Check first that the libraries to be installed are not already available in the DBR

## Step 2: Pre-install libraries to the external UC Volumes path OR PYTHON_LIB_PATH_MOUNTED
import os
os.environ["UCvols_PYTHON_LIB_PATH"] = "/Volumes/catalog_name/schema_name/preinstalled_libs"

%sh pip install --upgrade pip

# Libraries that are not natively included in (e.g.) the ML runtime can be installed here.
# The installation target is the Unity Catalog Volumes path defined above and exposed via os.environ.
%sh pip install --upgrade easydict --target=$UCvols_PYTHON_LIB_PATH
%sh pip install --upgrade torch-scatter --target=$UCvols_PYTHON_LIB_PATH --verbose
%sh pip install --upgrade torch-sparse --target=$UCvols_PYTHON_LIB_PATH --verbose
%sh pip install --upgrade torch-spline-conv --target=$UCvols_PYTHON_LIB_PATH --verbose
# --verbose mainly helps surface errors and runtime/GPU package-version compatibility issues during installation

# e.g. check the CUDA version before installing torch_geometric
%sh ls -l /usr/local | grep cuda
%sh pip install torch_geometric --target=$UCvols_PYTHON_LIB_PATH --verbose
# Pre-installation can take a while... but it is usually done once (or updated when needed)

At this point in the above code snippet, the packages are installed to the UC Volume and not yet available to the notebook session. (Note that the installation time is roughly the same as it would take if directly installed from the notebook without a target path). We will need to append the UC Volume path to sys.path so that users can access them from the notebook. 

The speedup is observed after the pre-installation to UC Volumes and it is best tested in another notebook with the following steps: 

## Step 3: Use within notebooks by adding to sys.path 
import sys
sys.path.append("/Volumes/catalog_name/schema_name/preinstalled_libs")

# check appended path:
sys.path

## Step 4: Import as notebook-scoped libraries whenever a cluster is ready
from torch_geometric.data import Data, InMemoryDataset, DataLoader
from torch_geometric.nn import NNConv, BatchNorm, EdgePooling, TopKPooling, global_add_pool
from torch_geometric.utils import get_laplacian, to_dense_adj

It is worth noting that notebook-scoped libraries are session-based, so the pre-installed library path must be re-appended in each new session. To work around this, the sys.path append can be placed in an IPython profile startup file that is sym-linked to the default path ~/.ipython/profile_default/startup by an init.sh script. The pre-installed libraries are then added to the default IPython profile during cluster initialization and are accessible as soon as the cluster is ready, avoiding the dependency download and compilation of a typical cluster-scoped library installation. An example of this solution can be found here.
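For reference, the startup file placed into the IPython profile by such an init script only needs to append the pre-installed path; a minimal sketch (the Volume path is a placeholder):

# e.g. ~/.ipython/profile_default/startup/00-preinstalled-libs.py,
# placed by an init script so every new Python session sees the pre-installed libraries
import sys
sys.path.append("/Volumes/catalog_name/schema_name/preinstalled_libs")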

[return to top of page]

4.3. Unity Catalog for Artifact Management

In addition to providing governance and lineage visibility for the assets associated with ML projects, Unity Catalog also helps streamline artifact and dependency management through its MLflow integration: data, dependencies, models, and artifacts can all be tracked, logged, and versioned in Unity Catalog.

For example, we can write data and environment YAML to Unity Catalog and subsequently use these paths as references in the MLflow model logging process. (Example code reference)

import mlflow
from mlflow.models import infer_signature
import pandas as pd
import pyspark.pandas as ps
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import accuracy_score, precision_score, recall_score
import os

# Get the current user name (used below to build the MLflow experiment path)
user_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()

# Write data as a delta table to Unity Catalog
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.rename(
  columns={col: col.replace(' (cm)', '').replace(' ', '_') for col in iris_df.columns},
  inplace=True
)
iris_df['species'] = iris.target
ps.from_pandas(iris_df).to_table(f"{catalog_name}.{schema_name}.iris", mode="overwrite") # table version could be specified during model logging

# Define the conda environment
conda_env = """
name: mlflow-env
channels:
  - defaults
dependencies:
  - python=3.8.5
  - scikit-learn=0.24.1
  - mlflow=2.9.2
  - pip
  - pip:
    - mlflow
    - pandas
    - pyspark
"""

# Write the conda_env to a UC Volume for subsequent reference in model logging
conda_env_volume_path = f"/Volumes/{catalog_name}/{schema_name}/iris_rfclassifier/conda_env.yaml"
os.makedirs(os.path.dirname(conda_env_volume_path), exist_ok=True)
with open(conda_env_volume_path, "w") as f:
    f.write(conda_env)

# Load the Unity Catalog table  
dataset = mlflow.data.load_delta(table_name=f"{catalog_name}.{schema_name}.iris", version="0")
pd_df = dataset.df.toPandas()
X = pd_df.drop("species", axis=1)
y = pd_df["species"]

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Data versioning, the model dependencies previously defined in conda_env, model parameters, evaluation metrics, and model updates can all be tracked, logged, versioned, and registered to Unity Catalog:

# Set registry to Unity Catalog  
mlflow.set_registry_uri("databricks-uc")

# Set the experiment explicitly
experiment_path = f"/Users/{user_path}/mlflow_experiments/dependencies/iris_data_rfclassifier"
mlflow.set_experiment(experiment_path)

# Define model hyperparameters 
params = {
    "n_estimators": 5, 
    "random_state": 432,
    "max_depth": 3,
    "min_samples_split": 10,
    "min_samples_leaf": 5,
    "max_features": "log2", 
    "bootstrap": True
}

# Train a model, log input table, parameters, metrics etc.
with mlflow.start_run() as run:
    # Define the model
    rfc = RandomForestClassifier(**params).fit(X_train, y_train)
    
    # Specify the required model input and output schema 
    signature = infer_signature(X_train, rfc.predict(X_train))
    # Take the first row of the training dataset as the model input example.
    input_example = X_train.iloc[[0]]

    # Log the input dataset and reference it as the 'training' dataset
    mlflow.log_input(dataset, "training")    
    # Log the model parameters 
    mlflow.log_params(params)
    
    ## Track model metrics with experiment run for subsequent comparisons 
    # Calculate and log training metrics
    train_predictions = rfc.predict(X_train)
    train_accuracy = accuracy_score(y_train, train_predictions)
    train_precision = precision_score(y_train, train_predictions, average='weighted')
    train_recall = recall_score(y_train, train_predictions, average='weighted')
    # Log the training metrics 
    mlflow.log_metric("train_accuracy", train_accuracy)
    mlflow.log_metric("train_precision", train_precision)
    mlflow.log_metric("train_recall", train_recall)
    
    # Calculate and log test metrics
    test_predictions = rfc.predict(X_test)
    test_accuracy = accuracy_score(y_test, test_predictions)
    test_precision = precision_score(y_test, test_predictions, average='weighted')
    test_recall = recall_score(y_test, test_predictions, average='weighted')
    # Log the test metrics
    mlflow.log_metric("test_accuracy", test_accuracy)
    mlflow.log_metric("test_precision", test_precision)
    mlflow.log_metric("test_recall", test_recall)
   
    # Log the model and register it as a new version in UC 
    mlflow.sklearn.log_model(
        sk_model=rfc,        
        artifact_path="sklearn-rfclassifier-model",  
        signature=signature, 
        input_example=input_example, 
        conda_env=conda_env_volume_path,
        registered_model_name="mmt_demos.dependencies.iris_rfclassifier",
    )

# [Alternatively] Register outside of model logging
model_uri = f"runs:/{run.info.run_id}/sklearn-rfclassifier-model"
mv = mlflow.register_model(model_uri,                
                           f"{catalog_name}.{schema_name}.iris_rfclassifier"
                           )

Centralizing all related model development information allows for easy tracking of relevant assets and data sources for experiment runs, which simplifies debugging and comparisons (e.g. different data transforms, model hyperparameters, model type or flavor) where needed.
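For example, a version registered in Unity Catalog can later be loaded back by its three-level name for validation or batch scoring; a minimal sketch (the version number is illustrative):

import mlflow

mlflow.set_registry_uri("databricks-uc")

# Load a specific registered version and score the held-out test set
loaded_model = mlflow.pyfunc.load_model(
    f"models:/{catalog_name}.{schema_name}.iris_rfclassifier/1"
)
test_predictions = loaded_model.predict(X_test)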

[return to top of page]

4.4. Deep Learning and GenAI Considerations

Deep learning and GenAI workloads often require specialized dependency management due to their complexity and resource requirements. Common approaches are highlighted below:

4.4.1. GPU-Enabled Runtime with Specialized Libraries

# Creating a GPU cluster with specialized DL libraries
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()

cluster_id = w.clusters.create(
    cluster_name="gpu-dl_or_genai-cluster",
    spark_version="14.3.x-gpu-ml-scala2.12",  # GPU-enabled ML runtime
    node_type_id="Standard_NC24ads_A100_v4", # GPU instance
    num_workers=0,  # Single-node for DL
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode"
    },
    custom_tags={
        "ResourceClass": "SingleNode"
    }
).result().cluster_id

# Install specialized libraries
w.libraries.install(
    cluster_id=cluster_id,
    libraries=[
        {"pypi": {"package": "torch==2.1.0"}},
        {"pypi": {"package": "transformers==4.34.0"}},
        {"pypi": {"package": "accelerate==0.23.0"}},
	  # ... 
        {"requirements": "/Volumes/path/to/dl_or_genai_project/requirements.txt"}
    ]
)

The Faster-Library-Loads approach (noted previously in section 4.2.2.) can also be applicable if a long list of deep learning or GPU-related dependencies is needed.  

4.4.2. Wrapping Large Models as MLflow Custom PyFunc 

When developing deep learning and/or GenAI applications that require training and/or fine-tuning on custom data, users can leverage large transformer models like those available on Hugging Face. Example transformer models used in the life sciences include Evolutionary Scale Modeling (ESM) and Geneformer.

To leverage such models on the Databricks Platform, we recommend wrapping them as MLflow Custom PyFunc models along with their required dependencies, which allows them to be logged and registered as models in Unity Catalog:

# Custom PyFunc wrapper for a large transformer model
import mlflow
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, logging
import shutil
import pandas as pd
from mlflow.models.signature import infer_signature

# Set the logging level to ERROR to disable verbose messages
# logging.set_verbosity_error()

# Define a Custom wrapper class for the transformer model
class ESMWrapper(mlflow.pyfunc.PythonModel):
    # load_context: Loads the model and tokenizer from the provided artifacts and sets up the device (CPU or GPU).
    def load_context(self, context):
        # Load the ESM model and tokenizer from the files provided via context.artifacts

        # Determine the device to use (GPU if available, otherwise CPU)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Load the tokenizer & model from the provided artifacts
        self.tokenizer = AutoTokenizer.from_pretrained(context.artifacts["tokenizer"])      
        self.model = AutoModelForMaskedLM.from_pretrained(context.artifacts["model"])
        # Move the model to the appropriate device (GPU or CPU)
        self.model.to(self.device)
        # Set the model to evaluation mode
        self.model.eval()

        # Ensure the beginning-of-sequence (bos_token) and separator (sep_token) tokens are set;
        # if not, assign them the value of the cls_token (classification token).
        if self.tokenizer.bos_token is None:
            self.tokenizer.bos_token = self.tokenizer.cls_token
        if self.tokenizer.sep_token is None:
            self.tokenizer.sep_token = self.tokenizer.cls_token
    
    # Define the predict function which takes input sequences, tokenizes them, runs them through the model, and returns the embeddings
    def predict(self, context, model_input):
        protein_sequences = model_input["sequences"]
        results = []
        
        # Process each sequence
        for seq in protein_sequences:
            inputs = self.tokenizer(seq, return_tensors="pt")
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            
            with torch.no_grad():
                outputs = self.model(**inputs, output_hidden_states=True)
                
            # Process outputs as needed for your application
            embeddings = outputs.hidden_states[-1].mean(dim=1).cpu().numpy()
            results.append(embeddings)
            
        return results

MLflow model logging allows specifying the environment dependencies explicitly:

# Log the model with explicit dependencies
with mlflow.start_run():
    # Download and Save Model Components:
    model_name = "facebook/esm2_t33_650M_UR50D" #https://huggingface.co/facebook/esm2_t33_650M_UR50D
    
    # Specifies the model name and paths to save the model and tokenizer.
    model_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/tmp_model"
    tokenizer_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/tmp_tokenizer"
    # Path to a requirements.txt stored in the Volume (referenced as an artifact below); adjust or remove as needed
    requirements_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/requirements.txt"
    
    # Download the pre-trained model and tokenizer from Hugging Face.
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Saves the model and tokenizer to the specified paths.
    model.save_pretrained(model_path, safe_serialization=False)
    tokenizer.save_pretrained(tokenizer_path)
    
    # Define conda env with necessary dependencies
    conda_env = {
        "channels": ["defaults", "conda-forge", "pytorch"],
        "dependencies": [
            "python=3.11", # compute 15.4LTSMLR 
            "pip>=22.0.4",
            {"pip": [
                "torch==2.1.0", 
                "transformers==4.34.0",
                "accelerate==0.23.0",
                "cloudpickle==3.1.1", #compute 15.4LTSMLR    
            ]}
        ],
        "name": "esm_env"
    }
    
    # Create a sample input DataFrame to infer the input and output signature of the model.
    sample_input = pd.DataFrame({"sequences": ["MKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFPDWQNYG"]})
    
    ## Initialize the wrapper and load the context manually for signature inference

    # Initialize an instance of the ESMWrapper class. 
    esm_wrapper = ESMWrapper()
    # Manually set the tokenizer and model for the wrapper.
    esm_wrapper.tokenizer = tokenizer
    esm_wrapper.model = model
    # Determine the device (GPU or CPU) and moves the model to the appropriate device.
    esm_wrapper.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    esm_wrapper.model.to(esm_wrapper.device)
    # Set the model to evaluation mode.
    esm_wrapper.model.eval()
    
    # Use the wrapper to predict the output for the sample input
    sample_output = esm_wrapper.predict(None, sample_input)
    # Infer the input and output signature of the model using the sample input and output
    signature = infer_signature(sample_input, sample_output)
    
    # Log the model with MLflow, including the artifacts (model and tokenizer paths), Conda environment, signature, and input example.
    mlflow.pyfunc.log_model(
        artifact_path="esm_model",
        python_model=ESMWrapper(),
        artifacts={
            "model": model_path,
            "tokenizer": tokenizer_path,
            "requirements": requirements_path  
        },
        conda_env=conda_env,
        signature=signature,
        input_example=sample_input,
        # Register the model with a specified name.
        registered_model_name=f"{catalog_name}.{schema_name}.esm_protein_model"
    )

4.4.3. Serving Large Models with Dependencies as Custom PyFunc Models

When Custom PyFunc Models are UC-registered, you can further serve these models with their MLflow-packaged dependencies as endpoints for your applications. (Example code reference.)

# Deploying a large model to Model Serving with GPU

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Define the full API request payload
# (assumes these placeholders are defined earlier: endpoint_name, general_model_name,
#  registered_model_name, latest_model_version, workload_size, workload_type,
#  catalog_name, schema_name)
endpoint_config = {
    "name": general_model_name,
    "served_models": [
        {                
            "model_name": registered_model_name,
            "model_version": latest_model_version,
            "workload_size": workload_size,  # defines concurrency: Small/Medium/Large
            "workload_type": workload_type,  # defines compute: GPU_SMALL/GPU_MEDIUM/GPU_LARGE
            "scale_to_zero_enabled": True
        }
    ],
    "traffic_config": {
        "routes": [
            {
                "served_model_name": general_model_name,
                "traffic_percentage": 100
            }
        ]
    },
    "auto_capture_config": {
        "catalog_name": catalog_name,
        "schema_name": schema_name,
        "enabled": True
    },
    "tags": {
        "project": "esm_protein_model",       
    }
}

# Create or update the endpoint
try:
    # Check if endpoint exists
    existing_endpoint = client.get_endpoint(endpoint_name)
    print(f"Endpoint {endpoint_name} exists, updating configuration...")
    client.update_endpoint_config(endpoint_name, endpoint_config)
except Exception as e:
    if "RESOURCE_DOES_NOT_EXIST" in str(e):
        print(f"Creating new endpoint {endpoint_name}...")
        client.create_endpoint(endpoint_name, endpoint_config)
    else:
        raise
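Once the endpoint is ready, the same deployments client can be used to query it; a minimal sketch, assuming the input format mirrors the sample input used at logging time:

# Query the serving endpoint with a protein sequence
response = client.predict(
    endpoint=endpoint_name,
    inputs={
        "dataframe_records": [
            {"sequences": "MKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFPDWQNYG"}
        ]
    },
)
print(response)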

[return to top of page]

4.5. When Conda and Docker Are Still Relevant

While not recommended for general use, there are specific scenarios where Conda and Docker still play important roles in the Databricks ecosystem.    

4.5.1. Conda with Databricks Managed MLflow

As observed in Section 4.4.2, the Custom PyFunc Model's required dependencies are specified using conda_env. For MLflow applications that rely on Conda as the env_manager to capture environment dependencies, users are reminded to take note of Anaconda's licensing requirements.

# When logging models, MLflow captures the Conda environment
import sys

with mlflow.start_run():
    model = train_model()  # placeholder for your model training function
    
    # Define explicit Conda environment for reproducibility
    conda_env = {
        "channels": ["defaults", "conda-forge"],
        "dependencies": [
            f"python={sys.version.split()[0]}",
            "scikit-learn=1.3.2",
            {"pip": ["xgboost==1.7.6"]}
        ]
    }
    
    # Log model with Conda environment
    mlflow.sklearn.log_model(
        model, 
        "model", 
        conda_env=conda_env,
        registered_model_name="catalog_name.schema_name.my_model"
    )

4.5.2. Docker for Specialized Use Cases

Docker containers remain valuable for specific scenarios:

  1. Specialized packages - For example, NVIDIA BioNeMo or other packages with complex system dependencies.
  2. Custom runtimes - When there is a need for system libraries not available in Databricks Runtime.
  3. External deployment - When models need to be deployed outside Databricks.

# Example Cluster Config JSON: Using a custom Docker container with specialized libraries on a standard cluster definition setup

{
  "cluster_name": "BioNemoDockerCluster",
  "spark_version": "14.3.x-scala2.12",
  "spark_conf": {"spark.databricks.unityCatalog.volumes.enabled": "true"},
  "aws_attributes": {"zone_id": "us-west-2c"}, //helps avoid capacity limits
  "node_type_id": "g5.12xlarge", //EC2 instance type to use (A10G GPU instance)
  "custom_tags": {"removeAfter": "yyyy-mm-dd"},
  "autotermination_minutes": 120,
  "enable_elastic_disk": true, //allow the cluster to dynamically increase disk space as needed

  "docker_image": {
    "url": "{docker_profile}/bionemo_dbx_v0_amd64:latest", //docker image built with --platform amd64
    "basic_auth": {"username": "{{secrets/<scope>/docker_PAT_user}}",
                   "password": "{{secrets/<scope>/docker_PAT_pw}}"}
  },

  "single_user_name": "{UUID_{groupname}_SP}", //here we specify a group-level Service Principal
  "data_security_mode": "DATA_SECURITY_MODE_DEDICATED",
  "runtime_engine": "STANDARD",
  "kind": "CLASSIC_PREVIEW",
  "use_ml_runtime": false,
  "is_single_node": true,
  "num_workers": 0,
  "apply_policy_default_values": false
}

Note that the Docker image built and used on the Databricks cluster will need to include the relevant framework dependencies for the intended runtime; a reference is provided, e.g., for ubuntu-gpu-Docker. Because Databricks does not roll out or maintain a Docker image for each existing or new runtime, and backward compatibility is not guaranteed, developers are responsible for updating dependencies and troubleshooting locally before testing them on Databricks.

[return to top of page]

4.6. When Custom Libraries Conflict with Runtime Libraries 

Library conflicts can occasionally occur when custom Python libraries installed as workspace files (note the workspace file-size limits) are incompatible with versions bundled in the runtime. It helps to understand that Python library precedence is determined by the order in which paths are added to sys.path for workspace files. When an import <library> command is run, libraries installed in the current Databricks Git folder take priority. Notebooks outside Git folders add the current working directory after other libraries are installed, while manually appended workspace directories have the lowest priority.
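A quick way to check which copy of a library actually wins is to inspect the resolved module location and the sys.path order; a minimal sketch (the package name is illustrative):

import sys
import sklearn

print(sklearn.__version__)   # version actually imported
print(sklearn.__file__)      # resolved location: Git folder, workspace path, or runtime
print(sys.path[:5])          # earlier entries take precedence at import time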

[return to top of page]

4.7. Infrastructure as Code (IaC) for Dependency Management

For automating the provisioning and configuration of Databricks environments and ensuring consistency and repeatability, the Databricks Terraform provider lets users manage Databricks infrastructure with Terraform, an infrastructure as code (IaC) tool. For example:

4.7.1. Terraform for Cluster and Library Management

# Terraform configuration for a cluster with dependencies
resource "databricks_cluster" "ml_training_cluster" {
  cluster_name            = "ml-training-cluster"
  spark_version           = "14.3.x-cpu-ml-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  autotermination_minutes = 20
  num_workers             = 2

  # Cluster libraries
  library {
    pypi {
      package = "scikit-learn==1.3.2"
    }
  }

  library {
    pypi {
      package = "mlflow==2.8.0"
    }
  }

  library {
    whl = "/Volumes/catalog_name/schema_name/wheels/custom_lib-1.0.0-py3-none-any.whl"
  }
}

Automation with Terraform also facilitates the management of dependencies across multiple environments, simplifies maintenance, and reduces the risk of configuration drift. (Additional examples are listed here.)

4.7.2. Databricks Asset Bundles (DABs)

You can also manage complex Databricks ML projects, where multiple contributors and automation are essential and continuous integration and deployment (CI/CD) is required, with an IaC approach by leveraging Databricks Asset Bundles (DABs). For example:

4.7.2.1. CI/CD with Databricks Asset Bundles (DABs) 

example-repo/
|---- .github/
|       |---- workflows/
|               |---- deploy-dab.yml
|---- databricks.yml  # This file specifies the complete DAB bundle definition

#-----------------------------------------------------------------------

# Note: Define your DAB bundle within the databricks.yml file with resources like:
# - jobs, pipelines, and workflows
# - notebooks and their locations
# - ML models and endpoints
# - cluster configurations
# - dependencies and libraries
# Without this definition, the CI/CD pipeline has nothing to deploy.

#-----------------------------------------------------------------------

# .github/workflows/deploy-dab.yml
name: Deploy DAB

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main
      
      - name: Deploy DAB
        run: databricks bundle deploy
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

4.7.2.2. Use DABs to package entire MLOps stacks and components with defined resources.

Using DABs, Databricks resources like jobs, pipelines, notebooks, dependencies, etc. can be defined as source files. These files provide a complete project definition, including structure, testing, and deployment processes. This comprehensive approach simplifies collaboration throughout the project's development lifecycle. An example code snippet:

# databricks.yml for a DAB containing model training resources, e.g.:

resources:
  jobs:
    train_model_job:
      name: ${bundle.target}-Training-Job
      job_clusters:
        - job_cluster_key: "training_cluster"
          new_cluster:
            spark_version: "14.3.x-cpu-ml-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4
            data_security_mode: "SINGLE_USER"
      tasks:
        - task_key: "train_model"
          job_cluster_key: "training_cluster"  # Reference to the job cluster defined above
          notebook_task:
            notebook_path: "../training/notebooks/train.py"
            base_parameters:
              input_table_name: "${bundle.target}.project_schema_name.input_data"
          libraries:
            - whl: ./my-wheel-0.1.0.whl
            - whl: /Workspace/Shared/Libraries/my-wheel-0.0.1-py3-none-any.whl
            - whl: /Volumes/main/default/my-volume/my-wheel-0.1.0.whl

[return to top of page]

            

5. Best Practices Summary

As discussed in the previous sections, effective dependency management is essential for reliable MLOps workflows. Figure 2 visually represents the dependency management process throughout the MLOps lifecycle, from development to production deployment. Based on the guidance discussed, the following summary outlines best practices for managing dependencies on Databricks.


Figure 2. Dependency management within the broader MLOps lifecycle, illustrating how approaches evolve from development through production, and highlighting optimization techniques that can be applied throughout the process.

 

5.1. Summary of Guidelines

  1. Use cluster-level library management as the primary approach for most workflows
  2. Consider pre-installed libraries in Volumes for large dependency sets to improve startup times
  3. Leverage Unity Catalog for storing and versioning model artifacts
  4. For deep learning/GenAI:
    • Use GPU-enabled runtimes
    • Where possible wrap large models with MLflow Custom PyFunc to control loading and inference
    • Consider model-specific optimization techniques
    • Use appropriate workload sizing for model serving
  5. When Conda & Docker are used:
    • Use MLflow's built-in dependency tracking rather than managing Conda environments directly
    • Reserve Docker containers for specialized packages or system dependencies not available in Databricks runtimes
  6. When custom libraries installed as workspace files conflict with default runtime libraries, remember that libraries in the current Databricks Git folder take precedence
  7. Use infrastructure as code (e.g. Terraform, DABs) to ensure reproducible environments
  8. Monitor and optimize:
    • Track cluster startup times
    • Regularly review and update dependencies (e.g. tools like Dependabot can help track dependency updates)
    • Test dependency changes with notebook-scoped libraries before rolling them out at the cluster and/or pipeline job level.

 

Key Takeaways

Through platform-native integrations with Unity Catalog and MLflow, Databricks dependency management aims to enhance governance, performance, and operational ease. Approaches like Faster-Library-Loads, with pre-installed packages stored in Volumes, can significantly reduce cluster startup times when many libraries are required. 

As workloads grow in complexity, especially with deep learning and GenAI applications, dependency management requires additional considerations for GPU support, model size, and specialized libraries. Using infrastructure as code with Terraform and DABs can help ensure that these dependencies are consistently applied across environments in your MLOps workflows. 

[return to top of page]

 

------------------------------------------------------------------------------------------------------------------------------------------------------------

Appendix

 

Table 1. Choosing Between “Classic” and Serverless Clusters

Cluster Management
  • “Classic” / All-Purpose (Standard | ML Runtime): Requires user setup and management, including configuring hardware and software, provisioning resources, and managing scaling.
  • Serverless: Fully managed. Databricks handles provisioning, scaling, and infrastructure management automatically, freeing users from operational overhead.

Resource Customization
  • “Classic”: Allows fine-grained control over computing resources (e.g., node sizes, instance types, and configurations). Suitable for users who need specific hardware/software setups.
  • Serverless: Limited customization of resources—users cannot configure individual node types or resource settings in detail. Focuses on simplicity and abstraction.

Scalability
  • “Classic”: Requires explicit configuration to scale nodes/processes, which may require manual intervention or automation via scripting.
  • Serverless: Automatically scales resources up or down based on workload, with minimal user intervention. Enables pay-as-you-go scalability.

Cost Efficiency
  • “Classic”: May be less cost-efficient for intermittent or unpredictable workloads because resources remain reserved even when idle.
  • Serverless: More cost-efficient for variable and unpredictable workloads due to automatic scaling and serverless architecture—pay for resources only while they’re in use.

Ease of Use
  • “Classic”: Requires more expertise to set up, configure, and optimize, particularly for advanced ML workloads or unusual hardware/software requirements.
  • Serverless: Designed for ease of use—ideal for users/team members without extensive expertise in cluster management or infrastructure workflows.

Performance for Complex Workloads
  • “Classic”: Tailored for complex workloads, long-running jobs, or specialized environments that require specific library versions, GPU instances, or high-compute nodes.
  • Serverless: Optimized for simpler, on-demand workloads such as short-running jobs, lightweight data processing, or proof-of-concept pipelines.

Typical Use Case / Scenario
  • “Classic”: Processing pipelines that require specific software versions or libraries; machine/deep learning tasks that benefit from specialized hardware acceleration, e.g. CUDA.
  • Serverless: Ad-hoc / on-demand workloads needing minimal configuration and short-lived runtime, e.g. experimentation with data and code without long-term resource commitment.

Dependency Management
  • “Classic”: Provides full control over dependency installation and configuration via cluster-scoped libraries, links to standard package repositories (such as PyPI, CRAN/Posit, and Maven), Jar/Wheel files, and init scripts.
  • Serverless: Automatically manages common dependencies; may require notebook-scoped libraries or additional configuration for custom dependencies. Given these limitations, Standard or ML Runtime clusters may be a better option for workloads requiring currently unsupported features (e.g., GPUs, ML Runtimes, or specific Spark functionalities).

[return to content or return to top of page]

 

Table 2.  Library Scopes in Databricks: A Comparison 

Worker Distribution
  • Notebook-Scoped: Installed on demand on workers executing the notebook. Each worker maintains its own copy. Distributed via Spark's internal mechanism.
  • Cluster-Scoped: Uniformly installed on all nodes during cluster startup. Guaranteed availability on all executors. Consistent across the entire cluster.
  • Serverless: Pre-installed in the serverless compute environment. No user control over worker configuration. Managed by the Databricks platform.

Node Consistency
  • Notebook-Scoped: Potential inconsistencies if workers join/leave during execution. New workers need to install libraries when joining.
  • Cluster-Scoped: All nodes have identical library configurations. Auto-scaling nodes automatically receive the same libraries. Consistent environment across restarts.
  • Serverless: Consistent across all serverless workers. Fully managed scaling with identical environments. No node-specific configurations possible.

Driver vs. Worker
  • Notebook-Scoped: Installed first on the driver node, then propagated to workers as needed. Potential for environment differences.
  • Cluster-Scoped: Identical setup on the driver and all workers. No discrepancies between nodes. Predictable behavior across the cluster.
  • Serverless: Identical environment across all compute resources. Managed driver/worker configuration. No visibility into the underlying infrastructure.

User Isolation
  • Notebook-Scoped: Each user can have their own library versions. User A's libraries don't affect User B. This prevents "dependency hell" on shared clusters.
  • Cluster-Scoped: All users share the same libraries and versions. Changes affect all users. Potential conflicts between user requirements.
  • Serverless: Each serverless job has an isolated environment. No cross-contamination between workloads. Fully isolated execution environments.

Collaboration
  • Notebook-Scoped: Dependencies must be explicitly documented. Different users may get different results. Requires explicit sharing of dependency info.
  • Cluster-Scoped: Consistent environment for all collaborators. Predictable behavior across users. Simplified sharing of notebooks.
  • Serverless: Consistent behavior for all users accessing the same endpoint. Dependencies defined in the job configuration. Simplified governance and standardization.

Permissions
  • Notebook-Scoped: Any user can install without admin privileges. Flexible for individual dependency management. No approval process needed.
  • Cluster-Scoped: Typically requires cluster admin privileges. Centralized control over available libraries. Can enforce organizational standards.
  • Serverless: Requires permission to modify job definitions. Centralized management via workspace settings. Admin-controlled library whitelisting is possible.

Resource Usage
  • Notebook-Scoped: May duplicate libraries across notebooks. Higher memory usage with multiple versions. Installation impacts notebook startup time.
  • Cluster-Scoped: Single shared installation. More memory-efficient. No runtime installation overhead.
  • Serverless: No resource overhead for library installation. An optimized environment with a minimal footprint. Pay only for actual compute time used.

Performance Impact
  • Notebook-Scoped: Installation during notebook execution. Can slow down initial notebook cells. May cause timeouts with complex dependencies.
  • Cluster-Scoped: Increases cluster startup time. Front-loads installation cost. Better runtime performance.
  • Serverless: No installation overhead. Instant startup with a pre-configured environment. Optimized for short-running jobs.

Library Management
  • Notebook-Scoped: Managed within notebook code. Simple pip commands. Easy to version control with the notebook.
  • Cluster-Scoped: Managed through the UI or API; requires a cluster restart to apply. Can be automated with infrastructure as code or defined in a job configuration.
  • Serverless: Version-pinned in deployment specs. Supports wheel files, PyPI, Maven, etc.

Scaling Behavior
  • Notebook-Scoped: Libraries must be installed on new workers, which can slow down elastic scaling. Installation time increases with library complexity.
  • Cluster-Scoped: Libraries are pre-installed on all nodes. Consistent during autoscaling. Slower initial cluster startup.
  • Serverless: Instant scaling with no library installation. Consistent performance during scale-out. Optimized for variable workloads.

Best For
  • Notebook-Scoped: Experimentation and development. Individual data scientists. Testing new library versions.
  • Cluster-Scoped: Production workflows on dedicated clusters. Shared development environments. Standardized team processes.
  • Serverless: Production jobs with variable load. Cost-optimized workloads. Simplified DevOps and maintenance.

[return to content or return to top of page]
