Securing Python Dependencies on Databricks Serverless with Unity Catalog Volumes

Authors:

  • Tim Dikland, Resident Solutions Architect, Databricks
  • Jeroen Meulemans, Solutions Architect, Databricks

Why dependency security matters

Attacks on package registries like PyPI, including typosquatting, dependency confusion, and malicious uploads, have become a common vector for supply-chain compromise. By smuggling malicious code into dependencies, these attacks can slip past developers and land in production environments.

Enterprises typically mitigate this risk by pulling packages from a private repository that proxies PyPI and enforces policies such as:

  • Vulnerability scanning
  • Hash-pinning for integrity
  • Strict index configuration

This way, dependencies are vetted before being used in production workloads.
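
As a rough sketch of what hash-pinning looks like in practice, pip's --require-hashes mode refuses to install anything whose digest does not match the pinned value. The versions and digests below are placeholders; in practice a tool such as pip-compile with --generate-hashes, or your private repository, would produce them:

# requirements.txt (placeholder versions and digests, for illustration only)
scikit-learn==1.5.1 \
    --hash=sha256:<digest-from-your-private-repo>
lightgbm==4.5.0 \
    --hash=sha256:<digest-from-your-private-repo>

# Install with hash verification enforced against the private index
pip install --require-hashes -r requirements.txt \
    --index-url https://<your-private-pypi-repo>/simple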

Package management in serverless environments without public internet access

On classic Databricks compute, customers often use VNet injection (Azure) or a customer-managed VPC (AWS) to allow access to their private enterprise package repository.

With serverless compute, Databricks manages the network. Accessing an enterprise-grade, third-party private repository typically requires private, secure connectivity to the customer network. Artifact repositories are not yet supported as first-class private endpoints, so you need to create a private link to a load balancer and handle connectivity to the artifact repository there (e.g. by deploying a reverse proxy). That extra setup introduces operational overhead.

A simpler alternative: Unity Catalog Volumes as a private repository

Instead of wiring networking paths and proxies, there’s a simpler approach:

  1. Publish vetted packages into a Unity Catalog Volume
  2. Install them directly from the volume in serverless compute

This avoids PrivateLink, reverse proxies, and self-managed compute, reducing both cost and operational effort.

Important caveat: this should not be seen as a long-term enterprise-wide architecture. Access control in Databricks is enforced at the volume level, not the package level. Centralizing a company-wide repository in a single volume can create governance challenges, particularly around access and auditability.

A more practical approach is to apply this pattern locally: a team, business unit, or specific use case curates its own volume rather than attempting to scale it into the enterprise-wide source of truth.

How to set it up

1. Publish vetted packages using classic compute

On a classic Databricks cluster, you can generate and store wheels for your dependencies in a Unity Catalog Volume. This creates a central, reusable package repository for your serverless workloads. For example, you might place wheels under:

/Volumes/catalog/schema/libs/mypackage-0.0.1-py3-none-any.whl

There are several ways to populate this location, depending on how you manage dependencies in your enterprise:

Build wheels from installed packages

%pip wheel databricks-openai scikit-learn lightgbm \
    -w /Volumes/catalog/schema/libs

Fetch packages from a private PyPI repository

%pip download scikit-learn lightgbm databricks-openai \
  --index-url https://<your-private-pypi-repo>/simple \
  --trusted-host <your-private-pypi-repo> \
  -d /Volumes/catalog/schema/libs

Download prebuilt multi-arch wheels from PyPI

%pip download scikit-learn lightgbm databricks-openai \
  --only-binary=:all: \
  --platform manylinux2014_x86_64 \
  --platform manylinux2014_aarch64 \
  -d /Volumes/catalog/schema/libs

After completing one or more of these steps, your Unity Volume contains vetted, prebuilt wheels ready to be consumed by serverless compute.
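
As a quick sanity check (a minimal sketch, assuming the volume path from the examples above), you can list the wheels now present in the volume from a notebook:

# List the vetted wheels published to the Unity Catalog Volume
files = dbutils.fs.ls("/Volumes/catalog/schema/libs")
wheel_names = sorted(f.name for f in files if f.name.endswith(".whl"))
for name in wheel_names:
    print(name)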

2. Consume vetted packages in serverless compute

On Databricks serverless, install only from your Unity Volume — without reaching out to PyPI:

%pip install --no-index --find-links=/Volumes/catalog/schema/libs \
    databricks-openai scikit-learn lightgbm

With --no-index, pip never contacts PyPI; only the vetted wheels you have published to the volume can be installed.
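
For tighter reproducibility, you can also pin exact versions in the same command. The version numbers below are illustrative and should match the wheels you actually published to the volume:

%pip install --no-index --find-links=/Volumes/catalog/schema/libs \
    databricks-openai==0.3.0 scikit-learn==1.5.1 lightgbm==4.5.0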

Important caveat: Python version alignment

When publishing and consuming wheels, ensure that Python versions match between classic and serverless runtimes. For example:

  • Databricks Runtime 17.2 (classic compute): Python 3.12.3 👉 Release notes
  • Databricks Serverless Runtime v4: Python 3.12.3 👉 Release notes

A mismatch may cause wheel installation to fail, particularly for wheels built against a specific Python version.
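
A simple way to verify alignment is to print the interpreter version on both sides, on the classic cluster that builds the wheels and on the serverless environment that installs them:

import sys

# Run this on both the publishing (classic) and consuming (serverless) compute;
# the major.minor version (e.g. 3.12) should match on both.
print(sys.version)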

Benefits of this approach

By adopting Unity Catalog Volumes as your package repository for serverless workloads, you unlock the following advantages:

  • ⚡ Faster startup: download your Python packages straight from a Unity Volume.
  • 🔄 Consistency: same vetted versions across all jobs & environments.
  • 💻 Cross-platform support: works for both x86 and ARM runtimes with the multi-arch wheels you cached.
  • 🔒 Reproducibility: deterministic builds and reduced supply-chain risk.

Conclusion

While traditional approaches rely on private PyPI repos and network plumbing, Unity Catalog Volumes provide a lightweight, Databricks-native alternative for securing Python dependencies in serverless compute.

By publishing vetted wheels once and installing them directly from a volume, you can reduce both supply-chain risk and operational burden while keeping your Databricks environments fast, consistent, and secure.