Attacks on package registries like PyPI, including typosquatting, dependency confusion, and malicious uploads, have become a common vector for supply-chain compromises. By smuggling malicious code into dependencies, these attacks can slip past developers and land in production environments.
Enterprises typically mitigate this risk by pulling packages from a private repository that proxies PyPI and enforces policies such as package allowlists, vulnerability scanning, and license checks.
This way, dependencies are vetted before being used in production workloads.
On classic Databricks compute, customers often use VNet injection (Azure) or a customer-managed VPC (AWS) to reach their private enterprise package repository.
With serverless compute, Databricks manages the network, so reaching an enterprise-grade third-party private repository typically requires private, secure connectivity back to the customer network. Artifact repositories are not yet supported as first-class private endpoints, so you have to create a private link to a load balancer and handle connectivity to the artifact repository behind it (for example, by deploying a reverse proxy). That extra setup introduces operational overhead.
Instead of wiring up networking paths and proxies, there is a simpler approach: publish vetted wheels from classic compute into a Unity Catalog Volume and install them on serverless compute directly from that volume.
This avoids PrivateLink, reverse proxies, and self-managed compute, reducing both cost and operational effort.
Important caveat: this should not be seen as a long-term enterprise-wide architecture. Access control in Databricks is enforced at the volume level, not the package level. Centralizing a company-wide repository in a single volume can create governance challenges, particularly around access and auditability.
A more practical approach is to apply this pattern locally: a team, business unit, or specific use case curates its own volume, rather than trying to scale it into the enterprise-wide source of truth.
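As a sketch of what that team-scoped setup could look like (catalog, schema, volume, and group names below are illustrative), Unity Catalog lets the curating team own write access while consumers only read:

# Illustrative names: a team-scoped volume, readable by the team,
# writable only by its curators. Consumers also need USE CATALOG and
# USE SCHEMA on the parent catalog and schema.
spark.sql("CREATE VOLUME IF NOT EXISTS main.ds_team.libs")
spark.sql("GRANT READ VOLUME ON VOLUME main.ds_team.libs TO `ds-team-users`")
spark.sql("GRANT WRITE VOLUME ON VOLUME main.ds_team.libs TO `ds-team-curators`")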
On a classic Databricks cluster, you can generate and store wheels for your dependencies in a Unity Catalog Volume. This creates a central, reusable package repository for your serverless workloads. For example, you might place wheels under:
/Volumes/catalog/schema/libs/mypackage-0.0.1-py3-none-any.whl
There are several ways to populate this location, depending on how you manage dependencies in your enterprise:
Build the wheels on a classic cluster and write them directly to the volume:
%pip wheel databricks-openai scikit-learn lightgbm \
-w /Volumes/catalog/schema/libs
Download prebuilt wheels from your private PyPI repository:
%pip download scikit-learn lightgbm databricks-openai \
--index-url https://<your-private-pypi-repo>/simple \
--trusted-host <your-private-pypi-repo> \
-d /Volumes/catalog/schema/libs
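If the private repository requires authentication, a common pattern is to keep the credential in a Databricks secret scope rather than in the notebook. A minimal sketch, assuming a scope named artifact-repo and a token-style credential (the username convention depends on your repository product):

import subprocess, sys

# Illustrative scope/key names; the repo URL placeholder matches the command above.
token = dbutils.secrets.get(scope="artifact-repo", key="pypi-token")
subprocess.run(
    [
        sys.executable, "-m", "pip", "download",
        "scikit-learn", "lightgbm", "databricks-openai",
        "--index-url", f"https://__token__:{token}@<your-private-pypi-repo>/simple",
        "-d", "/Volumes/catalog/schema/libs",
    ],
    check=True,
)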
Download prebuilt binary wheels for specific target platforms:
%pip download scikit-learn lightgbm databricks-openai \
--only-binary=:all: \
--platform manylinux2014_x86_64 \
--platform manylinux2014_aarch64 \
-d /Volumes/catalog/schema/libs
After completing one or more of these steps, your Unity Catalog Volume contains vetted, prebuilt wheels ready to be consumed by serverless compute.
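A quick way to confirm what has been published is to list the volume from a notebook, for example:

# List the wheels currently available in the volume
display(dbutils.fs.ls("/Volumes/catalog/schema/libs"))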
On Databricks serverless, install only from your Unity Catalog Volume, without reaching out to PyPI:
%pip install --no-index --find-links=/Volumes/catalog/schema/libs \
databricks-openai scikit-learn lightgbm
Because --no-index prevents pip from reaching any package index, only the vetted wheels you've published to the volume can be installed.
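A simple sanity check after installation is to import the packages and print their versions, for example:

# Confirm the libraries installed from the volume import cleanly
import sklearn, lightgbm
print(sklearn.__version__, lightgbm.__version__)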
Important caveat: Python version alignment
When publishing and consuming wheels, ensure that the Python versions of the classic runtime (where you build or download the wheels) and the serverless environment (where you install them) match. Compiled wheels are tagged for a specific CPython version (for example, cp311), so a mismatch can cause installation to fail.
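To check alignment, you can print the interpreter version in both environments, for example:

# Run on both the classic cluster (where wheels are produced) and the
# serverless environment (where they are installed); major.minor should match.
import sys
print(sys.version)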
By adopting Unity Catalog Volumes as your package repository for serverless workloads, you unlock several advantages: no private networking setup, installs that never reach out to PyPI, and access governed by Unity Catalog permissions.
While traditional approaches rely on private PyPI repos and network plumbing, Unity Catalog Volumes provide a lightweight, Databricks-native alternative for securing Python dependencies in serverless compute.
By publishing vetted wheels once and installing them directly from a volume, you can reduce both supply-chain risk and operational burden while keeping your Databricks environments fast, consistent, and secure.