dkushari
Databricks Employee

Unified governance and interoperability for unstructured data

Summary

  • Access unstructured data in Unity Catalog Volumes from any external tool or application using a new credential vending API that issues temporary, scoped credentials for volumes based on UC permissions
  • Eliminate manual IAM management — external tools read directly from governed volumes using temporary scoped credentials
  • Govern all data and AI assets, such as tables, features, and unstructured data, through Unity Catalog with consistent permissions enforced across clouds and engines

Databricks Unity Catalog (UC) is the industry’s only unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform. Unity Catalog provides a single source of truth for your organization’s data and AI assets, with open connectivity to any data source and any format, unified governance with detailed lineage tracking, and support for open sharing and collaboration.

In this blog, we explore how you can securely access your unstructured data, registered in Unity Catalog Volumes, from an external (non-Databricks) processing engine using UC’s open APIs.

Open source tools like Daft and DuckDB, or distributed computing frameworks like Ray, can access Unity Catalog Volumes directly — no need to provision or manage cloud credentials outside of Unity Catalog. Under the hood, Unity Catalog automatically ‘vends’ scoped, per-volume credentials, respecting the permissions already defined for each user.

Unity Catalog Credential Vending Simplifies External Access to Unstructured Data

AI workloads today run on more than just tables — they require images and videos for multimodal models, documents for RAG pipelines, and sensor data for IoT analytics. Volumes in Unity Catalog bring unstructured data under the same governance as your tables, models, and features, so all your AI assets live in a single unified catalog. With Volumes, enterprises can upload unstructured data to their lakehouse, set access permissions on the data, and process it downstream using agents and tools.

Unity Catalog already provides the Files API for accessing data in Volumes. The Files API is designed for Databricks-native tooling — SDKs and integrations that route requests through Databricks. For external engines like Daft, DuckDB, or Ray that lack native Databricks connector support and must read data directly from cloud storage, UC now supports credential vending: a mechanism that issues temporary, scoped cloud storage credentials to grant them governed access to Unity Catalog volumes.
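
For comparison, here is a minimal sketch of reading a Volume file through the Files API with the Databricks SDK; the volume path is a hypothetical placeholder, and the request is routed through Databricks rather than going directly to cloud storage:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Download a file from a UC Volume via the Files API (routed through Databricks)
resp = w.files.download("/Volumes/main/default/my_images/photo.jpg")
data = resp.contents.read()
print(f"Read {len(data)} bytes")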

Previously, reading volumes in cloud storage from external engines required manually provisioning and maintaining cloud credentials outside of Unity Catalog. As teams scale and more external tools need access, this becomes increasingly difficult to manage securely and consistently. Credential vending solves this by tying access directly to UC's existing permission model — no separate IAM setup, no credential sprawl.

Unity Catalog Credential Vending Open APIs Extend to Volumes

Unity Catalog Open APIs already provide secure, open access to structured data in tables for any engine; we are now extending this capability to unstructured data in volumes, enabling AI and agentic use cases on external engines. Volume credential vending provides a standard way for external compute to access data governed by Unity Catalog Volumes. An external client can request temporary, scoped credentials for a specific Volume, and Unity Catalog grants them based on the user's privileges.

How Credential Vending Works

The following diagram illustrates the credential vending flow:

  1. A Databricks admin defines privileges for a principal on the Unity Catalog Volume that governs the unstructured data.
  2. The principal queries the Unity Catalog credential vending API to access the Volume from an external query engine or tool.
  3. Unity Catalog validates that the principal has the permissions required to perform the requested action.
  4. If the permission check succeeds, Unity Catalog returns the storage path along with temporary, scoped credentials.
  5. The data processing tool or query engine uses the vended credentials to access the data directly in cloud storage.
  6. Results are returned to the principal.

A new Unity Catalog REST API endpoint (POST /api/2.0/unity-catalog/temporary-volume-credentials) has been introduced for generating temporary, scoped credentials for Volumes. The same operation is also available via the Databricks SDK.
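
For illustration, here is a minimal sketch of calling this endpoint directly with Python's requests library, assuming a bearer token in a DATABRICKS_TOKEN environment variable (an assumption, not part of the setup steps below) and the request fields used by the SDK call shown in the appendix:

import os
import requests

# volume_id can be looked up via the Volumes API or SDK (see the appendix)
payload = {"volume_id": "<your-volume-id>", "operation": "READ_VOLUME"}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/unity-catalog/temporary-volume-credentials",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # temporary, scoped credentials plus the storage path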

Open access to volumes is part of what makes Unity Catalog the AI-native catalog for the open lakehouse: your data stays governed in one place, but every tool in your AI stack can access it securely through open APIs.

[Diagram: credential vending flow for Unity Catalog Volumes]

See UC Volumes Credential Vending in Action

Let's take a look at a specific use case: An AI team may keep unstructured data, such as images, in Databricks Unity Catalog Volumes. These images are valuable but hard to use at scale without structure — unless they can be automatically classified, captioned, and organized.

Machine Learning (ML) engineers can analyze unstructured data at scale using Databricks compute — or access it directly from their local workstation using the Databricks CLI, SDKs, or the Volumes Files API. With Credential Vending, this same flexibility now extends to external tools of their choice, with credentials issued automatically, no manual IAM setup, and everything governed through Unity Catalog.

Engineers can build a pipeline using external tools like Daft — which has native Unity Catalog integration — to access images directly from the Unity Catalog Volume, and then use HuggingFace models to classify and caption them, producing labels and plain-language descriptions that make visual data searchable, reviewable, and audit-ready. This is useful for content operations, analytics, and compliance workflows — all without writing any credential management code.

The following steps walk through how to build this pipeline end-to-end.

Step 1: Set Up Your Environment

First, authenticate to your Databricks workspace and export the following environment variables:

export DATABRICKS_HOST="https://<your-workspace-url>"
databricks auth login --host "https://<your-workspace-url>"

# Unity Catalog and schema (required)
export UC_CATALOG=<Your UC Catalog>
export UC_SCHEMA=<Your UC Schema>

# Volume names (just the volume name, not the full catalog.schema.volume path)
export DATABRICKS_FILE_VOLUME_NAME=<Your volume name where tabular files are located>
export DATABRICKS_IMAGE_VOLUME_NAME=<Your volume name where image files are located>

# Comma-separated image filenames in the volume (required)
export IMAGE_FILENAMES=<Your comma-separated image files>

# AWS region for S3 access (default: us-east-1)
export AWS_REGION=<Your AWS Region>

Alternatively, you can use a .env file to set these variables.
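
If you go the .env route, a minimal sketch of loading it at the top of your script with python-dotenv (installed in the next step):

from dotenv import load_dotenv

# Populate os.environ from a .env file in the working directory
load_dotenv()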

Install the required Python packages:

pip install "databricks-sdk>=0.108.0" python-dotenv daft "transformers>=4.47.0" torch Pillow

Step 2: Connect to Unity Catalog and Download Images

Using Daft's native Unity Catalog Volume integration, you can connect to your workspace and download images directly from a Volume. Daft calls the credential vending API automatically behind the scenes.

import daft
from daft.unity_catalog import UnityCatalog
import os
from databricks.sdk import WorkspaceClient

# Authenticate using OAuth U2M via Databricks SDK
w = WorkspaceClient()
endpoint = w.config.host
token = w.config.oauth_token().access_token

# Connect to Unity Catalog
# Daft calls the credential vending API automatically
unity = UnityCatalog(endpoint=endpoint, token=token)

# Build file paths for images in the Volume
filenames = os.environ["IMAGE_FILENAMES"].split(",")
file_paths = [
    f"vol+dbfs:/Volumes/{os.environ['UC_CATALOG']}/{os.environ['UC_SCHEMA']}/{os.environ['DATABRICKS_IMAGE_VOLUME_NAME']}/{f}"
    for f in filenames
]

# Download files directly from the Volume
df = daft.from_pydict({"files": file_paths})
downloaded = df.select(
    df["files"],
    df["files"].download(
        io_config=unity.to_io_config()
    ).alias("content")
).to_pylist()

Step 3: Classify and Caption Images with HuggingFace

Once the images are downloaded, apply HuggingFace vision models locally to classify and caption each image:

from io import BytesIO
from PIL import Image
from transformers import pipeline

# Load classification and captioning models
classifier = pipeline(
    "image-classification",
    model="google/vit-base-patch16-224"
)
captioner = pipeline(
    "image-text-to-text",
    model="Salesforce/blip-image-captioning-base"
)

# Process each image
for row in downloaded:
    image = Image.open(BytesIO(row["content"]))

    # Get top-5 classification labels
    labels = classifier(image, top_k=5)

    # Generate a natural language caption
    caption = captioner(images=image, text="a picture of")

    print(f"File: {row['files']}")
    print(f"Caption: {caption}")
    print(f"Labels: {labels}")

Sample Output

The output is a clean summary for each image — top-5 classification labels with confidence scores and a generated caption, ready for search, content review, and reporting.

[Image: sample flower photo from the Volume, captioned and classified below]

Caption: "a flower with a blurry background"

Classification:
  chambered nautilus, pearly nautilus, nautilus  (16.47%)
  daisy                                         (8.51%)
  pot, flowerpot                                (1.55%)
  coil, spiral, volute, whorl, helix            (1.36%)
  ear, spike, capitulum                         (0.92%)

The entire workflow stays within the governance boundary of Unity Catalog. Credentials are short-lived and scoped to the specific Volume — no long-lived keys, no IAM role juggling, and no data duplication needed.

Prerequisites and Setup

To start using Volumes credential vending, ensure the following:

  1. Enable External Data Access on your metastore.
  2. Grant `EXTERNAL_USE_SCHEMA` on the schema containing the Volume you want to access from external compute (see the sketch after this list).
  3. Set up authentication — use Databricks U2M OAuth to grant access to Databricks resources.
  4. Install dependencies — Python 3.11+ with the packages listed in the example above.
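
As a sketch of step 2 using the Databricks SDK, assuming your SDK version exposes Privilege.EXTERNAL_USE_SCHEMA (the catalog, schema, and principal names below are placeholders); running the equivalent GRANT statement in SQL works as well:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import PermissionsChange, Privilege, SecurableType

w = WorkspaceClient()

# Allow the principal to request vended credentials for Volumes in this schema
w.grants.update(
    securable_type=SecurableType.SCHEMA,
    full_name="main.default",
    changes=[
        PermissionsChange(
            principal="ml-engineers@example.com",
            add=[Privilege.EXTERNAL_USE_SCHEMA],
        )
    ],
)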

For a complete working example, see the companion code repository, which includes all the scripts demonstrated in this blog.

Conclusion

Volumes Credential Vending brings to your unstructured data the same open, governed access model that Unity Catalog provides for tables. Whether you're building AI pipelines that classify images, RAG systems that index PDFs, or analytics workflows that process sensor data, you can now do it from any tool while keeping your data governed in one place.

As AI workloads grow more complex, the catalog that governs them needs to keep pace. Unity Catalog is open, built for AI, and setting the standard for how enterprises govern multimodal data.

We'd love to hear what you're building — reach out to your Databricks account team to share feedback.

Appendix

Under the Hood: The Credential Vending API

For developers who want to integrate directly, here's how the credential vending API works at a lower level. The get_temp_vol_cred.py module demonstrates the core workflow:

Retrieve Volume Metadata

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import VolumeOperation

def get_volume_info_by_name(w, volume_name):
    """Retrieve volume metadata including storage location."""
    info = w.volumes.read(name=volume_name)
    return info.volume_id, info.storage_location

Request Temporary Credentials

def get_temporary_volume_credentials(w, volume_id, operation=VolumeOperation.READ_VOLUME):
    """Request short-lived credentials for a specific Volume."""
    return w.temporary_volume_credentials.generate_temporary_volume_credentials(
        operation=operation,
        volume_id=volume_id,
    )
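
A short usage sketch tying the two helpers together (the three-level volume name is a placeholder); the creds and storage_location values it produces are what the engine examples below rely on:

w = WorkspaceClient()

# Fully qualified volume name: <catalog>.<schema>.<volume>
volume_id, storage_location = get_volume_info_by_name(w, "main.default.my_images")
creds = get_temporary_volume_credentials(w, volume_id)

# On AWS, the response carries temporary S3 credentials
print(storage_location, creds.aws_temp_credentials.access_key_id)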

These temporary credentials can then be used with any tool that supports cloud storage access — DuckDB, Ray, Pandas, or custom applications.
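
For instance, a minimal Pandas sketch reading a Parquet file from the Volume's storage location (requires s3fs; the file name is a placeholder):

import pandas as pd

aws = creds.aws_temp_credentials

# s3fs picks up the vended credentials through storage_options
pdf = pd.read_parquet(
    f"{storage_location}/sample.parquet",
    storage_options={
        "key": aws.access_key_id,
        "secret": aws.secret_access_key,
        "token": aws.session_token,
    },
)
print(pdf.head())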

Other Engines: DuckDB and Ray

The credential vending API is engine-agnostic. Here are examples with two other popular engines/frameworks:

Querying with DuckDB:

import duckdb

# "creds" is the response from get_temporary_volume_credentials() above
aws = creds.aws_temp_credentials

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute(f"""
    SET s3_region = '<your-aws-region>';
    SET s3_access_key_id = '{aws.access_key_id}';
    SET s3_secret_access_key = '{aws.secret_access_key}';
    SET s3_session_token = '{aws.session_token}';
""")

# Query parquet files directly from the Volume's storage
result = conn.execute(
    f"SELECT * FROM read_parquet('{storage_location}/*.parquet') LIMIT 10"
).fetchdf()

Processing with Ray:

import ray
import pyarrow.fs as pafs

# "creds" is the response from get_temporary_volume_credentials() above
aws = creds.aws_temp_credentials

# Create S3 filesystem with vended credentials
s3_fs = pafs.S3FileSystem(
    access_key=aws.access_key_id,
    secret_key=aws.secret_access_key,
    session_token=aws.session_token,
    region="<your-aws-region>"
)

# Read and process images at scale
# "s3_paths" is a list of image object URIs under the Volume's storage_location
dataset = ray.data.read_images(
    s3_paths,
    filesystem=s3_fs,
    include_paths=True
)
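
From here you can run transformations over the images with Ray Data; a small sketch computing per-image dimensions, assuming read_images' default "image" output column:

import numpy as np

# Compute per-image dimensions in parallel across the dataset
def image_dims(batch):
    batch["height"] = np.array([img.shape[0] for img in batch["image"]])
    batch["width"] = np.array([img.shape[1] for img in batch["image"]])
    return batch

print(dataset.map_batches(image_dims).take(5))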

All code for this blog is available in this GitHub repository.