Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Why is writing direct to Unity Catalog Volume slower than to Azure Blob Storage (xarray -> zarr)

songhan89
New Contributor

Hi,

I have some workloads where I need to export an xarray object to a Zarr store.

My UC Volume is backed by ADLS.

I ran a simple benchmark and found that writing to the UC Volume is considerably slower.

a) Using an fsspec ADLS store pointing to the same container that backs the UC Volume. Result: 42 s.

b) Treating the UC Volume as a LocalStore. Result: 93 s.

Does UC Volume support async I/O? I suspect this could be the reason for the slower performance.

 

 

 

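from glob import glob

import xarray as xr
import adlfs
import zarr
from zarr.storage import FsspecStore

# ABS_ACCOUNT_NAME and SILVER_SAS_TOKEN are defined elsewhere (storage account name and SAS token).
fs = adlfs.AzureBlobFileSystem(account_name=ABS_ACCOUNT_NAME, credential=SILVER_SAS_TOKEN, asynchronous=True)

# Sample GRIB files to combine into one dataset.
files = glob('./samples/N1S*01')

# cfgrib options: forecast fields on surface and pressure levels, opened as dask-chunked arrays.
args_cubed = {'engine': 'cfgrib',
    'filter_by_keys': {
        'dataType': 'fc',
        'typeOfLevel': ['surface', 'isobaricInhPa']
        },
    'chunks': {}
 }

def preprocess(ds):
    # Add 'time' and 'step' as dimensions so the files can be combined along them.
    return ds.expand_dims(['time', 'step'])

ds = xr.open_mfdataset(
    files,
    preprocess=preprocess,
    parallel=True,
    **args_cubed
)

# Load the data into memory before writing.
ds2 = ds.load()

# (a) Zarr store that talks to Azure Blob Storage directly through fsspec/adlfs.
store_azb = FsspecStore(fs, path='silver/nwp/azb_benchmark_v3.zarr')
# (b) Zarr store that treats the UC Volume path as a local directory.
store_uc = zarr.storage.LocalStore('/Volumes/mss-uc/silver/silver-volume/nwp/unity_catalog_benchmark_v3.zarr')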
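The snippet above stops before the timed writes, which presumably appeared in the attached screenshot. A minimal sketch of how such a timing could be taken is below. The helper name `time_write` and the `to_zarr(..., mode="w")` calls in the comments are assumptions for illustration, not taken from the post.

```python
import time
from typing import Callable


def time_write(label: str, write_fn: Callable[[], object]) -> float:
    """Run write_fn once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    write_fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f} s")
    return elapsed


# Stand-in workload so the sketch runs on its own.
demo_elapsed = time_write("demo", lambda: time.sleep(0.05))

# Assumed usage with the objects defined in the post (hypothetical calls):
# time_write("ADLS via fsspec", lambda: ds2.to_zarr(store_azb, mode="w"))
# time_write("UC Volume", lambda: ds2.to_zarr(store_uc, mode="w"))
```

Timing each store with the same in-memory dataset keeps the comparison limited to the write path.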

 

 

 

[Image: songhan89_0-1738517230323.png]

 



1 REPLY

mark_ott
Databricks Employee

Writing directly to a Unity Catalog (UC) Volume in Databricks is often slower than writing to the underlying Azure storage (Blob Storage / ADLS) through an fsspec-based store, especially for workloads that export xarray objects to Zarr. Other users have reported the same performance gap.

Key Reasons for Slower Writes to Unity Catalog Volumes

  • Lack of Async I/O Support: UC Volumes are commonly exposed as local POSIX-like filesystem paths (under /Volumes) within Databricks clusters. Writes through such a path are typically ordinary blocking file operations, whereas adlfs with asynchronous=True can keep many read/write requests to ADLS in flight at once, which gives higher throughput.

  • Databricks Internal Layering: When writing to a UC Volume, the actual I/O operation is mediated by Databricks, which mounts the storage and adds security, metadata, and audit tracking. This extra abstraction layer can introduce overhead compared with direct, parallel communication with the blob storage service through fsspec.

  • Optimization Differences: ADLS with fsspec is optimized for cloud-native parallel operations, while POSIX mounts (as provided by UC Volumes) can bottleneck on metadata updates, I/O scheduling, and caching. This matters most for workloads like Zarr writes, which generate a large number of small files.
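To illustrate why overlapping requests matters when a write consists of many small objects, here is a minimal simulation using only the standard library. It does not touch ADLS or a UC Volume. The chunk count and per-write latency are hypothetical values, and `asyncio.sleep` stands in for the storage round trip.

```python
import asyncio
import time

N_CHUNKS = 20      # hypothetical number of Zarr chunk objects
LATENCY_S = 0.01   # hypothetical per-write round-trip latency


async def put_chunk(i: int) -> int:
    # Stand-in for one chunk upload; the sleep models storage latency.
    await asyncio.sleep(LATENCY_S)
    return i


async def write_sequential() -> float:
    # One write at a time, as a blocking code path would behave.
    start = time.perf_counter()
    for i in range(N_CHUNKS):
        await put_chunk(i)
    return time.perf_counter() - start


async def write_concurrent() -> float:
    # All writes in flight at once, as an async store can do.
    start = time.perf_counter()
    await asyncio.gather(*(put_chunk(i) for i in range(N_CHUNKS)))
    return time.perf_counter() - start


sequential_s = asyncio.run(write_sequential())
concurrent_s = asyncio.run(write_concurrent())
print(f"sequential: {sequential_s:.3f} s, concurrent: {concurrent_s:.3f} s")
```

In the sequential case the total time grows with the number of chunks times the latency. In the concurrent case it stays close to a single round trip. Real stores add bandwidth limits and throttling, so the gap in practice is smaller than in this toy model.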

Practical Implications

  • When exporting large objects or many small ones (as Zarr does), using ADLS directly with fsspec is typically faster due to async capabilities and better parallelization.

  • UC Volumes are ideal for workloads requiring strong access controls, cataloging, and integration with Databricks' governance features, but they may sacrifice some I/O speed for those features.

  • If the POSIX interface of UC Volumes forces synchronous (blocking) I/O, write-heavy, highly parallel workloads will lag behind ADLS-based stores that support async operations.
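Blocking file writes can still be overlapped by issuing them from several threads. The sketch below copies a directory tree of small files with a thread pool. It is an illustration only: `copy_tree_parallel` is a hypothetical helper, the temporary directories stand in for a local staging area and a volume path, and whether this narrows the gap on a real UC Volume has not been measured here.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_tree_parallel(src: Path, dst: Path, max_workers: int = 16) -> int:
    """Copy every file under src to dst using a thread pool; return the file count."""
    files = [p for p in src.rglob("*") if p.is_file()]

    def copy_one(path: Path) -> None:
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(copy_one, files))
    return len(files)


# Demo: temporary directories stand in for local disk and the volume path.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "local.zarr"
    dst = Path(tmp) / "volume.zarr"
    for i in range(8):
        chunk = src / "var" / f"c{i}"
        chunk.parent.mkdir(parents=True, exist_ok=True)
        chunk.write_bytes(b"x" * 1024)
    copied = copy_tree_parallel(src, dst)
    copied_names = sorted(p.name for p in dst.rglob("*") if p.is_file())

print(copied, copied_names)
```

Threads suit this case because each copy spends most of its time waiting on I/O rather than running Python code.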

Conclusion

Async I/O support is a significant factor in write performance. Writing directly to Azure Blob Storage with adlfs uses async and parallel requests, which makes data export much faster than through UC Volumes, which rely on synchronous, POSIX-style write operations. This is reflected in your benchmarking results (42 s with ADLS fsspec vs. 93 s with UC Volume LocalStore). Until Databricks upgrades UC Volume capabilities or supports async I/O natively, the ADLS-fsspec method is likely to remain faster for high-throughput scenarios like writing Zarr from xarray.