Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Slow writes to managed volume

yashojha
New Contributor III

Hi All, 

I am using managed volumes as intermediate storage to write a decrypted file before moving it to data lake storage. Strangely, the write operation is taking a lot of time (22 minutes) to write a small file to the volume, while decrypting and moving the same file takes only a few seconds.

The same operation in the lower environment works perfectly fine. Can anyone please help me identify the issue?

Screenshots for reference: [attachments: yashojha_1-1771488017215.png, yashojha_3-1771488035100.png]

Thanks,
Yash Ojha
3 REPLIES

yashojha
New Contributor III

I am using DBR 16.4LTS

Thanks,
Yash Ojha

saurabh18cs
Honored Contributor III

Hi, why are you writing to an intermediary first and not directly to your external data lake storage, which is also blob storage? With a little more context I can say more. Is this for a single file copy or multiple files?

From a parallelization perspective, you can use Spark with a UDF. Create a dataframe with file paths as rows, then run a UDF that calls shutil.copy for each path (dbutils will not work within a UDF). That way the whole cluster is used to parallelize the file transfer (distributing CPU, disk, and network bandwidth usage).

For a single-threaded driver-side operation, either shutil or dbutils works. You can also do driver-side multi-threading with asyncio, but you will be bounded by the driver node's capacity (plus network capacity).
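A minimal sketch of the driver-side multi-threaded variant mentioned above, using a thread pool rather than asyncio (shutil.copy is blocking, so concurrent.futures is the simpler fit). The paths in the usage comment are placeholders, not actual paths from this thread:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_files(pairs, max_workers=8):
    """Copy (src, dst) path pairs in parallel threads on the driver.

    Threads help here because shutil.copy spends most of its time
    blocked on disk/network I/O, not holding the GIL.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces evaluation so any copy error is raised here.
        list(pool.map(lambda p: shutil.copy(p[0], p[1]), pairs))

# Usage (placeholder paths):
# copy_files([("/local_disk0/a.dat", "/Volumes/cat/sch/vol/a.dat"), ...])
```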

@yashojha

SteveOstrowski
Databricks Employee

Hi @yashojha,

Thanks for the detailed writeup. A 22-minute write for a small file to a managed volume is definitely not expected behavior, especially when it works quickly in your lower environment. Let me walk through what is likely happening and how to troubleshoot and resolve it.


UNDERSTANDING THE BOTTLENECK

Unity Catalog managed volumes use a FUSE (Filesystem in Userspace) layer on the driver node that translates standard file system calls (open, write, close) into cloud object storage API calls under the hood. When you write a file to /Volumes/catalog/schema/volume/path, each write operation goes through this FUSE translation layer to the underlying cloud storage (Azure Blob/ADLS, S3, or GCS depending on your cloud provider).

The FUSE layer adds some overhead compared to writing to local disk, but 22 minutes for a small file indicates something beyond normal FUSE overhead. Since it works fine in your lower environment, the difference is almost certainly in the infrastructure or network configuration between the two environments.
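To see how per-call overhead can stretch into minutes, here is a back-of-envelope sketch. The chunk size and per-call latency below are illustrative assumptions, not measurements from this environment:

```python
# Back-of-envelope estimate: if each small write through FUSE becomes a
# separate storage round trip, per-call latency dominates total write time.
chunk_size = 4 * 1024             # hypothetical 4 KB write chunks
file_size = 10 * 1024 * 1024      # hypothetical 10 MB file
calls = file_size // chunk_size   # number of round trips
per_call_latency_s = 0.5          # e.g. via a firewall/inspection hop
total_s = calls * per_call_latency_s
print(f"{calls} calls -> ~{total_s / 60:.0f} minutes")  # → 2560 calls -> ~21 minutes
```

With numbers in this ballpark, a "small" file write lands right in the 20-minute range observed here, which is why the write pattern and the network path both matter.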


INVESTIGATION STEPS

1. CHECK NETWORK AND STORAGE CONFIGURATION

This is the most likely cause. Compare these between your lower environment and the problematic one:

- Is the production workspace using a VNet/VPC with restrictive firewall rules or a private endpoint for storage?
- Is the managed storage account behind a firewall or private link that introduces routing latency?
- Are there NSG (Network Security Group) rules or route tables that force storage traffic through a firewall appliance or inspection layer?
- Is DNS resolution for the storage endpoint going through a custom DNS that may be slow?
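One quick way to rule the last item in or out is to time name resolution from the driver itself. The hostname in the usage comment is a placeholder for your storage endpoint:

```python
import socket
import time


def time_dns_lookup(hostname: str, port: int = 443) -> float:
    """Return the seconds taken to resolve hostname via the configured DNS."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, port)
    return time.perf_counter() - start

# Usage (replace with your endpoint, e.g. "<account>.dfs.core.windows.net"):
# print(f"DNS lookup: {time_dns_lookup('myaccount.dfs.core.windows.net'):.3f}s")
```

A healthy lookup should take milliseconds; repeated lookups in the hundreds of milliseconds or worse point at a slow custom DNS path.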

To check where your managed volume data is stored, run:

DESCRIBE SCHEMA EXTENDED your_catalog.your_schema;

Look at the "Managed Location" in the output. Then verify that the cluster has efficient, direct network connectivity to that storage account.
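One way to spot-check that connectivity from a notebook is to time a TCP handshake to the storage endpoint; a slow handshake suggests the traffic is being routed through a firewall or inspection hop. The hostname is again a placeholder:

```python
import socket
import time


def time_tcp_connect(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Return the seconds taken to open (and close) a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.perf_counter() - start

# Usage (replace with your storage endpoint's hostname):
# print(f"TCP connect: {time_tcp_connect('myaccount.dfs.core.windows.net'):.3f}s")
```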

2. IDENTIFY THE WRITE METHOD

How you write the file matters significantly. If your code writes the decrypted file using Python open() with many small write calls, each write may become an individual API call through FUSE. This is much slower than a single bulk write.

For example, this pattern is slow:

with open("/Volumes/catalog/schema/vol/file.dat", "wb") as f:
    for chunk in decrypt_stream(encrypted_data):
        f.write(chunk)  # Each small write goes through FUSE

3. CHECK DRIVER NODE RESOURCES

If the driver node is undersized or under memory pressure, FUSE operations can slow down. Check the Spark UI metrics tab during the write to see if the driver is resource-constrained.

4. RUN A DIAGNOSTIC TEST

Run this quick test to isolate whether the issue is FUSE/storage or your decryption code:

import time

# Test 1: Write a simple test file to the volume
test_data = b"x" * (10 * 1024 * 1024) # 10 MB of test data

start = time.time()
with open("/Volumes/your_catalog/your_schema/your_volume/test_file.bin", "wb") as f:
    f.write(test_data)
elapsed_volume = time.time() - start
print(f"Volume write: {elapsed_volume:.2f} seconds")

# Test 2: Write to local ephemeral disk for comparison
start = time.time()
with open("/local_disk0/test_file.bin", "wb") as f:
    f.write(test_data)
elapsed_local = time.time() - start
print(f"Local disk write: {elapsed_local:.2f} seconds")

# Test 3: Copy from local disk to volume using dbutils
start = time.time()
dbutils.fs.cp("file:/local_disk0/test_file.bin", "/Volumes/your_catalog/your_schema/your_volume/test_file2.bin")
elapsed_copy = time.time() - start
print(f"dbutils.fs.cp to volume: {elapsed_copy:.2f} seconds")

If Test 1 is slow but Test 3 is fast, the issue is with how the FUSE layer handles the write pattern. If both are slow, the issue is network/storage connectivity.


RECOMMENDED SOLUTIONS

OPTION A: WRITE TO LOCAL DISK FIRST, THEN COPY (QUICK FIX)

This is the simplest approach and often the fastest. Write your decrypted file to the driver's local ephemeral storage first, then use dbutils.fs.cp to move it to the volume in a single optimized transfer:

import os

# Step 1: Decrypt to local ephemeral disk (fast, no FUSE overhead)
local_path = "/local_disk0/tmp/decrypted_file.dat"
os.makedirs(os.path.dirname(local_path), exist_ok=True)
with open(local_path, "wb") as f:
    f.write(decrypted_data)

# Step 2: Copy to volume in one bulk operation
volume_path = "/Volumes/your_catalog/your_schema/your_volume/decrypted_file.dat"
dbutils.fs.cp(f"file:{local_path}", volume_path)

OPTION B: WRITE DIRECTLY TO CLOUD STORAGE (SKIP VOLUME AS INTERMEDIATE)

Since you mention the volume is just intermediate storage before moving to your data lake, consider writing directly to your final destination using dbutils.fs or the cloud SDK. This eliminates the intermediate step entirely.

OPTION C: USE LARGER WRITE BUFFERS

If you must write through FUSE, use larger buffer sizes to reduce the number of individual API calls:

import io

# Buffer in memory, then write in one shot
buffer = io.BytesIO()
buffer.write(decrypted_data)

with open("/Volumes/catalog/schema/vol/file.dat", "wb") as f:
    f.write(buffer.getvalue())
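To make the difference concrete, here is a small self-contained comparison of the two write patterns. It uses a local temp directory so it runs anywhere; on a FUSE-mounted volume the gap between the two would be far larger, since each small write can become a storage round trip:

```python
import os
import tempfile
import time


def write_chunked(path, chunks):
    """Many small writes (the slow pattern through FUSE)."""
    with open(path, "wb") as f:
        for c in chunks:
            f.write(c)


def write_bulk(path, chunks):
    """Join chunks in memory, then write once (the fast pattern)."""
    with open(path, "wb") as f:
        f.write(b"".join(chunks))


chunks = [b"x" * 4096 for _ in range(256)]  # 1 MB in 4 KB pieces
with tempfile.TemporaryDirectory() as d:
    for fn in (write_chunked, write_bulk):
        p = os.path.join(d, fn.__name__)
        start = time.perf_counter()
        fn(p, chunks)
        print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s")
```

Both functions produce identical file contents; only the number of write calls differs.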


DOCUMENTATION REFERENCES

- Unity Catalog Volumes: https://docs.databricks.com/en/volumes/index.html
- Work with files on Databricks: https://docs.databricks.com/en/files/index.html
- Databricks Utilities (dbutils.fs): https://docs.databricks.com/en/dev-tools/databricks-utils.html


I hope one of these approaches resolves the performance issue for you. Given that it works in your lower environment, I would start with Investigation Step 1 (comparing the network and storage configuration between environments) as that is the most common explanation for this kind of discrepancy. In the meantime, Option A (local disk write then copy) should give you a quick improvement while you track down the root cause.

* This reply was drafted with an agent system I built, which researches responses using the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor the system's reliability, and update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.