Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Moving files using DBUtils is so slow

murtadha_s
New Contributor

I am using dbutils.fs.mv() on Databricks clusters and the move operation is very slow.
I move files within UC Volumes or ADLS storage via abfss:// paths; the moves work, but they are extremely slow.
Transfers that used to take minutes on HDFS now take hours.
What is the best solution for this, and what could have caused it?

1 REPLY

Louis_Frolio
Databricks Employee

Hello @murtadha_s, here are some helpful tips and hints to help you further diagnose the slowness.

Totally expected behavior here: object-storage moves with dbutils.fs.mv will be much slower than HDFS. Under the hood, dbutils isn’t doing an atomic rename – it’s doing a full copy and then a delete. So when you move a large directory, it has to walk every file and move every byte, which is dramatically slower than HDFS, where a “move” is basically just a metadata update.

What likely caused the slowness

A move in this world is really just a copy followed by a delete, even when everything lives in the same filesystem. So instead of a quick server-side rename, every single byte gets pushed through the ABFS connector. That alone changes the performance profile pretty dramatically.
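The copy-then-delete behavior can be sketched locally. This is a minimal stand-in (plain Python on a local filesystem, not the actual dbutils implementation) contrasting the two move semantics:

```python
import os
import shutil
import tempfile

def mv_object_store_style(src: str, dst: str) -> None:
    """Emulate an object-store 'move': copy every byte, then delete.
    This is effectively what dbutils.fs.mv does against ADLS/UC Volumes."""
    shutil.copyfile(src, dst)   # full byte copy through the client
    os.remove(src)              # then delete the source

def mv_hdfs_style(src: str, dst: str) -> None:
    """Emulate an HDFS-style move: a metadata-only rename."""
    os.rename(src, dst)         # no data is copied

# Tiny demo in a temp directory
root = tempfile.mkdtemp()
a = os.path.join(root, "a.txt")
b = os.path.join(root, "b.txt")
with open(a, "w") as f:
    f.write("payload")
mv_object_store_style(a, b)
print(os.path.exists(b), os.path.exists(a))  # True False
```

The object-store variant scales with the number of bytes; the rename variant is constant-time regardless of file size, which is the whole performance gap in a nutshell.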

On top of that, dbutils.fs recursive calls run single-threaded by default. If you’re dealing with a big directory tree full of tiny files, that one-lane highway becomes the bottleneck unless you intentionally parallelize the work.
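One way to widen that one-lane highway yourself is to fan the per-file moves out across a thread pool on the driver. The sketch below uses local `shutil.move` as a stand-in for `dbutils.fs.mv` so it runs anywhere; on Databricks you would substitute the dbutils call inside `_mv`:

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def parallel_move(files, dst_dir, max_workers=16):
    """Fan out per-file moves across a thread pool instead of the
    default single-threaded recursive walk."""
    os.makedirs(dst_dir, exist_ok=True)

    def _mv(path):
        # On Databricks this would be dbutils.fs.mv(path, dst) instead
        shutil.move(path, os.path.join(dst_dir, os.path.basename(path)))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(_mv, files))  # list() surfaces any exceptions

# Demo: move a handful of tiny files
src_dir, dst_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
paths = []
for i in range(5):
    p = os.path.join(src_dir, f"part-{i}.txt")
    with open(p, "w") as f:
        f.write("x")
    paths.append(p)
parallel_move(paths, dst_dir)
print(sorted(os.listdir(dst_dir)))
```

Threads work well here because each move is I/O-bound; the sweet spot for `max_workers` depends on your storage throttling limits, so start conservatively.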

When you layer in Unity Catalog volumes, you hit another limitation: many of these I/O operations still run on the driver, not executors. So bulk moves can get throttled simply because the driver is doing all the heavy lifting.

And if you step outside volumes and operate directly on abfss:// paths under external locations, you add one more tax to the system — extra permission checks on every access. At low scale, no big deal. At high file counts, you definitely feel it.

Best-practice solutions (fastest to most practical)

If the source and destination live in the same ADLS Gen2 filesystem, the quickest path is to use Azure’s native server-side moves. The Azure CLI can do true fast renames without shuttling data through the cluster. Something like:

az storage fs directory move -n <source-directory> -f <source-filesystem> --new-directory "<destination-filesystem>/<destination-directory>" --account-name <storage-account> --auth-mode login

Because this executes fully inside ADLS, it’s dramatically faster than dbutils.fs.mv, which has to copy data through the Spark cluster before cleaning up.

If you do need to run the move from Databricks, you can at least help yourself by enabling parallel recursive dbutils.fs operations so the driver can fan out the work:

# Enable parallel recursive cp/mv/rm on the driver
spark.conf.set("spark.databricks.service.dbutils.fs.parallel.enabled", True)

# Now run a recursive move
dbutils.fs.mv("/path/src/", "/path/dst/", True)

A couple things to keep in mind here:

  • This parallel mode kicks in only when you call the operation from the driver with recurse=True. On large folder trees, it can give you an order-of-magnitude improvement.
  • For volume paths, the picture changes a bit. Volumes don’t distribute dbutils.fs calls to executors, so some of the parallelization benefits can be muted.
  • If you’re working with volumes, you may be better served by the Databricks Files REST API or the Databricks CLI (fs). Both are designed for file management on volumes and make it easier to build reliable scripted workflows.
  • And avoid using shell-level moves like %sh mv for anything involving volumes — they aren’t supported across volume boundaries. Stick with dbutils.fs operations or Azure-native moves when you’re working directly against abfss:// paths.
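For the volume-oriented route mentioned above, here is a minimal sketch of building a request against the Databricks Files REST API (the `/api/2.0/fs/files/{path}` endpoint) to write a file into a UC volume. The host, token, and volume path are hypothetical placeholders; the Files API has no server-side move, so a scripted "move" is an upload plus a delete:

```python
import urllib.request

def files_api_put(host: str, token: str, volume_path: str, data: bytes):
    """Build a PUT request that writes `data` to a UC volume path via
    the Databricks Files API. The caller sends it with urlopen()."""
    url = f"{host}/api/2.0/fs/files{volume_path}?overwrite=true"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/octet-stream",
        },
    )

# Hypothetical workspace host, token, and volume path for illustration
req = files_api_put(
    "https://adb-123.azuredatabricks.net",
    "dapi-xxx",
    "/Volumes/main/default/my_vol/data.bin",
    b"payload",
)
print(req.get_method(), req.full_url)
```

The Databricks CLI (`databricks fs cp` / `databricks fs rm`) wraps the same API and is usually the easier choice for ad-hoc work; the raw API is useful when you need retries, logging, or batching under your own control.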

Practical decision guide

If you’re moving data within the same ADLS filesystem, your best bet is the Azure CLI server-side move. It performs a true fast rename and avoids pushing bytes through the cluster.

If you’re crossing containers, accounts, regions, or hopping between volume paths, things get heavier. When you can run the operation outside Databricks, lean on Azure’s own tooling — CLI, Storage Explorer, anything that lets you parallelize and skip the data-tromboning back through the cluster.

If you need to stay inside Databricks, enable the parallel dbutils mode and use recursive moves. Always test on a small sample first so you get a feel for performance under your specific runtime, permissions, and volume constraints.
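Testing on a small sample can be as simple as timing a move strategy over a synthetic directory before committing to a multi-hour run. A minimal harness (local filesystem stand-in; swap in your real move call):

```python
import os
import shutil
import tempfile
import time

def time_move(mover, n_files=50, size=1024):
    """Time a move strategy on a small synthetic sample: create
    n_files of `size` bytes, move them all, return elapsed seconds."""
    src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
    for i in range(n_files):
        with open(os.path.join(src, f"f{i}"), "wb") as f:
            f.write(os.urandom(size))
    start = time.perf_counter()
    mover(src, dst)
    return time.perf_counter() - start

def naive_move(src, dst):
    # Stand-in for dbutils.fs.mv(src, dst, True) on a real cluster
    for name in os.listdir(src):
        shutil.move(os.path.join(src, name), os.path.join(dst, name))

elapsed = time_move(naive_move)
print(f"moved 50 files in {elapsed:.3f}s")
```

Extrapolating from a 50-file sample to your real file count gives a rough lower bound, since throttling and permission checks tend to bite harder at scale.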

Why HDFS was faster

HDFS treats a move as a metadata rename, so it’s basically instant. Cloud storage plays by different rules. When you call dbutils.fs.mv, it isn’t doing a rename at all — it’s doing a full copy followed by a delete. That means performance scales with every byte and every file, not just a quick metadata tweak.

 
Hope this helps you better understand what is going on.
 
Cheers, Lou.