Hi @Nkrom
This is a very common issue! You are running into this because of how dbutils.fs.mv interacts with object storage. A move is implemented as a copy followed by a delete, and on a massive folder it works through the files sequentially (one file at a time).
Even worse, dbutils commands execute entirely on the Driver Node. Your worker nodes do absolutely nothing during this process. If your cluster has a small driver node, it will severely bottleneck the operation, hence the 7-hour wait!
If you don't have access to the raw Azure Storage Keys to use the Azure REST API directly, the best workaround is to use Python's ThreadPoolExecutor to run dbutils.fs.mv in parallel across multiple threads.
Here is a script you can use in your workflow to swap the folders much faster:
from concurrent.futures import ThreadPoolExecutor

def fast_move_contents(source_dir, target_dir, workers=16):
    # Ensure target directory exists
    dbutils.fs.mkdirs(target_dir)
    # List all items (files/folders) in the top level of the source
    items = dbutils.fs.ls(source_dir)

    def move_item(item):
        dbutils.fs.mv(item.path, f"{target_dir}/{item.name}", recurse=True)

    # Process the moves in parallel using threads.
    # list(...) consumes the results so an exception in any move is raised here
    # instead of being silently dropped.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(move_item, items))

BASE = "abfss://container@storage.dfs.core.windows.net"

# 1. Move 'customer' to a temp folder
fast_move_contents(f"{BASE}/customer", f"{BASE}/customer_temp")
# 2. Move 'customer_01' to 'customer'
fast_move_contents(f"{BASE}/customer_01", f"{BASE}/customer")
# 3. Move temp to 'customer_01' (this leaves an empty 'customer_temp' folder behind)
fast_move_contents(f"{BASE}/customer_temp", f"{BASE}/customer_01")
print("Parallel swap completed!")
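One caveat: the script only parallelizes over the top-level items, so a folder with just a few large subfolders will still move slowly. A rough sketch of how you could flatten the tree into individual files first is below. The helper names (list_files_recursive, retarget) are my own, not part of dbutils, and the listing function is passed in as a parameter so the code can be tried outside Databricks. It assumes the listed items expose .path and .isDir() the way dbutils.fs.ls results do, and that the root is given in the same form as the returned paths (which is the case for full abfss:// URIs).

```python
from typing import Callable, Iterable, List


def list_files_recursive(ls: Callable[[str], Iterable], root: str) -> List[str]:
    """Return the paths of all files under root, walking folders iteratively."""
    files = []
    pending = [root]
    while pending:
        current = pending.pop()
        for item in ls(current):
            if item.isDir():
                # Queue the subfolder so its contents get listed later
                pending.append(item.path)
            else:
                files.append(item.path)
    return files


def retarget(path: str, source_root: str, target_root: str) -> str:
    """Map a file path under source_root to the same relative path under target_root."""
    source_root = source_root.rstrip("/")
    if not path.startswith(source_root + "/"):
        raise ValueError(f"{path} is not under {source_root}")
    relative = path[len(source_root):].lstrip("/")
    return f"{target_root.rstrip('/')}/{relative}"
```

In a notebook you would call it as list_files_recursive(dbutils.fs.ls, source_dir) and then submit one dbutils.fs.mv per file to the thread pool, using retarget to build each destination. Keep in mind the listing itself still runs sequentially on the driver, and the emptied source folders are left behind and need to be removed separately.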
Two Pro-Tips for your Workflow:
Cluster Sizing: Since this runs on the driver, make sure the cluster attached to your workflow has a memory-optimized or compute-optimized Driver Node. You can keep the worker nodes at zero (Single Node cluster) just for this specific task to save money!
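Max Workers: The threads spend most of their time waiting on storage calls rather than using CPU, so you can usually raise workers=16 to 32 or 64, even beyond the number of cores on your driver node. Increase it gradually and watch for throttling errors from the storage account.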
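With many moves in flight, it also helps to know exactly which ones failed so you can retry only those. Here is a minimal sketch of that idea. The function name move_all and its parameters are hypothetical, and the move function is passed in as an argument (in Databricks it would wrap dbutils.fs.mv) so the logic can be tested anywhere.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def move_all(mv, pairs, workers=16):
    """Run mv(src, dst) for every pair in parallel and return a list of (src, error)."""
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Remember which source path each future belongs to
        futures = {executor.submit(mv, src, dst): src for src, dst in pairs}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                # Record the failure and keep going with the remaining moves
                failures.append((futures[future], exc))
    return failures
```

In a notebook you could pass lambda src, dst: dbutils.fs.mv(src, dst, recurse=True) as mv, then check whether the returned list is empty before continuing with the next step of the swap.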
Hope this helps speed up your pipeline!