Databricks Community

Nkrom · an hour ago

Hi, I have two folders, customer and customer_01, in an ADLS location. I need to rename customer_01 to customer and customer to customer_01. Both of these folders have lots of files. If I use dbutils.fs.mv, it takes a lot of time, something like 7 hours. Is there any way to do this from a notebook, so that we can attach it to a workflow and don't need to go to the ADLS location and rename the folders from there?

ShamenParis · 45m ago

Hi @Nkrom

This is a very common issue! You are running into this because of how dbutils.fs.mv interacts with object storage. A move is implemented as a copy followed by a delete, and on a massive folder it executes sequentially (one file at a time).

Even worse, dbutils commands execute entirely on the Driver Node. Your worker nodes do absolutely nothing during this process. If your cluster has a small driver node, it will severely bottleneck the operation, hence the 7-hour wait!

If you don't have access to the raw Azure Storage Keys to use the Azure REST API directly, the best workaround is to use Python's ThreadPoolExecutor to run dbutils.fs.mv in parallel across multiple threads.

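Here is a script you can use in your workflow to swap the folders much faster. It parallelizes across the top-level items of each folder, so the speedup depends on how many of those there are: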

from concurrent.futures import ThreadPoolExecutor

def fast_move_contents(source_dir, target_dir, workers=16):
    # Ensure target directory exists
    dbutils.fs.mkdirs(target_dir)
    
    # List all items (files/folders) in the top level of the source
    items = dbutils.fs.ls(source_dir)
    
    def move_item(item):
        dbutils.fs.mv(item.path, f"{target_dir}/{item.name}", recurse=True)
        
    # Process the moves in parallel using threads.
    # list(...) consumes the results so an exception raised in a worker is re-raised here
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(move_item, items))

# 1. Move 'customer' to a temp folder
fast_move_contents("abfss://container@storage.dfs.core.windows.net/customer", "abfss://container@storage.dfs.core.windows.net/customer_temp")

# 2. Move 'customer_01' to 'customer'
fast_move_contents("abfss://container@storage.dfs.core.windows.net/customer_01", "abfss://container@storage.dfs.core.windows.net/customer")

# 3. Move temp to 'customer_01'
fast_move_contents("abfss://container@storage.dfs.core.windows.net/customer_temp", "abfss://container@storage.dfs.core.windows.net/customer_01")

print("Parallel swap completed!")

Two Pro-Tips for your Workflow:

Cluster Sizing: Since this runs on the driver, make sure the cluster attached to your workflow has a memory-optimized or compute-optimized Driver Node. You can keep the worker nodes at zero (Single Node cluster) just for this specific task to save money!
Max Workers: The moves are I/O-bound, so you can try increasing workers=16 to 32 or 64 depending on how many cores your driver node has. Back off if the storage account starts throttling requests.
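
If it helps, here is a minimal sketch of how the parallel move could report which items failed, rather than stopping at the first error. The helper name move_all and the injected move_fn parameter are my own assumptions for illustration, not part of dbutils; in a notebook you would pass a small wrapper around dbutils.fs.mv, as shown in the trailing comments.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def move_all(pairs, move_fn, workers=16):
    """Run move_fn(src, dst) for every (src, dst) pair in parallel threads.

    Returns a list of (src, dst, exception) tuples for the moves that failed,
    so the caller can decide whether to retry or abort.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Map each submitted future back to the pair it is moving
        futures = {executor.submit(move_fn, src, dst): (src, dst) for src, dst in pairs}
        for future in as_completed(futures):
            src, dst = futures[future]
            try:
                future.result()  # re-raises any exception from the worker thread
            except Exception as exc:
                failures.append((src, dst, exc))
    return failures


# Hypothetical usage in a Databricks notebook (source_dir / target_dir are placeholders):
# pairs = [(i.path, f"{target_dir}/{i.name}") for i in dbutils.fs.ls(source_dir)]
# failed = move_all(pairs, lambda s, d: dbutils.fs.mv(s, d, recurse=True))
# if failed:
#     raise RuntimeError(f"{len(failed)} item(s) failed to move: {failed[:5]}")
```

Failing the task when the list is non-empty keeps the workflow from moving on to the next step of the swap with a half-moved folder.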

Hope this helps speed up your pipeline!

Nkrom · 29m ago

Thanks for this. How can it be done with the Azure REST API? Can you please describe that too?

ShamenParis · 17m ago

Hi @Nkrom ,

I am happy to share the Azure REST API method! The Azure Python SDK, which wraps the REST API, is the fastest way to do this, but you can call the API from any other programming language.
ADLS Gen2 uses a "Hierarchical Namespace" (HNS). When HNS is enabled on the storage account, renaming a folder through the Azure SDK doesn't touch the files inside. It just updates the folder name in the metadata layer. What takes dbutils 7 hours should take this API a few seconds.

Here is how you do it in a Databricks Notebook:

First, you need to install the Azure Data Lake library. You can run this in the first cell of your notebook:

%pip install azure-storage-file-datalake

Next, use this script. Important: Avoid hardcoding your storage key in the notebook. Use dbutils.secrets.get() to pull it from a Databricks secret scope (which can be backed by Azure Key Vault). Pasting the key directly into the code works technically, but keep that to a quick test.

from azure.storage.filedatalake import DataLakeServiceClient

# 1. Setup your credentials securely
storage_account = "<your_storage_account_name>"
container = "<your_container_name>"

# Pull the storage key from Databricks Secrets
storage_key = dbutils.secrets.get(scope="your_scope_name", key="your_secret_name")

# Create the client connection
service_client = DataLakeServiceClient(
    account_url=f"https://{storage_account}.dfs.core.windows.net", 
    credential=storage_key
)
file_system_client = service_client.get_file_system_client(file_system=container)

# 2. Get clients for your current directories
dir_customer = file_system_client.get_directory_client("customer")
dir_customer_01 = file_system_client.get_directory_client("customer_01")

# 3. Perform the swap with three renames (each rename is a fast metadata operation with HNS enabled,
#    but the three steps together are not atomic)
# Note: The new_name parameter requires the container name in the path
dir_customer.rename_directory(new_name=f"{container}/customer_temp")
dir_customer_01.rename_directory(new_name=f"{container}/customer")

# Get the temp folder and rename it to customer_01
dir_temp = file_system_client.get_directory_client("customer_temp")
dir_temp.rename_directory(new_name=f"{container}/customer_01")

print("Folders swapped instantly via Azure API!")
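
Since the swap is three separate renames, a failure in the middle would leave customer missing. Here is a rough sketch of how the steps could be wrapped so the first rename is undone when the second one fails. The function swap_directories and the rename callable are hypothetical names I am introducing; the commented wrapper at the bottom reuses file_system_client and container from the script above.

```python
def swap_directories(rename, dir_a, dir_b, temp_name):
    """Swap dir_a and dir_b using three calls to rename(src, dst).

    If the second rename fails, the first one is undone so dir_a is restored.
    If the last rename fails, the old dir_a content stays under temp_name.
    """
    rename(dir_a, temp_name)        # dir_a -> temp
    try:
        rename(dir_b, dir_a)        # dir_b -> dir_a
    except Exception:
        rename(temp_name, dir_a)    # roll back: temp -> dir_a
        raise
    rename(temp_name, dir_b)        # temp -> dir_b


# Hypothetical wiring to the SDK calls used in the script above:
# def adls_rename(src, dst):
#     file_system_client.get_directory_client(src).rename_directory(new_name=f"{container}/{dst}")
#
# swap_directories(adls_rename, "customer", "customer_01", "customer_temp")
```

Passing the rename function in as a parameter also makes it easy to dry-run the logic against a fake before pointing it at real storage.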

Quick note: If your company uses Microsoft Entra ID identities (a service principal or a managed identity) instead of storage account keys, you can install the azure-identity library and replace credential=storage_key with credential=DefaultAzureCredential(). The identity needs a data role on the storage account, such as Storage Blob Data Contributor.
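
For reference, the credential swap might look roughly like this. The helper names account_url and make_service_client are hypothetical; DefaultAzureCredential and DataLakeServiceClient are the real classes from azure-identity and azure-storage-file-datalake. This assumes the cluster can obtain an Entra ID token, for example through environment variables for a service principal or through a managed identity.

```python
def account_url(storage_account):
    """Build the ADLS Gen2 (dfs) endpoint URL for a storage account name."""
    return f"https://{storage_account}.dfs.core.windows.net"


def make_service_client(storage_account, credential=None):
    """Create a DataLakeServiceClient, defaulting to Entra ID authentication.

    Requires: %pip install azure-storage-file-datalake azure-identity
    """
    # Imported here so this cell can be defined before the packages are installed
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    if credential is None:
        # Tries several sources in turn, including environment variables
        # and managed identity
        credential = DefaultAzureCredential()
    return DataLakeServiceClient(
        account_url=account_url(storage_account),
        credential=credential,
    )


# Hypothetical usage, replacing the key-based client in the script above:
# service_client = make_service_client("<your_storage_account_name>")
# file_system_client = service_client.get_file_system_client(file_system="<your_container_name>")
```

The rest of the rename script stays the same; only the way service_client is created changes.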

Give this a try in your workflow; it should save you a massive amount of cluster compute time! Let us know how it goes.