Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Finding all folder paths in a blob store connected via UC external connection

turagittech
Contributor

Hi All,

I need an easy way to find all the paths in a blob store so I can find the files and load them. I have tried using the Azure Blob storage connection in Python and have a solution that works, but it is very slow. I was speaking to a data engineer, and he suggested I try using external connections to access the storage. However, because the hierarchical namespace is not enabled on this blob store, I cannot use spark.read.load() to load all files, as that requires a hierarchical file system and we can't enable it on this storage.

The obvious candidate is os.walk, but that doesn't work, at least not the way you would use it on a regular filesystem. I tried using the abfss path as the root path.

I could do it with the Azure Storage library and BlobServiceClient, but I'm looking for alternatives.

If anyone has worked out a solution to getting all the paths, any tips would be great.

import os

def getdir_tree(root_path):
  # Walk the tree under root_path and collect every file's full path
  path_list = []
  for path, subdirs, files in os.walk(root_path):
    for name in files:
      path_list.append(os.path.join(path, name))
  return path_list
1 REPLY

mark_ott
Databricks Employee

The most efficient way to list all file paths in an Azure Blob Storage container from Databricks, especially when Hierarchical Namespace (HNS) is not enabled, is to use Azure SDKs targeting the blob flat namespace directly rather than filesystem protocols. Using os.walk or Spark's HDFS API commands won't work correctly because blob storage isn't a native filesystem and lacks a true recursive walk feature unless HNS is activated.

Recommended Approach: Azure SDK – List Blobs Flat

Instead of walking a directory tree, use the BlobServiceClient from azure-storage-blob to get a ContainerClient and call its list_blobs method. This works whether or not HNS is enabled and is highly performant for large containers:

python
from azure.storage.blob import BlobServiceClient

def list_all_blob_paths(container_name, prefix='', connection_string=''):
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    paths = []
    # list_blobs returns a flat listing of all blobs under the prefix
    for blob in container_client.list_blobs(name_starts_with=prefix):
        paths.append(blob.name)
    return paths
  • container_name: Name of your container

  • prefix: Optional blob-name prefix, to limit the listing to a sub-"directory"

  • connection_string: Azure Storage connection string

This approach:

  • Does not depend on HNS.

  • Will work for any Blob storage account.

  • Is orders of magnitude faster than recursive hacks, since listing blobs is a flat operation.
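
A minimal usage sketch (the secret scope/key, container name, and prefix below are hypothetical; substitute your own values):

python
# Hypothetical names: adjust the secret scope/key, container, and prefix to your environment
conn_str = dbutils.secrets.get(scope="storage", key="blob-connection-string")
paths = list_all_blob_paths("raw-data", prefix="landing/2024/", connection_string=conn_str)
print(f"Found {len(paths)} blobs")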

Alternative: Databricks Utilities (dbutils)

Databricks includes the dbutils.fs.ls() utility for file listing, and it works with mounted or directly addressed storage (e.g., "wasbs://", "abfss://"), but it only lists one directory at a time; it does not recurse. You would have to call it recursively yourself, and this can be slow for deep container structures:

python
def recursive_ls(path):
    all_files = []
    try:
        files = dbutils.fs.ls(path)
        for f in files:
            if f.isDir():
                all_files.extend(recursive_ls(f.path))
            else:
                all_files.append(f.path)
    except Exception as e:
        print(e)
    return all_files
  • For large containers, this can hit performance and API limits and is not ideal unless your directory structure is shallow.
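
If you do go the dbutils route, a usage sketch might look like this (the storage account, container, and path are hypothetical):

python
# Hypothetical path: point this at a location your UC external location or mount can access
all_paths = recursive_ls("abfss://raw-data@mystorageacct.dfs.core.windows.net/landing/")
print(f"Found {len(all_paths)} files")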

Best Practices and Tips

  • Use Azure SDKs for "listing" blobs, not filesystem commands.

  • If you must "load" all files into Spark, you can get the paths using the SDK and then pass them to spark.read.* using a list of paths (but this only works for supported formats, not parquet/orc without HNS); a sketch follows after this list.

  • For huge numbers of blobs, consider paged or async listing with Azure SDKs.
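
For the "load" tip above, here is a sketch of passing SDK-listed paths to Spark. The account, container, file format, and wasbs:// scheme are assumptions for illustration (non-HNS accounts are typically addressed via the blob endpoint):

python
# Sketch only: account/container names, format, and credential setup are assumptions
account = "mystorageacct"
container = "raw-data"
blob_names = list_all_blob_paths(container, prefix="landing/", connection_string=conn_str)

# Build fully qualified URIs and hand the whole list to a single reader call
full_paths = [
    f"wasbs://{container}@{account}.blob.core.windows.net/{name}"
    for name in blob_names
    if name.endswith(".csv")
]
df = spark.read.format("csv").option("header", "true").load(full_paths)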
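
And for the paged-listing tip, here is a sketch using the SDK's by_page() iterator; the container, prefix, page size, and conn_str variable are assumptions carried over from the earlier sketch:

python
# Paged listing sketch: process blob names in batches instead of holding them all in memory
from azure.storage.blob import BlobServiceClient

container_client = BlobServiceClient.from_connection_string(conn_str).get_container_client("raw-data")
pages = container_client.list_blobs(name_starts_with="landing/", results_per_page=5000).by_page()

for page in pages:
    batch = [blob.name for blob in page]
    print(f"Got {len(batch)} blob names in this page")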

Summary Table

Method                        Works Without HNS   True Recursion   Performance    Code Required
Azure list_blobs              Yes                 Yes (flat)       Fast           Moderate
Databricks dbutils.fs.ls      Yes                 No               Slow (deep)    Easy
os.walk() (on abfss/wasbs)    No                  No               N/A            N/A

For best results, use the Azure blob API (list_blobs) for listing and loading blobs when working in Databricks with non-HNS accounts.


Sources: Azure Blob Storage flat listing documentation, Databricks forums, and best-practice articles on listing files in blob storage without HNS.