The most efficient way to list all file paths in an Azure Blob Storage container from Databricks, especially when Hierarchical Namespace (HNS) is not enabled, is to use the Azure SDK and list blobs against the flat namespace directly, rather than going through filesystem protocols. Approaches like os.walk or directory-walking over Spark filesystem commands do not behave like a true recursive walk here: without HNS the container is not a real filesystem, so there are no actual directories to descend into, only blob names that happen to contain "/" separators.
Recommended Approach: Azure SDK – List Blobs Flat
Instead of walking a directory tree, list blobs with the BlobServiceClient from azure-storage-blob using the list_blobs method. This works whether or not HNS is enabled and is highly performant even for large containers:
```python
from azure.storage.blob import BlobServiceClient

def list_all_blob_paths(container_name, prefix='', connection_string=''):
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    paths = []
    # list_blobs returns a flat listing of all blobs under the prefix
    for blob in container_client.list_blobs(name_starts_with=prefix):
        paths.append(blob.name)
    return paths
```
- `container_name`: the name of your container
- `prefix`: optional prefix, if you want to limit the listing to a sub-"directory"
- `connection_string`: your Azure Storage connection string
This approach:

- does not depend on HNS,
- works for any Blob Storage account, and
- is orders of magnitude faster than recursive hacks, since listing blobs is a single flat operation.
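As a rough usage sketch (the secret scope, key, container name, and prefix below are placeholders, not values from this article), you might call it from a Databricks notebook like this, pulling the connection string from a secret scope:

```python
# Placeholder scope/key and container/prefix -- substitute your own values.
conn_str = dbutils.secrets.get(scope="my-scope", key="storage-connection-string")

paths = list_all_blob_paths(
    container_name="my-container",
    prefix="raw/2024/",          # optional: restrict to a virtual "directory"
    connection_string=conn_str,
)
print(f"Found {len(paths)} blobs")
```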
Alternative: Databricks Utilities (dbutils)
Databricks also provides dbutils.fs.ls() for file listing. It works with mounted storage and direct wasbs:// or abfss:// paths, but it only lists one directory level at a time and does not recurse. You would have to recurse yourself, which can be slow for deep container structures:
```python
def recursive_ls(path):
    all_files = []
    try:
        # dbutils.fs.ls only returns the immediate children of `path`
        files = dbutils.fs.ls(path)
        for f in files:
            if f.isDir():
                # descend into each sub-"directory" manually
                all_files.extend(recursive_ls(f.path))
            else:
                all_files.append(f.path)
    except Exception as e:
        # e.g. permission errors or transient failures on a sub-path
        print(e)
    return all_files
```
- For large containers this can hit performance and API limits; it is not ideal unless your directory structure is shallow.
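If you do go this route, here is a minimal usage sketch with the recursive_ls helper above (the mount path is a placeholder, not a value from this article):

```python
# Placeholder path: use your own mount point or a direct abfss://
# or wasbs:// URI.
all_paths = recursive_ls("dbfs:/mnt/my-container/")
print(f"Found {len(all_paths)} files")
```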
Best Practices and Tips
- Use the Azure SDK for listing blobs, not filesystem commands.
- If you need to load all the files into Spark, get the paths with the SDK and pass them to spark.read.* as a list of fully qualified wasbs:// or abfss:// paths (first sketch below).
- For very large numbers of blobs, consider paged or async listing with the Azure SDK (second sketch below).
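For the second bullet, a minimal sketch, assuming a non-HNS account reached over the wasbs:// scheme with credentials already configured on the cluster; the account name, container, prefix, and file format are placeholders:

```python
# Sketch: turn blob names from the SDK into fully qualified paths Spark can read.
# `conn_str` is assumed to hold your storage connection string (see earlier sketch).
account = "mystorageaccount"      # placeholder storage account name
container = "my-container"        # placeholder container name

blob_names = list_all_blob_paths(container, prefix="raw/2024/", connection_string=conn_str)

csv_paths = [
    f"wasbs://{container}@{account}.blob.core.windows.net/{name}"
    for name in blob_names
    if name.endswith(".csv")
]

# DataFrameReader.load accepts a list of paths
df = spark.read.format("csv").option("header", "true").load(csv_paths)
```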
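For the third bullet, a paged-listing sketch reusing the ContainerClient and prefix from the function above (both assumed to already be in scope); each page holds at most results_per_page names, and a single listing call is capped by the service at 5,000:

```python
# Process blob names one page at a time instead of holding them all in memory.
pages = container_client.list_blobs(name_starts_with=prefix,
                                    results_per_page=5000).by_page()

for page in pages:
    batch = [blob.name for blob in page]
    # process or persist this batch before fetching the next page
    print(f"Got {len(batch)} blob names in this page")
```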
Summary Table
| Method | Works Without HNS | True Recursion | Performance | Code Required |
|---|---|---|---|---|
| Azure SDK `list_blobs` | Yes | Yes (flat listing) | Fast | Moderate |
| Databricks `dbutils.fs.ls` | Yes | No (manual recursion) | Slow for deep trees | Easy |
| `os.walk()` on abfss/wasbs paths | No | No | N/A | N/A |
For best results, use the Azure blob API (list_blobs) for listing and loading blobs when working in Databricks with non-HNS accounts.