Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Finding all folder paths in a blob store connected via UC external connection

turagittech
Contributor

Hi All,

I need an easy way to find all the paths in a blob store so I can find the files and load them. I have tried using the Azure Blob storage connection in Python and have a solution that works, but it is very slow. I was speaking to a data engineer, and he suggested I try using external connections to access the storage. However, because the hierarchical namespace is not enabled on this blob store, I cannot use spark.read.load() to load all files, as that requires a hierarchical file system and we can't enable it on this storage.

The obvious candidate is os.walk, but that doesn't work, at least not the way you would use it on a regular filesystem. I tried using the abfss path as the root path.

I could do it with the Azure Storage library and BlobServiceClient, but I'm looking for alternatives.

If anyone has worked out a solution to getting all the paths, any tips would be great.

import os

def getdir_tree(root_path):
  # Walk the tree under root_path and collect every file's full path
  path_list = []
  for path, subdirs, files in os.walk(root_path):
    for name in files:
      path_list.append(os.path.join(path, name))
  return path_list
1 REPLY

mark_ott
Databricks Employee

The most efficient way to list all file paths in an Azure Blob Storage container from Databricks, especially when Hierarchical Namespace (HNS) is not enabled, is to use Azure SDKs targeting the blob flat namespace directly rather than filesystem protocols. Using os.walk or Spark's HDFS API commands won't work correctly because blob storage isn't a native filesystem and lacks a true recursive walk feature unless HNS is activated.

Recommended Approach: Azure SDK – List Blobs Flat

Instead of walking a directory tree, use the BlobServiceClient from azure-storage-blob to get a ContainerClient and call its list_blobs method. This works whether or not HNS is enabled and is highly performant for large containers:

python
from azure.storage.blob import BlobServiceClient

def list_all_blob_paths(container_name, prefix='', connection_string=''):
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    paths = []
    # list_blobs returns a flat listing of all blobs under the prefix
    for blob in container_client.list_blobs(name_starts_with=prefix):
        paths.append(blob.name)
    return paths
  • container_name: Name of your container

  • prefix: Optional blob-name prefix, to limit the listing to a sub-"directory"

  • connection_string: Azure Storage connection string

This approach:

  • Does not depend on HNS.

  • Will work for any Blob storage account.

  • Is orders of magnitude faster than recursive hacks, since listing blobs is a flat operation.
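
A minimal usage sketch (the secret scope/key, container name, and prefix below are hypothetical; substitute your own values):

python
# Hypothetical names: adjust the secret scope/key, container, and prefix to your environment
conn_str = dbutils.secrets.get(scope="storage", key="blob-connection-string")
paths = list_all_blob_paths("raw-data", prefix="landing/2024/", connection_string=conn_str)
print(f"Found {len(paths)} blobs")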

Alternative: Databricks Utilities (dbutils)

Databricks includes the dbutils.fs.ls() utility for file listing, and it works with mounted or directly addressed storage (e.g., "wasbs://", "abfss://"), but it only lists one directory at a time; it does not recurse. You would have to call it recursively yourself, and this can be slow for deep container structures:

python
def recursive_ls(path):
    all_files = []
    try:
        files = dbutils.fs.ls(path)
        for f in files:
            if f.isDir():
                all_files.extend(recursive_ls(f.path))
            else:
                all_files.append(f.path)
    except Exception as e:
        print(e)
    return all_files
  • For large containers, this can hit performance and API limits and is not ideal unless your directory structure is shallow.
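
If you do go the dbutils route, a usage sketch might look like this (the storage account, container, and path are hypothetical):

python
# Hypothetical path: point this at a location your UC external location or mount can access
all_paths = recursive_ls("abfss://raw-data@mystorageacct.dfs.core.windows.net/landing/")
print(f"Found {len(all_paths)} files")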

Best Practices and Tips

  • Use Azure SDKs for "listing" blobs, not filesystem commands.

  • If you must "load" all files into Spark, you can get the paths using the SDK and then pass them to spark.read.* using a list of paths (but this only works for supported formats, not parquet/orc without HNS); a sketch follows after this list.

  • For huge numbers of blobs, consider paged or async listing with Azure SDKs.
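
For the "load" tip above, here is a sketch of passing SDK-listed paths to Spark. The account, container, file format, and wasbs:// scheme are assumptions for illustration (non-HNS accounts are typically addressed via the blob endpoint):

python
# Sketch only: account/container names, format, and credential setup are assumptions
account = "mystorageacct"
container = "raw-data"
blob_names = list_all_blob_paths(container, prefix="landing/", connection_string=conn_str)

# Build fully qualified URIs and hand the whole list to a single reader call
full_paths = [
    f"wasbs://{container}@{account}.blob.core.windows.net/{name}"
    for name in blob_names
    if name.endswith(".csv")
]
df = spark.read.format("csv").option("header", "true").load(full_paths)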
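
And for the paged-listing tip, here is a sketch using the SDK's by_page() iterator; the container, prefix, page size, and conn_str variable are assumptions carried over from the earlier sketch:

python
# Paged listing sketch: process blob names in batches instead of holding them all in memory
from azure.storage.blob import BlobServiceClient

container_client = BlobServiceClient.from_connection_string(conn_str).get_container_client("raw-data")
pages = container_client.list_blobs(name_starts_with="landing/", results_per_page=5000).by_page()

for page in pages:
    batch = [blob.name for blob in page]
    print(f"Got {len(batch)} blob names in this page")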

Summary Table

Method                        Works Without HNS   True Recursion   Performance    Code Required
Azure list_blobs              Yes                 Yes (flat)       Fast           Moderate
Databricks dbutils.fs.ls      Yes                 No               Slow (deep)    Easy
os.walk() (on abfss/wasbs)    No                  No               N/A            N/A

For best results, use the Azure blob API (list_blobs) for listing and loading blobs when working in Databricks with non-HNS accounts.


Sources: Azure Blob Storage flat listing documentation, Databricks forums, and best-practice articles on listing files in blob storage without HNS.