Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best pattern for ingesting data from hundreds of separate ADLS Gen2 containers into Databricks?

datastrange
New Contributor

We're building a lakehouse on Azure Databricks with Unity Catalog. Our data lands in Azure Data Lake Storage Gen2 (Hierarchical Namespace enabled) as JSON files. The challenge is multi-tenancy: each tenant's data is written to a separate container in the same storage account, following a naming convention like prefix-tenantA, prefix-tenantB, etc. We currently have a handful of tenants in dev but expect to scale to a few hundred.

We need to get all this data into Databricks, ideally into shared tables (all tenants in one table, with a tenant-name column to distinguish them).

What we've tried:

  1. Autoloader with container-level wildcards (abfss://prefix-*@account.dfs.core.windows.net/common_path/...) — does not work. Wildcards are not supported in the container portion of the ABFSS path.
  2. Single Autoloader with multiple paths (string-splitting or passing a list of container paths) — only reads from the first container.
  3. One Autoloader per container — this works, but raises concerns about scale: can we run 100+ Autoloader pipelines efficiently? What are the compute cost and monitoring implications?
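
For concreteness, attempt 3 looks roughly like the sketch below. Account and path names are placeholders; only the path-building part is shown in plain Python, with the per-path stream indicated in a comment:

```python
# Illustrative sketch of attempt 3: one Auto Loader stream per tenant container.
# Account name and common path are placeholders.
ACCOUNT = "youraccount"
COMMON_PATH = "common_path"

def container_path(tenant: str) -> str:
    """Build the ABFSS path for one tenant's container."""
    return f"abfss://prefix-{tenant}@{ACCOUNT}.dfs.core.windows.net/{COMMON_PATH}"

tenants = ["tenantA", "tenantB"]
paths = [container_path(t) for t in tenants]
# Each path then gets its own stream, e.g.:
#   spark.readStream.format("cloudFiles")
#        .option("cloudFiles.format", "json")
#        .load(path)
```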

What we're considering:

  • A config-driven approach where a configuration file lists all tenant names, and a deployment process creates/updates a DLT pipeline per tenant automatically.
  • Alternatively, restructuring storage so tenants are directories inside one container instead of separate containers (but customers prefer container-level isolation).

Our questions:

  1. Is there a recommended pattern for ingesting from many separate containers into Databricks? Are there limits on how many Autoloader streams can run concurrently?
  2. Could Unity Catalog external locations or volumes be used to abstract over multiple containers without running separate Autoloader instances — for example, mounting all tenant containers as a single logical location?
  3. For those running multi-tenant Databricks lakehouses at scale: how do you handle per-tenant ingestion? Separate pipelines, a single parameterized pipeline, or something else entirely?

Environment: Azure Databricks, Unity Catalog, ADLS Gen2 with HNS, DLT pipelines deployed via Databricks Asset Bundles, managed identity authentication.

Any guidance or experience reports appreciated. We'd especially like to hear from anyone running 50+ concurrent Autoloader streams.

1 REPLY

SteveOstrowski
Databricks Employee

Hi @datastrange,

Great question -- this is a common architectural challenge in multi-tenant Azure Databricks environments, and you have already identified the key constraint: Auto Loader does not support wildcards in the container portion of the abfss:// path. The container name must be fully specified because it is part of the Azure storage endpoint (it maps to a DNS name, not a file path). So abfss://prefix-*@account.dfs.core.windows.net/... will never work -- that is by design in the ABFSS protocol, not an Auto Loader limitation.

Here is a breakdown of recommended patterns:


PATTERN 1 (RECOMMENDED): SINGLE DLT PIPELINE WITH MULTIPLE APPEND FLOWS

This is the most elegant pattern for your scenario. In Lakeflow Declarative Pipelines (formerly DLT), you can use append flows to fan multiple Auto Loader sources into a single streaming table. The key insight is that you can use a Python for loop to dynamically generate flows at pipeline definition time.

from pyspark import pipelines as dp
from pyspark.sql.functions import lit

# Define your tenant list -- could also be loaded from a config table
tenants = ["tenantA", "tenantB", "tenantC"]  # Scale to hundreds

STORAGE_ACCOUNT = "youraccount"
COMMON_PATH = "data/events"

# Create the shared target streaming table once
dp.create_streaming_table("all_tenant_events")

# Dynamically create one append flow per tenant
for tenant in tenants:

    @dp.append_flow(target="all_tenant_events", name=f"ingest_{tenant}")
    def create_flow(tenant_name=tenant):  # default arg binds the loop variable at definition time
        path = f"abfss://prefix-{tenant_name}@{STORAGE_ACCOUNT}.dfs.core.windows.net/{COMMON_PATH}"
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.inferColumnTypes", "true")
            .load(path)
            .withColumn("tenant_name", lit(tenant_name))
        )

Why this works well:

- All tenants land in a single streaming table with a tenant_name discriminator column -- exactly what you want
- Each append flow maintains its own checkpoint, so a failure in one tenant does not block others
- The documentation states: "Any number of append flows can write to a particular target"
- A workspace supports up to 200 concurrent pipeline updates

Scaling considerations:

- Each append flow runs as its own stream inside the pipeline. With hundreds of flows, you will need a cluster with enough cores and memory; serverless compute, or classic compute with enhanced autoscaling, is recommended.
- Consider using triggered mode rather than continuous processing. Schedule your pipeline to run periodically -- it processes all pending files across all tenants and then shuts down, which is more cost-effective.
- For file notification mode on Azure, there is a limit of 500 concurrent file notification pipelines per storage account using classic notifications. Using managed file events (cloudFiles.useManagedFileEvents = true) avoids this per-stream limit. Requires DBR 14.3 LTS+ and Unity Catalog external locations with file events enabled.
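
As a sketch of that last point, the per-stream option set can be assembled like this. The helper itself is illustrative; the cloudFiles option names are the documented Auto Loader settings, but verify them against your DBR version:

```python
# Hypothetical helper assembling Auto Loader options. The option names come
# from the Auto Loader docs; the helper and defaults are illustrative only.
def autoloader_options(use_managed_file_events: bool) -> dict:
    opts = {
        "cloudFiles.format": "json",
        "cloudFiles.inferColumnTypes": "true",
    }
    if use_managed_file_events:
        # Managed file events replace per-stream queue/notification setup,
        # avoiding the 500-pipelines-per-storage-account classic limit
        opts["cloudFiles.useManagedFileEvents"] = "true"
    return opts

opts = autoloader_options(use_managed_file_events=True)
# Applied as: spark.readStream.format("cloudFiles").options(**opts).load(path)
```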

Loading the tenant list dynamically:

# Option A: Load from a config Delta table
tenant_df = spark.read.table("config.tenants")
tenants = [row.tenant_name for row in tenant_df.collect()]

# Option B: List containers from Azure at pipeline definition time
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient(
    account_url=f"https://{STORAGE_ACCOUNT}.blob.core.windows.net",
    credential=...,
)
tenants = [
    c.name.replace("prefix-", "")
    for c in blob_service.list_containers(name_starts_with="prefix-")
]

Note: the tenant list is evaluated at pipeline definition time (when the pipeline starts). To pick up new tenants, restart/update the pipeline.

Docs:
- Append Flows: https://docs.databricks.com/en/ldp/flows.html
- DLT Best Practices: https://docs.databricks.com/en/ldp/best-practices.html
- Auto Loader Production: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html
- File Notification Mode: https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/file-n...


PATTERN 2: CONFIG-DRIVEN MULTIPLE PIPELINES VIA DATABRICKS ASSET BUNDLES

If you prefer stronger isolation between tenants (separate failure domains, independent scheduling, per-tenant monitoring), use Databricks Asset Bundles to deploy one pipeline per tenant from a parameterized template.

In your databricks.yml:

variables:
  tenant_name:
    description: "Tenant identifier"
    default: "tenantA"

resources:
  pipelines:
    tenant_ingestion:
      name: "ingestion-${var.tenant_name}"
      target: "bronze"
      configuration:
        tenant_name: "${var.tenant_name}"
        storage_account: "youraccount"
      libraries:
        - notebook:
            path: ./notebooks/ingest_tenant.py

Then deploy multiple instances:

for tenant in tenantA tenantB tenantC; do
  databricks bundle deploy --var="tenant_name=$tenant" --target prod
done

This gives per-tenant failure isolation and independent scheduling, but at the cost of more compute resources (each pipeline has its own cluster).
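
On the notebook side, ingest_tenant.py reads the values set under configuration: via spark.conf.get. A plain dict stands in for that lookup in the sketch below; the path layout is assumed from the earlier examples:

```python
# In the real notebook the values come from the pipeline's configuration block,
# e.g. tenant = spark.conf.get("tenant_name"); a dict stands in for it here.
def source_path(conf: dict) -> str:
    tenant = conf["tenant_name"]
    account = conf["storage_account"]
    # Layout assumed from the append-flow example above
    return f"abfss://prefix-{tenant}@{account}.dfs.core.windows.net/data/events"

conf = {"tenant_name": "tenantA", "storage_account": "youraccount"}
# spark.readStream.format("cloudFiles").load(source_path(conf)) then ingests it
```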

Docs:
- Asset Bundles: https://docs.databricks.com/en/dev-tools/bundles/index.html
- Bundle Variables: https://docs.databricks.com/en/dev-tools/bundles/variables.html


PATTERN 3: RESTRUCTURE STORAGE (RECOMMENDED LONG-TERM)

If you have influence over the storage layout, moving tenants from separate containers into directories within a single container is the cleanest long-term solution:

abfss://data@account.dfs.core.windows.net/tenants/tenantA/events/...
abfss://data@account.dfs.core.windows.net/tenants/tenantB/events/...

This unlocks:
- A single Auto Loader stream over the tenants/ prefix, using the _metadata.file_path column to extract the tenant name (input_file_name() is deprecated and not supported on Unity Catalog compute)
- Simplified Unity Catalog governance with a single external location
- No concerns about per-container notification limits

You can still maintain tenant-level access isolation using Azure RBAC with ADLS Gen2 ACLs on directories (since you have HNS enabled).
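
Extracting the tenant from the file path is then a simple pattern match. The regex below mirrors what you would hand to regexp_extract on the _metadata.file_path column; the directory layout is the one assumed above:

```python
import re

# Mirrors the pattern for regexp_extract(col("_metadata.file_path"), ...) under
# the tenants/<name>/events/... layout sketched above (layout is an assumption).
TENANT_RE = re.compile(r"/tenants/([^/]+)/")

def tenant_from_path(file_path: str):
    """Return the tenant segment of an ABFSS file path, or None if absent."""
    m = TENANT_RE.search(file_path)
    return m.group(1) if m else None

p = "abfss://data@account.dfs.core.windows.net/tenants/tenantA/events/x.json"
print(tenant_from_path(p))  # tenantA
```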


ANSWERING YOUR SPECIFIC QUESTIONS

Q1: Is there a recommended pattern? Are there limits on concurrent Autoloader streams?

The recommended pattern is Pattern 1 (append flows in a single DLT pipeline). There is no hard documented limit on Auto Loader streams within a single pipeline, but practical limits depend on cluster resources. For classic file notification mode, Azure has a limit of 500 per storage account. Using managed file events avoids this limit.

Q2: Could Unity Catalog external locations or volumes abstract over multiple containers?

Partially. You can create one external location per container, but each maps to exactly one storage path -- no wildcard or multi-container abstraction. The benefit is governance: you use a single storage credential (via Azure Access Connector with managed identity) referenced by all external locations, and Unity Catalog governs access via READ FILES permissions.
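
To make that concrete, here is an illustrative generator of the per-container CREATE EXTERNAL LOCATION statements. The location and credential names are placeholders; each statement would be executed with spark.sql:

```python
# Illustrative: emit one CREATE EXTERNAL LOCATION per tenant container.
# loc_<tenant> and the credential name are placeholders; run via spark.sql(stmt).
def external_location_sql(tenant: str, account: str, credential: str) -> str:
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS loc_{tenant} "
        f"URL 'abfss://prefix-{tenant}@{account}.dfs.core.windows.net/' "
        f"WITH (STORAGE CREDENTIAL {credential})"
    )

stmts = [
    external_location_sql(t, "youraccount", "my_access_connector_cred")
    for t in ["tenantA", "tenantB"]
]
```

All locations share one storage credential (the Azure Access Connector's managed identity), which keeps the governance surface small even with hundreds of containers.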

Q3: How do others handle per-tenant ingestion at scale?

The most common patterns are:
1. Single parameterized DLT pipeline with append flows (Pattern 1) -- best for cost efficiency when tenants share the same schema
2. Multiple parameterized pipelines via Asset Bundles (Pattern 2) -- best when tenants need isolated failure domains or have different schemas
3. Restructured storage with directory-based tenancy (Pattern 3) -- best long-term if you can influence the storage architecture


DOCUMENTATION REFERENCES

- Auto Loader overview: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/index.html
- Auto Loader options (Azure): https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/option...
- Auto Loader file notification mode: https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/file-n...
- DLT Append Flows: https://docs.databricks.com/en/ldp/flows.html
- DLT Limitations: https://docs.databricks.com/en/ldp/limitations.html
- DLT Best Practices: https://docs.databricks.com/en/ldp/best-practices.html
- External Locations (Azure): https://learn.microsoft.com/en-us/azure/databricks/connect/unity-catalog/cloud-storage/external-loca...
- Asset Bundles: https://docs.databricks.com/en/dev-tools/bundles/index.html

Hope this helps -- let me know if you have follow-up questions on any of these patterns!

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.