Hi @datastrange,
Great question -- this is a common architectural challenge in multi-tenant Azure Databricks environments, and you have already identified the key constraint: Auto Loader does not support wildcards in the container portion of the abfss:// path. The container name must be fully specified because it is part of the Azure storage endpoint (it maps to a DNS name, not a file path). So abfss://prefix-*@account.dfs.core.windows.net/... will never work -- that is by design in the ABFSS protocol, not an Auto Loader limitation.
Here is a breakdown of recommended patterns:
PATTERN 1 (RECOMMENDED): SINGLE DLT PIPELINE WITH MULTIPLE APPEND FLOWS
This is the most elegant pattern for your scenario. In Lakeflow Declarative Pipelines (formerly DLT), you can use append flows to fan multiple Auto Loader sources into a single streaming table. The key insight is that you can use a Python for loop to dynamically generate flows at pipeline definition time.
```python
from pyspark import pipelines as dp
from pyspark.sql.functions import lit

# Define your tenant list -- could also be loaded from a config table
tenants = ["tenantA", "tenantB", "tenantC"]  # Scale to hundreds

STORAGE_ACCOUNT = "youraccount"
COMMON_PATH = "data/events"

# Create the shared target streaming table once
dp.create_streaming_table("all_tenant_events")

# Dynamically create one append flow per tenant
for tenant in tenants:

    @dp.append_flow(target="all_tenant_events", name=f"ingest_{tenant}")
    def create_flow(tenant_name=tenant):  # default arg captures the loop variable
        path = f"abfss://prefix-{tenant_name}@{STORAGE_ACCOUNT}.dfs.core.windows.net/{COMMON_PATH}"
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.inferColumnTypes", "true")
            .load(path)
            .withColumn("tenant_name", lit(tenant_name))
        )
```
Why this works well:
- All tenants land in a single streaming table with a tenant_name discriminator column -- exactly what you want
- Each append flow maintains its own checkpoint, so a failure in one tenant does not block others
- The documentation states: "Any number of append flows can write to a particular target"
- A workspace supports up to 200 concurrent pipeline updates
Scaling considerations:
- Each append flow runs as its own streaming flow inside the pipeline. With hundreds of flows, you will need a cluster with enough cores/memory. Serverless compute with enhanced autoscaling is recommended.
- Consider using triggered mode rather than continuous processing. Schedule your pipeline to run periodically -- it processes all pending files across all tenants and then shuts down, which is more cost-effective.
- For file notification mode on Azure, there is a limit of 500 concurrent file notification pipelines per storage account using classic notifications. Using managed file events (cloudFiles.useManagedFileEvents = true) avoids this per-stream limit. Requires DBR 14.3 LTS+ and Unity Catalog external locations with file events enabled.
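As a sketch, the relevant Auto Loader options can be collected in one place so the flows above stay uniform. The option keys are real Auto Loader options; the helper function itself is purely illustrative:

```python
def autoloader_options(use_managed_file_events: bool = True) -> dict:
    """Build the cloudFiles option map for a tenant stream (illustrative helper)."""
    opts = {
        "cloudFiles.format": "json",
        "cloudFiles.inferColumnTypes": "true",
    }
    if use_managed_file_events:
        # Managed file events avoid the per-storage-account limit on
        # classic file notification pipelines (requires DBR 14.3 LTS+ and
        # UC external locations with file events enabled).
        opts["cloudFiles.useManagedFileEvents"] = "true"
    return opts

# Apply inside a flow body:
# spark.readStream.format("cloudFiles").options(**autoloader_options()).load(path)
```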
Loading the tenant list dynamically:
```python
# Option A: Load from a config Delta table
tenant_df = spark.read.table("config.tenants")
tenants = [row.tenant_name for row in tenant_df.collect()]

# Option B: List containers from Azure at pipeline definition time
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient(
    account_url=f"https://{STORAGE_ACCOUNT}.blob.core.windows.net",
    credential=...,  # credential elided
)
tenants = [
    c.name.replace("prefix-", "")
    for c in blob_service.list_containers(name_starts_with="prefix-")
]
```
Note: the tenant list is evaluated at pipeline definition time (when the pipeline starts). To pick up new tenants, restart/update the pipeline.
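One practical wrinkle when deriving flow names from container listings: container names may contain hyphens, which you may want to normalize before embedding them in flow names. A small illustrative helper (the normalization rule is an assumed convention, not a documented requirement):

```python
import re

def flow_name(tenant: str) -> str:
    # Replace characters outside [A-Za-z0-9_] so the generated flow name
    # stays identifier-like (assumed convention; adjust to taste).
    safe = re.sub(r"[^0-9a-zA-Z_]", "_", tenant)
    return f"ingest_{safe}"
```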
Docs:
- Append Flows: https://docs.databricks.com/en/ldp/flows.html
- DLT Best Practices: https://docs.databricks.com/en/ldp/best-practices.html
- Auto Loader Production: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html
- File Notification Mode: https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/file-n...
PATTERN 2: CONFIG-DRIVEN MULTIPLE PIPELINES VIA DATABRICKS ASSET BUNDLES
If you prefer stronger isolation between tenants (separate failure domains, independent scheduling, per-tenant monitoring), use Databricks Asset Bundles to deploy one pipeline per tenant from a parameterized template.
In your databricks.yml:
```yaml
variables:
  tenant_name:
    description: "Tenant identifier"
    default: "tenantA"

resources:
  pipelines:
    tenant_ingestion:
      name: "ingestion-${var.tenant_name}"
      target: "bronze"
      configuration:
        tenant_name: "${var.tenant_name}"
        storage_account: "youraccount"
      libraries:
        - notebook:
            path: ./notebooks/ingest_tenant.py
```
Then deploy multiple instances:
```bash
for tenant in tenantA tenantB tenantC; do
  databricks bundle deploy --var="tenant_name=$tenant" --target prod
done
```
This gives per-tenant failure isolation and independent scheduling, but at the cost of more compute resources (each pipeline has its own cluster).
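Inside `./notebooks/ingest_tenant.py`, the values from the `configuration` block are available through the Spark conf. A minimal sketch (the key names match the bundle above; `build_source_path` is a hypothetical helper):

```python
def build_source_path(get_conf) -> str:
    """Build the per-tenant source path; pass spark.conf.get inside the pipeline."""
    tenant = get_conf("tenant_name")
    account = get_conf("storage_account")
    return f"abfss://prefix-{tenant}@{account}.dfs.core.windows.net/data/events"

# In the pipeline notebook:
# path = build_source_path(spark.conf.get)
```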
Docs:
- Asset Bundles: https://docs.databricks.com/en/dev-tools/bundles/index.html
- Bundle Variables: https://docs.databricks.com/en/dev-tools/bundles/variables.html
PATTERN 3: RESTRUCTURE STORAGE (RECOMMENDED LONG-TERM)
If you have influence over the storage layout, moving tenants from separate containers into directories within a single container is the cleanest long-term solution:
```
abfss://data@account.dfs.core.windows.net/tenants/tenantA/events/...
abfss://data@account.dfs.core.windows.net/tenants/tenantB/events/...
```
This unlocks:
- A single Auto Loader stream with recursiveFileLookup=true, using the _metadata.file_path column to extract the tenant name (input_file_name() is deprecated and not supported on Unity Catalog shared clusters)
- Simplified Unity Catalog governance with a single external location
- A single file notification setup instead of one per container, with no concern about the per-storage-account notification limit
You can still maintain tenant-level access isolation using Azure RBAC with ADLS Gen2 ACLs on directories (since you have HNS enabled).
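The tenant extraction itself is just a regex over the file path. A sketch of the pure extraction logic (in the actual stream you would apply the same pattern to the `_metadata.file_path` column, e.g. via `regexp_extract`; the path layout is the one shown above):

```python
import re

# Matches the tenant segment in paths laid out as .../tenants/<name>/...
TENANT_RE = re.compile(r"/tenants/([^/]+)/")

def tenant_from_path(file_path: str):
    """Return the tenant name embedded in the file path, or None."""
    m = TENANT_RE.search(file_path)
    return m.group(1) if m else None
```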
ANSWERING YOUR SPECIFIC QUESTIONS
Q1: Is there a recommended pattern? Are there limits on concurrent Autoloader streams?
The recommended pattern is Pattern 1 (append flows in a single DLT pipeline). There is no hard documented limit on Auto Loader streams within a single pipeline, but practical limits depend on cluster resources. For classic file notification mode, Azure has a limit of 500 per storage account. Using managed file events avoids this limit.
Q2: Could Unity Catalog external locations or volumes abstract over multiple containers?
Partially. You can create one external location per container, but each maps to exactly one storage path -- no wildcard or multi-container abstraction. The benefit is governance: you use a single storage credential (via Azure Access Connector with managed identity) referenced by all external locations, and Unity Catalog governs access via READ FILES permissions.
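If you do go the one-external-location-per-container route, the definitions are uniform enough to script. A sketch that generates the SQL per tenant (the location-naming scheme and credential name are placeholders, not prescribed conventions):

```python
def external_location_sql(tenant: str, account: str, credential: str) -> str:
    # CREATE EXTERNAL LOCATION is standard Unity Catalog SQL; the
    # loc_<tenant> naming scheme is just an illustrative convention.
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS loc_{tenant} "
        f"URL 'abfss://prefix-{tenant}@{account}.dfs.core.windows.net/' "
        f"WITH (STORAGE CREDENTIAL {credential})"
    )

# for t in tenants:
#     spark.sql(external_location_sql(t, "youraccount", "tenant_cred"))
```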
Q3: How do others handle per-tenant ingestion at scale?
The most common patterns are:
1. Single parameterized DLT pipeline with append flows (Pattern 1) -- best for cost efficiency when tenants share the same schema
2. Multiple parameterized pipelines via Asset Bundles (Pattern 2) -- best when tenants need isolated failure domains or have different schemas
3. Restructured storage with directory-based tenancy (Pattern 3) -- best long-term if you can influence the storage architecture
DOCUMENTATION REFERENCES
- Auto Loader overview: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/index.html
- Auto Loader options (Azure): https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/option...
- Auto Loader file notification mode: https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/file-n...
- DLT Append Flows: https://docs.databricks.com/en/ldp/flows.html
- DLT Limitations: https://docs.databricks.com/en/ldp/limitations.html
- DLT Best Practices: https://docs.databricks.com/en/ldp/best-practices.html
- External Locations (Azure): https://learn.microsoft.com/en-us/azure/databricks/connect/unity-catalog/cloud-storage/external-loca...
- Asset Bundles: https://docs.databricks.com/en/dev-tools/bundles/index.html
Hope this helps -- let me know if you have follow-up questions on any of these patterns!
* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation I have available and from previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.