<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Best pattern for ingesting data from hundreds of separate ADLS Gen2 containers into Databricks? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-pattern-for-ingesting-data-from-hundreds-of-separate-adls/m-p/150097#M53240</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/218845"&gt;@datastrange&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Great question -- this is a common architectural challenge in multi-tenant Azure Databricks environments, and you have already identified the key constraint: Auto Loader does not support wildcards in the container portion of the abfss:// path. The container name must be fully specified because it sits in the authority part of the URI (container@account.dfs.core.windows.net), not in the file path, and glob patterns are only expanded within the path. So abfss://prefix-*@account.dfs.core.windows.net/... will never work -- that follows from how ABFS URIs are resolved, not from an Auto Loader limitation.&lt;/P&gt;
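&lt;P&gt;To make the container-versus-path distinction concrete, here is a minimal sketch using only the Python standard library. The helper name split_abfss is hypothetical and purely illustrative; it just shows that the container lands in the authority of the URI rather than in the path, which is where globbing would apply.&lt;/P&gt;
&lt;PRE&gt;
```python
from urllib.parse import urlsplit


def split_abfss(uri):
    """Split an abfss:// URI into (container, account_host, path)."""
    parts = urlsplit(uri)
    if parts.scheme != "abfss":
        raise ValueError(f"not an abfss URI: {uri}")
    # The container is the user-info part of the authority, before the '@'.
    # Only parts.path is a file path where a glob could be expanded.
    return parts.username, parts.hostname, parts.path


container, host, path = split_abfss(
    "abfss://prefix-tenantA@account.dfs.core.windows.net/data/events"
)
print(container)
print(host)
print(path)
```
&lt;/PRE&gt;
&lt;P&gt;A wildcard such as prefix-* would show up in the first element of that tuple, never in the path.&lt;/P&gt;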
&lt;P&gt;Here is a breakdown of recommended patterns:&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;PATTERN 1 (RECOMMENDED): SINGLE DLT PIPELINE WITH MULTIPLE APPEND FLOWS&lt;/P&gt;
&lt;P&gt;This is the most elegant pattern for your scenario. In Lakeflow Declarative Pipelines (formerly DLT), you can use append flows to fan multiple Auto Loader sources into a single streaming table. The key insight is that you can use a Python for loop to dynamically generate flows at pipeline definition time.&lt;/P&gt;
&lt;PRE&gt;from pyspark import pipelines as dp
from pyspark.sql.functions import lit

# Define your tenant list -- could also be loaded from a config table
tenants = ["tenantA", "tenantB", "tenantC"]  # Scale to hundreds

STORAGE_ACCOUNT = "youraccount"
COMMON_PATH = "data/events"

# Create the shared target streaming table once
dp.create_streaming_table("all_tenant_events")

# Dynamically create one append flow per tenant
for tenant in tenants:
    # Bind the loop variable as a default argument so each flow keeps its own tenant
    @dp.append_flow(target="all_tenant_events", name=f"ingest_{tenant}")
    def create_flow(tenant_name=tenant):
        path = f"abfss://prefix-{tenant_name}@{STORAGE_ACCOUNT}.dfs.core.windows.net/{COMMON_PATH}"
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.inferColumnTypes", "true")
            .load(path)
            .withColumn("tenant_name", lit(tenant_name))
        )&lt;/PRE&gt;
&lt;P&gt;Why this works well:&lt;/P&gt;
&lt;P&gt;- All tenants land in a single streaming table with a tenant_name discriminator column -- exactly what you want&lt;BR /&gt;- Each append flow maintains its own checkpoint, so a failure in one tenant does not block others&lt;BR /&gt;- The documentation states: "Any number of append flows can write to a particular target"&lt;BR /&gt;- The whole fan-in runs as one pipeline update, so it counts only once against the workspace limit of 200 concurrent pipeline updates&lt;/P&gt;
&lt;P&gt;Scaling considerations:&lt;/P&gt;
&lt;P&gt;- Each append flow runs as a separate streaming query inside the pipeline. With hundreds of flows, you will need a cluster with enough cores/memory. Serverless compute with enhanced autoscaling is recommended.&lt;BR /&gt;- Consider using triggered mode rather than continuous processing. Schedule your pipeline to run periodically -- it processes all pending files across all tenants and then shuts down, which is more cost-effective.&lt;BR /&gt;- For file notification mode on Azure, classic notifications are limited to 500 concurrent file notification pipelines per storage account. Managed file events (cloudFiles.useManagedFileEvents = true) avoid this limit; they require DBR 14.3 LTS+ and Unity Catalog external locations with file events enabled.&lt;/P&gt;
&lt;P&gt;Loading the tenant list dynamically:&lt;/P&gt;
&lt;P&gt;# Option A: Load from a config Delta table&lt;BR /&gt;tenant_df = spark.read.table("config.tenants")&lt;BR /&gt;tenants = [row.tenant_name for row in tenant_df.collect()]&lt;/P&gt;
&lt;P&gt;# Option B: List containers from Azure at pipeline definition time&lt;BR /&gt;from azure.storage.blob import BlobServiceClient&lt;BR /&gt;blob_service = BlobServiceClient(account_url=f"https://{STORAGE_ACCOUNT}.blob.core.windows.net", credential=...)&lt;BR /&gt;tenants = [c.name.replace("prefix-", "") for c in blob_service.list_containers(name_starts_with="prefix-")]&lt;/P&gt;
&lt;P&gt;Note: the tenant list is evaluated at pipeline definition time, that is, each time a pipeline update starts. Flows are not added while an update is running, so to pick up new tenants, add them to the config source and start a new update.&lt;/P&gt;
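&lt;P&gt;Whichever option supplies the container names, it can help to normalize them before the flows are generated. The following is a small sketch in plain Python, assuming the prefix- naming convention from the question; the helper names tenants_from_containers and tenant_path are hypothetical and not part of any Databricks or Azure API.&lt;/P&gt;
&lt;PRE&gt;
```python
PREFIX = "prefix-"  # naming convention taken from the question


def tenants_from_containers(container_names, prefix=PREFIX):
    """Return sorted, de-duplicated tenant names for containers matching the prefix."""
    tenants = set()
    for name in container_names:
        # Skip unrelated containers and a bare prefix with no tenant part.
        if name.startswith(prefix) and len(name) != len(prefix):
            tenants.add(name[len(prefix):])
    return sorted(tenants)


def tenant_path(tenant, account, common_path, prefix=PREFIX):
    """Build the abfss:// source path for one tenant."""
    return f"abfss://{prefix}{tenant}@{account}.dfs.core.windows.net/{common_path}"


names = ["prefix-tenantB", "logs", "prefix-tenantA", "prefix-"]
tenants = tenants_from_containers(names)
print(tenants)
print(tenant_path(tenants[0], "youraccount", "data/events"))
```
&lt;/PRE&gt;
&lt;P&gt;Sorting keeps the generated flow names stable from one update to the next, which makes the pipeline graph easier to compare over time.&lt;/P&gt;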
&lt;P&gt;Docs:&lt;BR /&gt;- Append Flows: &lt;A href="https://docs.databricks.com/en/ldp/flows.html" target="_blank"&gt;https://docs.databricks.com/en/ldp/flows.html&lt;/A&gt;&lt;BR /&gt;- DLT Best Practices: &lt;A href="https://docs.databricks.com/en/ldp/best-practices.html" target="_blank"&gt;https://docs.databricks.com/en/ldp/best-practices.html&lt;/A&gt;&lt;BR /&gt;- Auto Loader Production: &lt;A href="https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html" target="_blank"&gt;https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html&lt;/A&gt;&lt;BR /&gt;- File Notification Mode: &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/file-notification-mode" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/file-notification-mode&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;PATTERN 2: CONFIG-DRIVEN MULTIPLE PIPELINES VIA DATABRICKS ASSET BUNDLES&lt;/P&gt;
&lt;P&gt;If you prefer stronger isolation between tenants (separate failure domains, independent scheduling, per-tenant monitoring), use Databricks Asset Bundles to deploy one pipeline per tenant from a parameterized template.&lt;/P&gt;
&lt;P&gt;In your databricks.yml:&lt;/P&gt;
&lt;PRE&gt;variables:
  tenant_name:
    description: "Tenant identifier"
    default: "tenantA"

resources:
  pipelines:
    tenant_ingestion:
      name: "ingestion-${var.tenant_name}"
      target: "bronze"
      configuration:
        tenant_name: "${var.tenant_name}"
        storage_account: "youraccount"
      libraries:
        - notebook:
            path: ./notebooks/ingest_tenant.py&lt;/PRE&gt;
&lt;P&gt;Then deploy multiple instances:&lt;/P&gt;
&lt;P&gt;for tenant in tenantA tenantB tenantC; do&lt;BR /&gt;databricks bundle deploy --var="tenant_name=$tenant" --target prod&lt;BR /&gt;done&lt;/P&gt;
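&lt;P&gt;This gives per-tenant failure isolation and independent scheduling, but at the cost of more compute resources (each pipeline has its own cluster). One caveat: a bundle deployment is identified by bundle name and target, so running the loop above against a single target updates the same pipeline in place on each iteration instead of creating a new one. To end up with one pipeline per tenant, give each tenant its own target or its own entry under resources.pipelines.&lt;/P&gt;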
&lt;P&gt;Docs:&lt;BR /&gt;- Asset Bundles: &lt;A href="https://docs.databricks.com/en/dev-tools/bundles/index.html" target="_blank"&gt;https://docs.databricks.com/en/dev-tools/bundles/index.html&lt;/A&gt;&lt;BR /&gt;- Bundle Variables: &lt;A href="https://docs.databricks.com/en/dev-tools/bundles/variables.html" target="_blank"&gt;https://docs.databricks.com/en/dev-tools/bundles/variables.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;PATTERN 3: RESTRUCTURE STORAGE (RECOMMENDED LONG-TERM)&lt;/P&gt;
&lt;P&gt;If you have influence over the storage layout, moving tenants from separate containers into directories within a single container is the cleanest long-term solution:&lt;/P&gt;
&lt;P&gt;abfss://data@account.dfs.core.windows.net/tenants/tenantA/events/...&lt;BR /&gt;abfss://data@account.dfs.core.windows.net/tenants/tenantB/events/...&lt;/P&gt;
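&lt;P&gt;This unlocks:&lt;BR /&gt;- A single Auto Loader stream over a path glob such as tenants/*/events, using the _metadata.file_path column to extract the tenant name (input_file_name() is deprecated and not available on all Unity Catalog compute modes)&lt;BR /&gt;- Simplified Unity Catalog governance with a single external location&lt;BR /&gt;- No concerns about the per-storage-account notification limit, since a single stream needs only one notification setup&lt;/P&gt;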
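&lt;P&gt;As a rough illustration of pulling the tenant name out of a file path under that layout, here is a plain-Python sketch. It assumes the tenants/TENANT/events directory structure shown above; the names TENANT_PATTERN and tenant_from_path are hypothetical. Inside a pipeline, the same regular expression could be passed to regexp_extract on a file path column such as _metadata.file_path, though that is not shown here.&lt;/P&gt;
&lt;PRE&gt;
```python
import re

# Assumed layout from the example above: .../tenants/TENANT/events/...
TENANT_PATTERN = re.compile(r"/tenants/([^/]+)/")


def tenant_from_path(file_path):
    """Return the tenant directory name from a file path, or None if absent."""
    match = TENANT_PATTERN.search(file_path)
    return match.group(1) if match else None


sample = "abfss://data@account.dfs.core.windows.net/tenants/tenantA/events/2026/03/07/part-0001.json"
print(tenant_from_path(sample))
```
&lt;/PRE&gt;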
&lt;P&gt;You can still maintain tenant-level access isolation using Azure RBAC with ADLS Gen2 ACLs on directories (since you have HNS enabled).&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;ANSWERING YOUR SPECIFIC QUESTIONS&lt;/P&gt;
&lt;P&gt;Q1: Is there a recommended pattern? Are there limits on concurrent Autoloader streams?&lt;/P&gt;
&lt;P&gt;The recommended pattern is Pattern 1 (append flows in a single DLT pipeline). There is no hard documented limit on Auto Loader streams within a single pipeline, but practical limits depend on cluster resources. For classic file notification mode on Azure, there is a limit of 500 file notification pipelines per storage account. Using managed file events avoids this limit.&lt;/P&gt;
&lt;P&gt;Q2: Could Unity Catalog external locations or volumes abstract over multiple containers?&lt;/P&gt;
&lt;P&gt;Partially. You can create one external location per container, but each maps to exactly one storage path -- no wildcard or multi-container abstraction. The benefit is governance: you use a single storage credential (via Azure Access Connector with managed identity) referenced by all external locations, and Unity Catalog governs access via READ FILES permissions.&lt;/P&gt;
&lt;P&gt;Q3: How do others handle per-tenant ingestion at scale?&lt;/P&gt;
&lt;P&gt;The most common patterns are:&lt;BR /&gt;1. Single parameterized DLT pipeline with append flows (Pattern 1) -- best for cost efficiency when tenants share the same schema&lt;BR /&gt;2. Multiple parameterized pipelines via Asset Bundles (Pattern 2) -- best when tenants need isolated failure domains or have different schemas&lt;BR /&gt;3. Restructured storage with directory-based tenancy (Pattern 3) -- best long-term if you can influence the storage architecture&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;DOCUMENTATION REFERENCES&lt;/P&gt;
&lt;P&gt;- Auto Loader overview: &lt;A href="https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/index.html" target="_blank"&gt;https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/index.html&lt;/A&gt;&lt;BR /&gt;- Auto Loader options (Azure): &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/options" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/options&lt;/A&gt;&lt;BR /&gt;- Auto Loader file notification mode: &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/file-notification-mode" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/file-notification-mode&lt;/A&gt;&lt;BR /&gt;- DLT Append Flows: &lt;A href="https://docs.databricks.com/en/ldp/flows.html" target="_blank"&gt;https://docs.databricks.com/en/ldp/flows.html&lt;/A&gt;&lt;BR /&gt;- DLT Limitations: &lt;A href="https://docs.databricks.com/en/ldp/limitations.html" target="_blank"&gt;https://docs.databricks.com/en/ldp/limitations.html&lt;/A&gt;&lt;BR /&gt;- DLT Best Practices: &lt;A href="https://docs.databricks.com/en/ldp/best-practices.html" target="_blank"&gt;https://docs.databricks.com/en/ldp/best-practices.html&lt;/A&gt;&lt;BR /&gt;- External Locations (Azure): &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/connect/unity-catalog/cloud-storage/external-locations" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/connect/unity-catalog/cloud-storage/external-locations&lt;/A&gt;&lt;BR /&gt;- Asset Bundles: &lt;A href="https://docs.databricks.com/en/dev-tools/bundles/index.html" target="_blank"&gt;https://docs.databricks.com/en/dev-tools/bundles/index.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Hope this helps -- let me know if you have follow-up questions on any of these patterns!&lt;/P&gt;
&lt;P&gt;* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.&lt;/P&gt;</description>
    <pubDate>Sat, 07 Mar 2026 20:22:50 GMT</pubDate>
    <dc:creator>SteveOstrowski</dc:creator>
    <dc:date>2026-03-07T20:22:50Z</dc:date>
    <item>
      <title>Best pattern for ingesting data from hundreds of separate ADLS Gen2 containers into Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/best-pattern-for-ingesting-data-from-hundreds-of-separate-adls/m-p/149991#M53215</link>
      <description>&lt;P class=""&gt;We're building a lakehouse on Azure Databricks with Unity Catalog. Our data lands in Azure Data Lake Storage Gen2 (Hierarchical Namespace enabled) as JSON files. The challenge is multi-tenancy: each tenant's data is written to a &lt;STRONG&gt;separate container&lt;/STRONG&gt; in the same storage account, following a naming convention like prefix-tenantA, prefix-tenantB, etc. We currently have a handful of tenants in dev but expect to scale to a few hundred.&lt;/P&gt;&lt;P class=""&gt;We need to get all this data into Databricks, ideally into shared tables (all tenants in one table, with a tenant-name column to distinguish them).&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;What we've tried:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Autoloader with container-level wildcards&lt;/STRONG&gt; (abfss://prefix-*@account.dfs.core.windows.net/common_path/...) — does not work. Wildcards are not supported in the container portion of the ABFSS path.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Single Autoloader with multiple paths&lt;/STRONG&gt; (string-splitting or passing a list of container paths) — only reads from the first container.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;One Autoloader per container&lt;/STRONG&gt; — this works, but raises concerns about scale: can we run 100+ Autoloader pipelines efficiently? 
What are the compute cost and monitoring implications?&lt;/LI&gt;&lt;/OL&gt;&lt;P class=""&gt;&lt;STRONG&gt;What we're considering:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;A config-driven approach where a configuration file lists all tenant names, and a deployment process creates/updates a DLT pipeline per tenant automatically.&lt;/LI&gt;&lt;LI&gt;Alternatively, restructuring storage so tenants are directories inside one container instead of separate containers (but customers prefer container-level isolation).&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;STRONG&gt;Our questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;Is there a recommended pattern for ingesting from many separate containers into Databricks? Are there limits on how many Autoloader streams can run concurrently?&lt;/LI&gt;&lt;LI&gt;Could Unity Catalog external locations or volumes be used to abstract over multiple containers without running separate Autoloader instances — for example, mounting all tenant containers as a single logical location?&lt;/LI&gt;&lt;LI&gt;For those running multi-tenant Databricks lakehouses at scale: how do you handle per-tenant ingestion? Separate pipelines, a single parameterized pipeline, or something else entirely?&lt;/LI&gt;&lt;/OL&gt;&lt;P class=""&gt;&lt;STRONG&gt;Environment:&lt;/STRONG&gt; Azure Databricks, Unity Catalog, ADLS Gen2 with HNS, DLT pipelines deployed via Databricks Asset Bundles, managed identity authentication.&lt;/P&gt;&lt;P class=""&gt;Any guidance or experience reports appreciated. We'd especially like to hear from anyone running 50+ concurrent Autoloader streams.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Mar 2026 12:23:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-pattern-for-ingesting-data-from-hundreds-of-separate-adls/m-p/149991#M53215</guid>
      <dc:creator>datastrange</dc:creator>
      <dc:date>2026-03-06T12:23:52Z</dc:date>
    </item>
    <item>
      <title>Re: Best pattern for ingesting data from hundreds of separate ADLS Gen2 containers into Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/best-pattern-for-ingesting-data-from-hundreds-of-separate-adls/m-p/150097#M53240</link>
      <pubDate>Sat, 07 Mar 2026 20:22:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-pattern-for-ingesting-data-from-hundreds-of-separate-adls/m-p/150097#M53240</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-07T20:22:50Z</dc:date>
    </item>
  </channel>
</rss>

