Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best Practices for implementing DLT, Autoloader in Workflows

Swathik
New Contributor II

I am in the process of designing a Medallion architecture where the data sources include REST API calls, JSON files, SQL Server, and Azure Event Hubs.

For the Silver and Gold layers, I plan to leverage Delta Live Tables (DLT). However, I am seeking guidance on the most effective approach to implement the Bronze layer, particularly in combination with Autoloader.

Specifically, for JSON file ingestion, I intend to use Autoloader with the trigger(availableNow=True) option. My understanding is that this option is not currently supported within DLT pipelines.

Could you please advise on recommended practices for implementing the Bronze layer to handle both batch and streaming ingestion scenarios in DLT, while ensuring compatibility with Autoloader?

 

1 ACCEPTED SOLUTION

Accepted Solutions

mark_ott
Databricks Employee

The best approach to the Bronze layer in a Medallion architecture with Delta Live Tables (DLT) is to balance batch and streaming ingestion patterns, especially when combining DLT with Autoloader. The trigger(availableNow=True) option for Autoloader is not currently supported within DLT pipelines, so batch-style ingestion that relies on this trigger must be orchestrated outside DLT or handled differently.

Bronze Layer Best Practices

  • For most JSON file ingestion scenarios, use Autoloader in streaming mode within DLT. This supports both file-arrival triggers and continuous streaming, while allowing schema evolution and integration with data quality checks (expectations).

  • Store the ingested raw data as a Delta table to maximize compatibility with downstream Silver and Gold transformations.

  • For mixed batch and streaming requirements, design your Bronze ingestion so that:

    • Azure Event Hubs and other continuous sources use Structured Streaming within DLT.

    • REST API calls and bulk JSON files are orchestrated as external batch processes that land files in a storage location watched by Autoloader. Even though availableNow is not available in DLT, you can still process new files incrementally as they arrive, or set up a separate process for batch-triggered ingestion.
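As a concrete illustration of the streaming-mode approach above, a Bronze table in DLT can be defined as follows. This is a minimal sketch; the table name, storage paths, and schema location are placeholders, and the spark session is provided by the DLT runtime:

```python
# Hypothetical DLT Bronze table that ingests JSON files with Autoloader.
# Runs inside a DLT pipeline on Databricks; paths and names are placeholders.
import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table(
    name="bronze_json_events",
    comment="Raw JSON files ingested incrementally via Autoloader."
)
def bronze_json_events():
    return (
        spark.readStream.format("cloudFiles")          # Autoloader source
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/json_events")
        .option("cloudFiles.inferColumnTypes", "true")  # allow schema inference/evolution
        .load("/mnt/landing/json_events/")
        .withColumn("_ingested_at", current_timestamp())  # ingestion audit column
    )
```

New files landing under the watched path are picked up incrementally on each pipeline update, with schema evolution tracked at the schema location.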

Handling Batch and Streaming Together

  • If you need batch semantics (one-off or scheduled ingestion of a discrete set of files), run a separate Spark job outside DLT that uses trigger(availableNow=True) and writes to a Bronze Delta table, which you then reference as a source table for your DLT pipeline.

  • For ongoing streaming or micro-batch ingestion, define your Bronze tables with DLT's standard streaming capabilities connected to Autoloader. This ensures both real-time and near-real-time data are processed efficiently and land in the same Bronze layer.
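The batch "catch-up" job described above might look like the following standalone script (run as a Workflows task, not inside DLT). The paths, checkpoint location, and target table name are assumptions for illustration:

```python
# Sketch of a one-shot Autoloader batch job run outside DLT.
# trigger(availableNow=True) processes all pending files, then stops.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (
    spark.readStream.format("cloudFiles")               # Autoloader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/bulk_json")
    .load("/mnt/landing/bulk_json/")
    .writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/bulk_json")
    .trigger(availableNow=True)                         # drain backlog, then terminate
    .toTable("bronze.bulk_json_events")                 # Bronze Delta table
)
query.awaitTermination()  # returns once the backlog is drained
```

Because the job checkpoints its progress, rerunning it on a schedule picks up only files that arrived since the previous run.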

Summary Table: Ingestion Options

| Source Type | Recommended Ingestion in Bronze Layer | Batch/Streaming | Integration with DLT Autoloader |
|---|---|---|---|
| JSON Files (Batch) | Spark job with Autoloader, trigger(availableNow), write to Delta | Batch | Register output as source table |
| JSON Files (Streaming) | DLT streaming table using Autoloader | Streaming | Full DLT support, no availableNow |
| Event Hubs | Structured Streaming in DLT | Streaming | Native DLT support |
| REST API | Orchestrate API pulls, land files, ingest as above | Batch/Streaming | External orchestration, then DLT |
| SQL Server | Periodic extract or change data capture, land to files or Delta | Batch/Streaming | External ingest, then DLT |
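For the Event Hubs row, one common pattern is to read through the Kafka-compatible endpoint that Azure Event Hubs exposes, so a DLT streaming table can use Spark's Kafka source directly. This is a sketch only; the namespace, hub name, and secret scope/key below are hypothetical:

```python
# Hypothetical DLT streaming table reading Azure Event Hubs through its
# Kafka-compatible endpoint. Namespace, hub name, and secrets are placeholders.
import dlt

EH_NS = "my-eventhubs-ns"   # Event Hubs namespace (placeholder)
EH_NAME = "events"          # event hub name (placeholder)

@dlt.table(name="bronze_eventhub_events")
def bronze_eventhub_events():
    # Connection string stored in a secret scope (placeholder scope/key names)
    conn = dbutils.secrets.get("my-scope", "eh-connection-string")
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", f"{EH_NS}.servicebus.windows.net:9093")
        .option("subscribe", EH_NAME)
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option(
            "kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
            f'required username="$ConnectionString" password="{conn}";',
        )
        .load()  # yields key/value binary columns plus Kafka metadata
    )
```

The value column arrives as bytes; downstream Silver tables would typically cast and parse it with from_json.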
 
 

By decoupling batch "catch-up" ingestion from ongoing streaming in the Bronze layer, you ensure compatibility, recoverability, and optimal use of DLT's features.


