
Databricks DLT Joins: Streaming table join with Delta table is reading 2 Billion records per batch

gudurusreddy99
Visitor

In my DLT pipeline, a streaming table joined with a static Delta table is reading all 2 billion records from the Delta table on every micro-batch.

How can I avoid re-reading all 2 billion records on every micro-batch?

Your suggestions and feedback will be highly appreciated.

 

Thanks

Srini

1 REPLY

ManojkMohan
Honored Contributor

@gudurusreddy99   

Root Cause
In a stream-static join, that is, a streaming source joined with a static Delta table, Spark re-scans the entire static Delta table for every micro-batch unless the static side is filtered or joined incrementally.

https://www.linkedin.com/pulse/overcoming-databricks-dlt-streaming-table-join-limitations-roger-ding...

Possible solutions:

1. Incremental Join Logic
Build the pipeline so each micro-batch joins only against the rows of the Delta table that are new or changed since the last batch.

Use watermarks and incremental filter conditions to limit the join to the relevant time window or ID range; this shrinks how much of the static Delta table each batch scans. A sketch follows after the link below.

https://community.databricks.com/t5/data-engineering/incremental-join-transformation-using-delta-liv...
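A minimal sketch of this pattern in a DLT Python pipeline. The table names (orders_raw, customer_dim), the order_ts/updated_at columns, and the 7-day freshness window are assumptions for illustration, not from your pipeline:

```python
# Minimal sketch, assuming a DLT Python pipeline. Table and column names
# (orders_raw, customer_dim, order_ts, updated_at) are illustrative.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Orders joined only against recently changed customer rows")
def orders_enriched():
    # Streaming side: each micro-batch contains only new source rows.
    orders = dlt.read_stream("orders_raw").withWatermark("order_ts", "30 minutes")

    # Static side: pre-filter the large Delta table so the join scans a
    # small, recent slice instead of all ~2B rows on every micro-batch.
    recent_customers = dlt.read("customer_dim").filter(
        F.col("updated_at") >= F.date_sub(F.current_date(), 7)
    )
    return orders.join(recent_customers, on="customer_id", how="left")
```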

2. Partitioning and Z-Ordering
Partition the static Delta table on a low-cardinality column your queries can prune (for example, a date or region column); partitioning directly on a high-cardinality join key like customer_id tends to create too many small partitions.

Apply Z-Ordering on the main join keys, such as customer_id, to improve clustering and further decrease file I/O and scan overhead. A sketch follows below.
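As a sketch, the layout plus Z-Ordering could look like this (table and column names are illustrative; region stands in for whatever low-cardinality column your queries actually filter on):

```python
# Sketch: lay out the static dimension for data skipping.
# Table and column names are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_dim (
        customer_id BIGINT,
        region      STRING,
        updated_at  TIMESTAMP
    )
    USING DELTA
    PARTITIONED BY (region)  -- low-cardinality partition column
""")

# Z-Ordering co-locates nearby customer_id values in the same files, so a
# join or filter on customer_id touches far fewer files.
spark.sql("OPTIMIZE customer_dim ZORDER BY (customer_id)")
```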

3. Broadcast Joins (for Small Reference Tables)
If the Delta table is small enough to fit in executor memory, enable a broadcast join so each micro-batch joins against an in-memory copy instead of re-scanning the table (sketch after the link below).

https://community.databricks.com/t5/technical-blog/top-5-tips-to-build-delta-live-tables-dlt-pipelin...
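A sketch of the explicit broadcast hint; the country_dim lookup table and country_code key are hypothetical. Note that Spark also auto-broadcasts tables smaller than spark.sql.autoBroadcastJoinThreshold:

```python
# Sketch: broadcast a small reference table so every micro-batch joins
# against an in-memory copy instead of re-reading it. Names are illustrative.
from pyspark.sql import functions as F

orders_stream = spark.readStream.table("orders_raw")
country_dim = spark.read.table("country_dim")  # small lookup table

enriched = orders_stream.join(
    F.broadcast(country_dim),  # explicit broadcast hint
    on="country_code",
    how="left",
)
```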

4. Use Change Data Capture (CDC) or Materialized Views
For Delta tables that change frequently or grow over time, implement CDC to process only updated rows in each batch.

Alternatively, materialize incremental results in intermediate tables and join the streaming input with only the daily or hourly chunk rather than the entire history. A CDF-based sketch follows after the link below.

https://stackoverflow.com/questions/75094547/creating-a-delta-live-table-for-reading-incremental-dat...
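One way to sketch the CDC route with Delta Change Data Feed (the dimension table must have delta.enableChangeDataFeed = true set; the hard-coded version number is purely illustrative):

```python
# Sketch: read only rows that changed in the dimension since the last
# processed commit via Delta Change Data Feed (CDF). In a real pipeline,
# persist last_version in a checkpoint/state table instead of hard-coding it.
last_version = 1042  # illustrative: last Delta commit version already handled

changed_customers = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_version + 1)
    .table("customer_dim")
    .filter("_change_type IN ('insert', 'update_postimage')")
)
# Join each streaming micro-batch against this changed slice only.
```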

Next Steps:
Examine the partition structure of your current Delta table and make improvements as needed.

Add time or ID-based filters in your static or lookup table scans whenever possible.

If applicable, refactor your DLT pipeline to maintain gold or silver incremental tables as pre-filtered join sources.