@gudurusreddy99
Root Cause
When performing stream-static joins, which means joining a streaming source with a static Delta table, Spark will scan the entire Delta table for each micro-batch if the static table is not filtered or incrementally joined.
https://www.linkedin.com/pulse/overcoming-databricks-dlt-streaming-table-join-limitations-roger-ding...
Solution Thinking:
1. Incremental Join Logic
Build your pipeline to join only new or changed data in the Delta table since the last batch with the current micro-batch from the streaming source.
Use watermarks and incremental filter conditions to limit joined records to the relevant time window or ID range. This will reduce the scan of the static Delta table.
https://community.databricks.com/t5/data-engineering/incremental-join-transformation-using-delta-liv...
2. Partitioning and Z-Ordering
Partition the static Delta table based on the join key, such as customer_id. This will help optimize data skipping during joins.
Apply Z-Ordering on the main join keys to improve clustering and further decrease file I/O and scan overhead.
3. Broadcast Joins (for Small Reference Tables)
If the Delta table is relatively small, enable broadcast joins in your query.
https://community.databricks.com/t5/technical-blog/top-5-tips-to-build-delta-live-tables-dlt-pipelin...
4. Use Change Data Capture (CDC) or Materialized Views
For Delta tables that change frequently or grow over time, implement CDC to process only updated rows in each batch.
Alternatively, you can materialize incremental results in intermediate tables and join streaming input with only the daily or hourly chunk, not the entire history.
https://stackoverflow.com/questions/75094547/creating-a-delta-live-table-for-reading-incremental-dat...
Next Steps:
Examine the partition structure of your current Delta table and make improvements as needed.
Add time or ID-based filters in your static or lookup table scans whenever possible.
If applicable, refactor your DLT pipeline to maintain gold or silver incremental tables as pre-filtered join sources.