Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Handling Dropped Records in Delta Live Tables with Watermark - Need Optimization Strategy

minhhung0507
New Contributor II

Hi Databricks Community, I'm encountering an issue with watermarks in Delta Live Tables that's causing data loss in my streaming pipeline. Let me explain my specific problem:

Current Situation

I've implemented watermarks for stateful processing in my Delta Live Tables pipeline. However, I'm observing that some records are being dropped, as evidenced by the monitoring graphs I've attached. The graphs show:

  • Spikes in the number of late rows dropped by watermark
  • One graph shows a peak of around 20 records dropped
  • Another graph displays multiple peaks with up to 250 records being dropped

Questions

  1. What are the recommended strategies to optimize watermark configuration to minimize data loss while maintaining processing efficiency?
  2. Is there a way to:
    • Capture these dropped records for later processing?
    • Implement a recovery mechanism to reprocess these dropped records?
    • Update the final table with these recovered records without causing duplicates or inconsistencies?

Additional Context

I understand that watermarks are essential for managing state and preventing out-of-memory issues, but I need to ensure data completeness for my use case. Any guidance on best practices or alternative approaches would be greatly appreciated.

2 REPLIES

Walter_C
Databricks Employee

The watermark threshold determines how late data can be before it is considered too late and dropped. A smaller threshold results in lower latency but increases the likelihood of dropping late records. Conversely, a larger threshold reduces data loss but may increase latency and require more resources.

Continuously monitor your streaming pipeline to understand the patterns of late data and adjust the watermark threshold accordingly. This can help balance between latency and data completeness.
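As a rough illustration, the threshold is the second argument to withWatermark on the event-time column. The sketch below assumes a DLT pipeline reading from an upstream streaming table called events_raw with an event_time column; the table, column, and the 10-minute threshold are placeholders chosen only to show where the setting lives.

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Windowed counts; the 10-minute watermark bounds how late data may arrive")
def events_aggregated():
    return (
        dlt.read_stream("events_raw")                 # hypothetical upstream streaming table
        .withWatermark("event_time", "10 minutes")    # larger threshold = fewer drops, more state
        .groupBy(F.window("event_time", "5 minutes"), "event_key")
        .count()
    )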

To reduce the number of records dropped as late when a stream starts against an existing Delta table, you can use the withEventTimeOrder option. It processes the initial snapshot in event time order, which lowers the likelihood of snapshot records being discarded as late events.

.option("withEventTimeOrder", "true")

You can implement a recovery mechanism by storing the dropped records in a separate Delta table for later processing. This can be achieved by configuring your streaming query to write late records to a different sink.
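Structured Streaming does not expose the rows the watermark discards, so one workaround, sketched below under that assumption, is a parallel flow that flags records arriving more than the allowed threshold after their event time and routes them to a quarantine Delta table. Table, column, and threshold values here are hypothetical.

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Quarantine for records likely to be dropped by the 10-minute watermark")
def late_events_quarantine():
    return (
        dlt.read_stream("events_raw")   # hypothetical upstream streaming table
        # Approximation: compare event_time with processing time using the same
        # threshold as the watermark; the engine's actual watermark is internal state.
        .where(F.col("event_time") < F.expr("current_timestamp() - INTERVAL 10 MINUTES"))
    )

From the quarantine table, a scheduled MERGE INTO the final table, keyed on the record's primary key, is one way to fold recovered records back in without creating duplicates.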


minhhung0507
New Contributor II


Dear @Walter_C, thank you for your detailed response regarding watermark handling in Delta Live Tables (DLT). I appreciate the guidance provided, but I would like further clarification on a couple of points related to our use case.

1. Auto-Saving Dropped Records Due to Watermark

We are currently using the APPLY CHANGES API in Delta Live Tables for our pipeline. Is there a built-in mechanism or recommended approach to automatically save records that are dropped due to watermark thresholds? Specifically, we are looking for:

  • Step-by-step guidelines or documentation on how to implement this functionality.
  • Code examples or configurations that demonstrate how to capture these late records into a separate Delta table for later processing.

If this is not natively supported, could you recommend an alternative approach to achieve this while maintaining pipeline efficiency?

2. Auto-Triggering Updates for Dropped Records

Once the dropped records are saved, we aim to reprocess them and update the final table automatically. Could you provide guidance on:

  • How to design a recovery mechanism that triggers updates for these saved records without manual intervention?
  • The best practices to ensure these updates do not introduce duplicates or inconsistencies in the final table.
  • Any optimizations we can apply to minimize resource usage and processing time during this recovery process.

We would greatly appreciate any additional insights or references to best practices for managing late-arriving data in DLT pipelines. Looking forward to your response!
