Databricks Community

dofre · ‎03-06-2024

In our Delta Live Table pipeline I am simply joining two streaming tables to a new streaming table.

We use the following code:

@Dlt.create_table()
def fact_event_faults():
    events = dlt.read_stream('event_list').withWatermark('TimeStamp', '4 hours')
    files = dlt.read_stream('file_list').withWatermark('Trigger_DateTime', '4 hours')
    event_faults = events.join(files, 
      (
        (events.TimeStamp >= (files.Trigger_DateTime - f.expr('INTERVAL 1 SECONDS'))) &
        (events.TimeStamp <= (files.Trigger_DateTime + f.expr('INTERVAL 1 SECONDS'))) &
        (events.Unique_Installation_ID == files.Unique_Installation_ID)
      ),
      how='leftouter')
    
    return event_faults

When running this, the event_list table has 12K records, the file_list has 4000 records but the event_faults table only has 76 records.

In an left outer join I would expect at least 12K records in the resulting table.

Executing the same query in a notebook without streaming and without watermarking, returns the expected 12K+ rows.

What am I missing here?

Kaniz_Fatma · ‎03-07-2024

Hi @dofre, It appears that you’ve encountered an unexpected discrepancy in the results of your Delta Live Table pipeline.

Let’s explore some potential reasons for this behavior:

Streaming and Watermarking:
- In your pipeline, you’re working with streaming tables, which have specific characteristics compared to batch processing.
- Streaming tables process data incrementally as new records arrive, and they use watermarks to manage event time.
- The watermark defines a threshold time beyond which late data is considered and processed.
- When using watermarks, the system ensures that only data within the specified time window is considered for joins and aggregations.
- Make sure that the watermark settings align with your expectations. If the watermark is set too aggressively, it might exclude some records from the join.
Event Time and Late Data:
- In streaming scenarios, event time (timestamps) plays a crucial role.
- The join condition you’ve defined involves comparing timestamps (TimeStamp and Trigger_DateTime).
- Ensure that the timestamps are correctly aligned and represent the actual event time.
- Late data (data arriving after the watermark threshold) might not be included in the join results. Verify if any late data is being dropped.
Unique Installation ID:
- Your join condition includes matching the Unique_Installation_ID between the two tables.
- Confirm that the values of Unique_Installation_ID are consistent across both tables.
- If there are discrepancies or missing values, it could affect the join results.
Left Outer Join Behavior:
- You mentioned that you expected at least 12K records in the resulting table due to the left outer join.
- In a left outer join, all records from the left table (events) are included, along with matching records from the right table (files).
- However, if there are no matching records in the right table for a given events record, the result will still include the left record with null values for the right table columns.
- Double-check if the join conditions are met for all records. If not, investigate why some records are not matching.
Debugging Steps:
- To troubleshoot further, consider the following steps:
  - Inspect the data in both events and files tables. Look for any anomalies or unexpected values.
  - Check if there are any duplicate records or data quality issues.
  - Review the watermark settings and ensure they align with your use case.
  - Temporarily disable watermarking and streaming to test the join behavior without those factors.
Notebook vs. Streaming:
- You mentioned that executing the same query in a notebook (without streaming and watermarking) returns the expected 12K+ rows.
- This suggests that the issue might be related to streaming or watermarking.
- Compare the execution plans and query results between the notebook and the streaming pipeline to identify any differences.