
Left Outer Join returns an Inner Join in Delta Live Tables

dofre
New Contributor II

In our Delta Live Tables pipeline I am simply joining two streaming tables into a new streaming table.

We use the following code:

 

import dlt
import pyspark.sql.functions as f

@dlt.table()
def fact_event_faults():
    events = dlt.read_stream('event_list').withWatermark('TimeStamp', '4 hours')
    files = dlt.read_stream('file_list').withWatermark('Trigger_DateTime', '4 hours')
    event_faults = events.join(files,
      (
        (events.TimeStamp >= (files.Trigger_DateTime - f.expr('INTERVAL 1 SECONDS'))) &
        (events.TimeStamp <= (files.Trigger_DateTime + f.expr('INTERVAL 1 SECONDS'))) &
        (events.Unique_Installation_ID == files.Unique_Installation_ID)
      ),
      how='leftouter')

    return event_faults

 

When running this, the event_list table has 12K records and the file_list table has 4,000, but the resulting event_faults table has only 76 records.

With a left outer join I would expect at least 12K records in the resulting table.

Executing the same join in a notebook, without streaming and without watermarking, returns the expected 12K+ rows.

What am I missing here?

1 REPLY

Kaniz_Fatma
Community Manager

Hi @dofre, it appears you've run into an unexpected discrepancy in the results of your Delta Live Tables pipeline.

Let’s explore some potential reasons for this behavior:

  1. Streaming and Watermarking:

    • In your pipeline, you’re working with streaming tables, which have specific characteristics compared to batch processing.
    • Streaming tables process data incrementally as new records arrive, and they use watermarks to manage event time.
    • The watermark defines a threshold time: data arriving later than that threshold is considered too late and is dropped rather than processed.
    • When using watermarks, the system ensures that only data within the specified time window is considered for joins and aggregations.
    • Make sure that the watermark settings align with your expectations. If the watermark is set too aggressively, it might exclude some records from the join.
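The watermark arithmetic above can be sketched in plain Python (this is an illustration of the concept, not the Spark API; the function name and sample timestamps are made up):

```python
from datetime import datetime, timedelta

# Hedged sketch: how a 4-hour watermark decides whether a late record
# is still eligible for a stateful streaming join.
def is_within_watermark(event_time, max_event_time_seen, delay=timedelta(hours=4)):
    # The watermark trails the maximum event time seen so far by the delay;
    # records older than the watermark are evicted from the join state.
    watermark = max_event_time_seen - delay
    return event_time >= watermark

latest = datetime(2024, 1, 1, 12, 0)
print(is_within_watermark(datetime(2024, 1, 1, 9, 0), latest))  # True: 3 h late, kept
print(is_within_watermark(datetime(2024, 1, 1, 7, 0), latest))  # False: 5 h late, dropped
```

If your data routinely arrives more than 4 hours after its event time, those rows silently disappear from the join.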
  2. Event Time and Late Data:

    • In streaming scenarios, event time (timestamps) plays a crucial role.
    • The join condition you’ve defined involves comparing timestamps (TimeStamp and Trigger_DateTime).
    • Ensure that the timestamps are correctly aligned and represent the actual event time.
    • Late data (data arriving after the watermark threshold) might not be included in the join results. Verify if any late data is being dropped.
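Because your join window is only ±1 second, even a small clock or timezone skew between TimeStamp and Trigger_DateTime kills the match. A minimal sketch (hypothetical helper and sample times, not the pipeline's actual data):

```python
from datetime import datetime, timedelta

# Mirrors the join's +/- 1 second window on the two timestamp columns.
def within_one_second(event_ts, trigger_ts):
    return abs((event_ts - trigger_ts).total_seconds()) <= 1

t = datetime(2024, 1, 1, 12, 0, 0)
print(within_one_second(t, t + timedelta(milliseconds=500)))  # True
print(within_one_second(t, t + timedelta(hours=1)))           # False: e.g. a one-hour timezone skew
```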
  3. Unique Installation ID:

    • Your join condition includes matching the Unique_Installation_ID between the two tables.
    • Confirm that the values of Unique_Installation_ID are consistent across both tables.
    • If there are discrepancies or missing values, it could affect the join results.
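One quick way to see the impact of mismatched keys, sketched here with made-up IDs (in a notebook you would get the same answer at scale with a left_anti join):

```python
# Any Unique_Installation_ID present in event_list but absent from
# file_list can only ever produce a null-padded row in a left outer join.
event_ids = {"inst-1", "inst-2", "inst-3"}
file_ids = {"inst-2", "inst-3", "inst-4"}
unmatched = sorted(event_ids - file_ids)
print(unmatched)  # ['inst-1']
```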
  4. Left Outer Join Behavior:

    • You mentioned that you expected at least 12K records in the resulting table due to the left outer join.
    • In a left outer join, all records from the left table (events) are included, along with matching records from the right table (files).
    • However, if there are no matching records in the right table for a given events record, the result will still include the left record with null values for the right table columns.
    • Note that in a stream-stream left outer join, those null-padded rows are only emitted once the watermark has passed for the left row, because the engine must first be certain that no match can still arrive. Until then the output can look exactly like an inner join, which matches the symptom you are describing.
    • Double-check whether the join conditions are met for all records. If not, investigate why some records are not matching.
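The semantics you expect can be shown in a few lines of plain Python (sample rows are invented; this is the join contract, not Spark code):

```python
# Left-outer-join semantics: every left row survives; rows without a
# match carry None for the right-hand side.
events = [("inst-1", "fault_a"), ("inst-2", "fault_b")]
files = {"inst-2": "file_x"}
result = [(inst, fault, files.get(inst)) for inst, fault in events]
print(result)  # [('inst-1', 'fault_a', None), ('inst-2', 'fault_b', 'file_x')]
```

In the streaming case the first tuple, the unmatched one, is exactly the kind of row whose emission is deferred until the watermark expires.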
  5. Debugging Steps:

    • To troubleshoot further, consider the following steps:
      • Inspect the data in both events and files tables. Look for any anomalies or unexpected values.
      • Check if there are any duplicate records or data quality issues.
      • Review the watermark settings and ensure they align with your use case.
      • Temporarily disable watermarking and streaming to test the join behavior without those factors.
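For the duplicate check in particular, a quick probe on the join key is often enough; here is a sketch with hypothetical keys:

```python
from collections import Counter

# Duplicate keys on the right side of a join multiply matched rows
# and distort the resulting row counts.
keys = ["inst-1", "inst-2", "inst-2", "inst-3"]
dupes = sorted(k for k, n in Counter(keys).items() if n > 1)
print(dupes)  # ['inst-2']
```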
  6. Notebook vs. Streaming:

    • You mentioned that executing the same query in a notebook (without streaming and watermarking) returns the expected 12K+ rows.
    • This suggests that the issue might be related to streaming or watermarking.
    • Compare the execution plans and query results between the notebook and the streaming pipeline to identify any differences.

If you need further assistance, feel free to provide additional details or share any relevant logs for deeper analysis. 🕵️‍♂️🔍

 
