Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Stream to static join - late arriving records

yit337
Contributor
I have a stream to static join, but some of the rows in the static table arrive later than the linked rows in the stream.
What is the default behaviour if a record in the stream hasn't joined a record in the static table? Is it lost forever?
How is this resolved? By setting watermarks on the stream?
1 ACCEPTED SOLUTION

Accepted Solutions

Ashwin_DSA
Databricks Employee

Hi @yit337,

Your question sounds similar to this post. Have you checked it?

It also has a blog that you may find useful.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

View solution in original post

2 REPLIES


SteveOstrowski
Databricks Employee

Hi @yit337,

This is an important topic to understand, so let me walk through the mechanics in detail.

HOW STREAM-STATIC JOINS WORK

In a stream-static join, each micro-batch of streaming data is joined against the static DataFrame. The key behavior depends on the format of your static data:

1. If your static table is a Delta table, Databricks reads the latest valid version of the Delta table for each micro-batch. This means if a dimension row arrives in the static table before the next micro-batch fires, the join will pick it up. The Databricks documentation states: "A stream-static join joins the latest valid version of a Delta table (the static data) to a data stream using a stateless join."

2. If your static table is NOT Delta (e.g. Parquet, CSV, JSON), the file list is cached when the query starts and is never refreshed. New files added to the directory will be invisible to the running query. This is a critical difference.
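The difference can be sketched in plain Python (no Spark; table contents and keys are illustrative): the Delta case re-reads the latest static data each micro-batch, while the non-Delta case keeps working from a snapshot taken at query start.

```python
# Pure-Python sketch of the refresh semantics above (illustrative, no Spark).
static_table = {1: "a"}               # Delta dimension table: latest version is visible
cached_snapshot = dict(static_table)  # non-Delta source: file list frozen at query start

def join_batch(batch_keys, static_side):
    # inner join of one micro-batch against whatever static data is visible
    return [(k, static_side[k]) for k in batch_keys if k in static_side]

out_batch1 = join_batch([1, 2], static_table)    # key 2 has no dimension row yet

static_table[2] = "b"                            # dimension row lands between batches

out_delta = join_batch([2], static_table)        # Delta: new row is picked up
out_nondelta = join_batch([2], cached_snapshot)  # non-Delta: new row stays invisible
```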

WHAT HAPPENS TO UNMATCHED STREAM RECORDS

If a stream record does not find a match in the static table at the time the micro-batch runs:

- With an INNER JOIN: the stream record is dropped from the output. It is processed once and gone; it will not be retried in a future micro-batch. So yes, it is effectively "lost" from the join output.

- With a LEFT OUTER JOIN (stream on the left): the stream record appears in the output with nulls for the static columns. It is still emitted once and will not be re-evaluated when the static data later arrives.

In either case, the stream record is processed exactly once per micro-batch. There is no built-in mechanism to "go back" and re-join stream records that previously missed a match.
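A minimal pure-Python sketch of those two cases (no Spark; names are illustrative):

```python
# One micro-batch joining against the static side as it exists right now.
dim = {"k1": "germany"}   # static side at the moment the micro-batch fires
batch = ["k1", "k2"]      # "k2" has no matching dimension row yet

# INNER JOIN: the unmatched record is simply absent from the output
inner_out = [(k, dim[k]) for k in batch if k in dim]

# LEFT OUTER JOIN (stream on the left): emitted once, with None for dim columns
left_out = [(k, dim.get(k)) for k in batch]

# The join is stateless: even if dim later gains "k2", neither output
# for this micro-batch is ever revisited.
```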

DO WATERMARKS HELP HERE?

No. Watermarks are relevant for stream-stream joins where Spark needs to know how long to buffer state while waiting for matching records from the other stream. Stream-static joins are stateless, so watermarks do not apply and will not help with the late-arriving dimension problem.

HOW TO HANDLE LATE-ARRIVING STATIC DATA

Here are a few patterns that work:

1. ENSURE DIMENSIONS ARRIVE FIRST
The simplest approach is to make sure the static (dimension) table is populated before the streaming facts arrive. If you control the pipeline, sequence your loads so that dimension data lands first. In a Lakeflow Spark Declarative Pipeline (SDP), you can set up table dependencies to enforce this ordering.

2. USE A LEFT JOIN + DOWNSTREAM RECONCILIATION
Use a LEFT JOIN so that unmatched stream records are still emitted (with null dimension columns). Then run a periodic batch job or a second streaming pass to "reconcile" those records once the dimension data has arrived. For example:

-- Periodic reconciliation query: back-fill dimension columns for rows
-- that missed the join when their micro-batch ran
MERGE INTO enriched_facts AS target
USING (
  SELECT f.id, d.dim_col1, d.dim_col2
  FROM enriched_facts f
  JOIN dim_table d ON f.dim_key = d.dim_key
  WHERE f.dim_col1 IS NULL
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET
  target.dim_col1 = source.dim_col1,
  target.dim_col2 = source.dim_col2

3. CONVERT TO A STREAM-STREAM JOIN
If the dimension data is also arriving as a stream (or can be treated as one using Auto Loader or readStream on a Delta table), you can use a stream-stream join with watermarks. This gives Spark the ability to buffer and wait for late-arriving records from either side. The trade-off is added complexity and state management.
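To see why this helps, here is a pure-Python sketch of the buffering that a stream-stream join's state store performs (illustrative only; real Spark manages this state and the watermark for you):

```python
# Facts that have not yet found a dimension row are buffered as state.
WATERMARK_DELAY = 10          # seconds of lateness tolerated (illustrative)

fact_state = {}               # key -> (fact_value, event_time)
output = []

def on_fact(key, value, t):
    fact_state[key] = (value, t)

def on_dim(key, dim_value, t):
    # a late dimension row can still match a buffered fact
    if key in fact_state:
        fact_value, _ = fact_state.pop(key)
        output.append((key, fact_value, dim_value))

def advance_watermark(max_event_time):
    # state older than the watermark is evicted; later arrivals cannot match it
    cutoff = max_event_time - WATERMARK_DELAY
    for key in [k for k, (_, t) in fact_state.items() if t < cutoff]:
        del fact_state[key]

on_fact("k1", "click", t=100)
on_fact("k2", "view", t=90)
on_dim("k1", "germany", t=105)   # 5s late but within the delay: joined
advance_watermark(120)           # cutoff is 110: the buffered "k2" fact is evicted
on_dim("k2", "france", t=121)    # too late: nothing left to match
```

The watermark delay is the knob that trades completeness against state size: a larger delay tolerates later dimension rows but keeps more buffered state.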

4. REPROCESS THE STREAM
If late-arriving dimensions are rare, you could periodically reprocess the affected time range from your streaming source by resetting the checkpoint for that window, though this approach requires careful handling to avoid duplicates.

SUMMARY

- Delta tables as the static side: latest version is read each micro-batch, which helps when dimensions arrive between micro-batches.
- Non-Delta static sources: file list is cached at query start, no refresh.
- Unmatched stream records in an inner join are dropped permanently.
- Watermarks do not apply to stream-static joins.
- The recommended pattern for late-arriving dimensions is a LEFT JOIN with a downstream reconciliation step, or ensuring dimension data arrives first.

Reference documentation:
https://docs.databricks.com/en/transform/join.html#stream-static-joins

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.