Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Stream/static Join

Y2DTL
New Contributor III

Hi all

Would appreciate your help on a topic.

When performing a join between a static and a streaming DataFrame, is the latest version of the static table used at the start of the job, or within each micro-batch?

The documentation doesn't seem to state specifically which version of the static table will be used. I could probably just test this in a notebook, but I thought I'd ask here as I'm sure some of you will know.

Thanks for any help. 

 

1 ACCEPTED SOLUTION (see szymon_dybczak's reply below)

5 Replies

szymon_dybczak
Esteemed Contributor III

Hi @Y2DTL ,

Here's the answer from the documentation:

 

A stream-static join joins the latest valid version of a Delta table (the static data) to a data stream using a stateless join.

 

When Databricks processes a micro-batch of data in a stream-static join, the latest valid version of data from the static Delta table joins with the records present in the current micro-batch. Because the join is stateless, you do not need to configure watermarking and can process results with low latency. The data in the static Delta table used in the join should be slowly-changing.

https://docs.databricks.com/gcp/en/transform/join#stream-static-joins
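As a toy illustration of the snippet above (plain Python, not the Spark API, and all names invented for the sketch): each micro-batch reads the current state of the static table before joining, which is the "latest valid version" behaviour the documentation describes for a static Delta table.

```python
# Toy model of a stream-static join. NOT the Spark API -- just a
# simulation of "latest valid version per micro-batch" semantics.

# A mutable dict stands in for the static Delta table (key -> value).
static_table = {"a": "v1", "b": "v1"}

def process_micro_batch(batch_keys):
    """Join the micro-batch against the *current* state of the static
    table, the way each micro-batch sees the latest Delta version."""
    snapshot = dict(static_table)  # latest valid version at batch time
    return [(k, snapshot.get(k)) for k in batch_keys]

out1 = process_micro_batch(["a", "b"])  # first batch sees v1 for both keys
static_table["a"] = "v2"                # static table updated mid-stream
out2 = process_micro_batch(["a", "b"])  # next batch picks up v2 for "a"
print(out1, out2)
```

Because the join is stateless, nothing from earlier batches is retained; every batch is just a fresh lookup against whatever the static side currently contains.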

Y2DTL
New Contributor III

Thanks, Really appreciate the reply. 

I have read that in the documentation, which I assumed meant that for every micro-batch processed in the stream, the static table would be re-read and the new latest version used.

I have just tested this in a notebook: I created a table, started my stream, updated the static table, and processed a new micro-batch, and it was still using the snapshot of the table from when I initialised the job, not the new version.

Now I'm even more confused.

 

szymon_dybczak
Esteemed Contributor III

Hi @Y2DTL ,

To be honest, I'm also surprised. I understood this documentation snippet exactly as you did. Moreover, check out the excellent blog entry below from Bartosz Konieczny, where he analyzes a quite similar scenario:

https://www.waitingforcode.com/apache-spark-structured-streaming/broadcast-join-changing-static-data...

In the case of the Delta format, we should get the latest version.

So maybe one final question: is your static dataset in Delta format? Because if not, then Bartosz's blog explains this 🙂
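To make the distinction concrete (again plain Python with invented names, not Spark): a Delta static side is re-resolved to its latest version at every micro-batch, while a non-Delta static source behaves like a snapshot pinned when the streaming query starts.

```python
# Version history of the "static" table: each write appends a new version.
versions = [{"a": 1}]

def latest():
    """Resolve the most recent version of the static table."""
    return versions[-1]

# Non-Delta-like behaviour: the snapshot is resolved once, at query start.
pinned = latest()

versions.append({"a": 2})  # the table is updated after the stream starts

# Delta-like behaviour: each micro-batch resolves the latest version again.
per_batch = latest()

print(pinned["a"], per_batch["a"])  # pinned keeps 1; per-batch sees 2
```

The pinned reference never sees the update, which matches the "stale static side" symptom with non-Delta sources; re-resolving per batch is what the Delta path gives you.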

Y2DTL
New Contributor III

So I've retested, and I obviously made a mistake the first time around. Accept my humble apology.

It seems that the static Delta table does refresh with each micro-batch. So the most recent version of the table is joined to the streaming DataFrame with each new micro-batch.

Really appreciate your help @szymon_dybczak. Thanks a lot. 

This has been bugging me for a few days and finally I have an answer. 

szymon_dybczak
Esteemed Contributor III

Hi @Y2DTL ,

Great that you figured it out! And if you think my answer was helpful, please consider marking it as a solution.
