07-16-2025 10:05 AM
Hi all
Would appreciate your help on a topic.
When performing a join between a static and a streaming DataFrame, is the latest version of the static table read once at the start of the job, or is it re-read within each micro-batch?
The documentation doesn't seem to state specifically which version of the static table is used. I could probably just test this in a notebook, but I thought I'd ask here as I'm sure some of you will know.
Thanks for any help.
07-16-2025 10:35 AM
Hi @Y2DTL ,
Here's an answer from the documentation:
A stream-static join joins the latest valid version of a Delta table (the static data) to a data stream using a stateless join.
When Databricks processes a micro-batch of data in a stream-static join, the latest valid version of data from the static Delta table joins with the records present in the current micro-batch. Because the join is stateless, you do not need to configure watermarking and can process results with low latency. The data in the static Delta table used in the join should be slowly-changing.
https://docs.databricks.com/gcp/en/transform/join#stream-static-joins
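For anyone who wants to try the behaviour described in that snippet, here is a minimal PySpark sketch of a stream-static join. It is only an illustration under assumptions: the table name `demo.static_dim`, the key column `id`, the query name `joined_out`, and the use of the built-in `rate` source and `memory` sink are all placeholders that are not from this thread, and the static table is assumed to be a Delta table with a numeric `id` column.

```python
def start_stream_static_join(spark, static_table="demo.static_dim", key="id"):
    """Start a stream-static join and return the running StreamingQuery.

    `spark` is an existing SparkSession. The table name and key column are
    hypothetical placeholders; swap in your own.
    """
    # Static side: an ordinary batch read of the (assumed Delta) table.
    static_df = spark.read.table(static_table)

    # Streaming side: the built-in "rate" test source emits (timestamp, value)
    # rows; "value" is renamed so it can serve as the join key.
    stream_df = (
        spark.readStream.format("rate")
        .option("rowsPerSecond", 1)
        .load()
        .withColumnRenamed("value", key)
    )

    # Stateless inner join: no watermark is configured.
    joined = stream_df.join(static_df, on=key, how="inner")

    # Write to an in-memory table so the output can be inspected with SQL.
    return (
        joined.writeStream.format("memory")
        .queryName("joined_out")
        .outputMode("append")
        .start()
    )
```

In a notebook you could call `query = start_stream_static_join(spark)`, add rows to the static table while the stream is running, and then compare `spark.sql("SELECT * FROM joined_out")` before and after the update to see which version of the static table later micro-batches picked up.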
07-16-2025 10:49 AM
Thanks, I really appreciate the reply.
I had read that in the documentation, and I assumed it meant that for every micro-batch processed in the stream, the static table would be re-read and the new latest version would be used.
I have just tested this in a notebook: I created a table, started my stream, updated my static table, and processed a new micro-batch. It was still using the snapshot of the table from when I initialised the job, not the new version available when the new micro-batch ran.
Now I'm even more confused.
07-16-2025 11:40 AM - edited 07-16-2025 11:41 AM
Hi @Y2DTL ,
To be honest, I'm also surprised. I understood this documentation snippet exactly as you did. Moreover, check the excellent blog entry by Bartosz Konieczny, where he analyzes a quite similar scenario:
In the case of Delta format, we should get the latest version.
So maybe a final question: is your static dataset in Delta format? Because if not, then Bartosz's blog explains this.
07-16-2025 11:52 AM
So I've retested, and I obviously made a mistake the first time around. Please accept my humble apology.
It seems that the static Delta table does refresh with each micro-batch, so the most recent version of the table is joined to the streaming DataFrame with each new micro-batch.
Really appreciate your help @szymon_dybczak. Thanks a lot.
This has been bugging me for a few days and finally I have an answer.
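To make the difference between the two readings of the documentation concrete, here is a toy model in plain Python. It is not how Spark or Delta is implemented; `VersionedTable` and `join_batch` are made-up names used only to contrast "latest version resolved per micro-batch" with "snapshot frozen at job start".

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy stand-in for a Delta table: every commit appends a new version."""

    versions: list = field(default_factory=list)

    def commit(self, rows: dict) -> None:
        # Store a copy so later commits never mutate earlier versions.
        self.versions.append(dict(rows))

    def latest(self) -> dict:
        return self.versions[-1]


def join_batch(batch, lookup):
    """Inner-join one micro-batch of keys against a key -> value lookup."""
    return [(key, lookup[key]) for key in batch if key in lookup]


table = VersionedTable()
table.commit({1: "a"})  # version 0 exists before the stream starts
snapshot_at_start = table.latest()  # what a "read once at job start" design keeps

# First micro-batch: both designs see version 0.
batch_1 = join_batch([1, 2], table.latest())

# The static table is updated while the stream is running (version 1).
table.commit({1: "a", 2: "b"})

# Second micro-batch: latest version resolved per batch vs. the stale snapshot.
batch_2_per_batch = join_batch([1, 2], table.latest())
batch_2_frozen = join_batch([1, 2], snapshot_at_start)
```

The retest result in this thread corresponds to `batch_2_per_batch`: key `2` matches in the second micro-batch because the lookup is resolved again after the update, whereas `batch_2_frozen` still only matches key `1`.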
07-16-2025 01:40 PM
Hi @Y2DTL ,
Great that you figured it out! And if you think my answer was helpful, please consider marking it as the solution.