Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Stream/static Join

Y2DTL
New Contributor III

Hi all

Would appreciate your help on a topic.

When performing a join between a static and a streaming DataFrame, is the latest version of the static table used at the start of the job, or within each micro-batch?

The documentation doesn't seem to state specifically which version of the static table will be used. I could probably just test this in a notebook, but I thought I'd ask here as I'm sure some of you will know.

Thanks for any help. 

 

1 ACCEPTED SOLUTION (see szymon_dybczak's reply below)

5 Replies

szymon_dybczak
Esteemed Contributor III

Hi @Y2DTL ,

Here's the answer from the documentation:

 

A stream-static join joins the latest valid version of a Delta table (the static data) to a data stream using a stateless join.

 

When Databricks processes a micro-batch of data in a stream-static join, the latest valid version of data from the static Delta table joins with the records present in the current micro-batch. Because the join is stateless, you do not need to configure watermarking and can process results with low latency. The data in the static Delta table used in the join should be slowly-changing.

https://docs.databricks.com/gcp/en/transform/join#stream-static-joins
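As a toy illustration of the snippet above (plain Python, not the Spark API, and all names invented for the sketch): each micro-batch reads the current state of the static table before joining, which is the "latest valid version" behaviour the documentation describes for a static Delta table.

```python
# Toy model of a stream-static join. NOT the Spark API -- just a
# simulation of "latest valid version per micro-batch" semantics.

# A mutable dict stands in for the static Delta table (key -> value).
static_table = {"a": "v1", "b": "v1"}

def process_micro_batch(batch_keys):
    """Join the micro-batch against the *current* state of the static
    table, the way each micro-batch sees the latest Delta version."""
    snapshot = dict(static_table)  # latest valid version at batch time
    return [(k, snapshot.get(k)) for k in batch_keys]

out1 = process_micro_batch(["a", "b"])  # first batch sees v1 for both keys
static_table["a"] = "v2"                # static table updated mid-stream
out2 = process_micro_batch(["a", "b"])  # next batch picks up v2 for "a"
print(out1, out2)
```

Because the join is stateless, nothing from earlier batches is retained; every batch is just a fresh lookup against whatever the static side currently contains.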

Y2DTL
New Contributor III

Thanks, Really appreciate the reply. 

I have read that in the documentation, which I assumed meant that for every micro-batch processed in the stream, the static table would be re-read and the new latest version used.

I have just tested this in a notebook: I created a table, started my stream, updated the static table, and processed a new micro-batch, and it was still using the snapshot of the table from when I initialised the job, not the new version.

Now I'm even more confused.

 

szymon_dybczak
Esteemed Contributor III

Hi @Y2DTL ,

To be honest, I'm also surprised. I understood this documentation snippet exactly as you did. Moreover, check out the excellent blog entry below from Bartosz Konieczny, where he analyzes a quite similar scenario:

https://www.waitingforcode.com/apache-spark-structured-streaming/broadcast-join-changing-static-data...

In the case of the Delta format, we should get the latest version.

So maybe one final question: is your static dataset in Delta format? Because if not, then Bartosz's blog explains this 🙂
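To make the distinction concrete (again plain Python with invented names, not Spark): a Delta static side is re-resolved to its latest version at every micro-batch, while a non-Delta static source behaves like a snapshot pinned when the streaming query starts.

```python
# Version history of the "static" table: each write appends a new version.
versions = [{"a": 1}]

def latest():
    """Resolve the most recent version of the static table."""
    return versions[-1]

# Non-Delta-like behaviour: the snapshot is resolved once, at query start.
pinned = latest()

versions.append({"a": 2})  # the table is updated after the stream starts

# Delta-like behaviour: each micro-batch resolves the latest version again.
per_batch = latest()

print(pinned["a"], per_batch["a"])  # pinned keeps 1; per-batch sees 2
```

The pinned reference never sees the update, which matches the "stale static side" symptom with non-Delta sources; re-resolving per batch is what the Delta path gives you.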

Y2DTL
New Contributor III

So I've retested, and I obviously made a mistake the first time around. Accept my humble apology.

It seems that the static Delta table does refresh with each micro-batch. So the most recent version of the table is joined to the streaming DataFrame with each new micro-batch.

Really appreciate your help @szymon_dybczak. Thanks a lot. 

This has been bugging me for a few days and finally I have an answer. 

szymon_dybczak
Esteemed Contributor III

Hi @Y2DTL ,

Great that you figured it out! And if you think my answer was helpful, please consider marking it as a solution.
