
Strategy to add a new table based on silver data

Joe1912
New Contributor III

I have a merge function used in a streaming foreachBatch, something like this:

    def mergedf(df, i):          # i is the micro-batch id passed by foreachBatch
        merge_func_1(df, i)
        merge_func_2(df, i)

Now I want to add a new merge_func_3 to it.

Are there any best practices for this case? Since the stream is always running, how can I process the data from the beginning for merge_func_3 without stopping the stream, creating another temporary job just to run func_3, and then restarting the stream with func_3 added?
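For context, the wrapper is attached to the stream roughly like this (the source table name and checkpoint path here are just placeholders, not my real values):

    (spark.readStream
        .table("silver_table")                                      # placeholder source table
        .writeStream
        .foreachBatch(mergedf)
        .option("checkpointLocation", "/tmp/checkpoints/mergedf")   # placeholder path
        .start())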

1 ACCEPTED SOLUTION


Kaniz_Fatma
Community Manager

Hi @Joe1912, when adding a new merge function to a streaming data pipeline, you have a few options. If you want the new function to apply only to recent data, you can simply add it to your existing foreachBatch function. To apply it to historical data as well, you can run a one-off batch job that reprocesses the existing silver data with the new function. Alternatively, you can rely on Spark's checkpointing to recover the stream's state and apply the new function to all data. The choice depends on your needs: the batch job is simpler but may involve some data duplication, while checkpointing avoids duplication and maintains state.
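A minimal sketch of the batch-job approach, assuming the silver source is a table called silver_table and that merge_func_3 takes the same (df, batch_id) arguments as the existing functions:

    # One-off backfill: run the new merge function once over all existing silver data.
    silver_df = spark.read.table("silver_table")   # assumed table name
    merge_func_3(silver_df, -1)                    # batch id is unused for the backfill

    # Then restart the stream with the new function included, reusing the same
    # checkpoint location so already-processed micro-batches are not replayed:
    def mergedf(df, i):
        merge_func_1(df, i)
        merge_func_2(df, i)
        merge_func_3(df, i)   # now applied to every new micro-batch as well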


2 REPLIES


Kaniz_Fatma
Community Manager

Hi @Joe1912, I want to express my gratitude for your effort in selecting the most suitable solution. It's great to hear that your query has been successfully resolved. Thank you for your contribution.




 
