Update code for a streaming job in Production

User16783853906
Contributor III

How do you update a streaming job in production with minimal or no downtime when there are significant code changes that may not be compatible with the existing checkpoint state, preventing the stream from resuming?

1 ACCEPTED SOLUTION


Deepak_Bhutada
Contributor III
  1. First, check whether your code changes are compatible with the existing checkpoint; if not, you will need to start with a new checkpoint. See the documentation for the types of changes that are allowed: https://docs.databricks.com/spark/latest/structured-streaming/production.html#types-of-changes
  2. If you start with a new checkpoint and do not specify a starting point for the source, the framework will fetch all of the data from the source again. In that case you must be prepared to handle the duplicates, or they will be written to the sink. To deduplicate, you can use dropDuplicates, a merge, or a row_number()-based filter that keeps only rank 1 (see the sketch after this list).
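
To make point 2 concrete, here is a minimal PySpark sketch, assuming a Parquet source and a Delta sink (as elsewhere in this thread), a hypothetical event_id key column, and the spark session that Databricks notebooks provide; it restarts the stream against a new checkpoint and uses foreachBatch with a Delta MERGE so that rows re-read after the checkpoint reset are upserted rather than appended as duplicates:

```python
from delta.tables import DeltaTable

# All paths and the event_id column are hypothetical -- adjust to your pipeline.
SOURCE_PATH = "/mnt/raw/events"                 # Parquet streaming source
TARGET_PATH = "/mnt/curated/events"             # Delta sink
NEW_CHECKPOINT = "/mnt/checkpoints/events_v2"   # fresh checkpoint after the code change

def upsert_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch on the key, then MERGE into the
    # target table so re-read rows update existing ones instead of
    # being appended as duplicates.
    deduped = batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forPath(batch_df.sparkSession, TARGET_PATH)
    (target.alias("t")
           .merge(deduped.alias("s"), "t.event_id = s.event_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .format("parquet")
      .schema(spark.read.parquet(SOURCE_PATH).schema)  # file sources need an explicit schema
      .load(SOURCE_PATH)
      .writeStream
      .foreachBatch(upsert_batch)
      .option("checkpointLocation", NEW_CHECKPOINT)
      .start())
```

A dropDuplicates with a watermark, or a row_number() window filter keeping only the first row per key, would serve the same purpose if a MERGE is not an option for your sink.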


5 REPLIES

Anand_Ladda
Honored Contributor II

This will likely be use case/situation dependent. Can you provide an example of your current streaming setup and what kind of changes you anticipate that you'd like to perform with minimal downtime?


Sandeep
Contributor III

Can you provide the source and sink type?

Himanshi
New Contributor III

I have the same scenario: the source type is Parquet and the sink type is Delta, in Azure Data Lake Gen2. I need to change the checkpoint location; how can we exclude the files that have already been processed? Can we do that without using the Auto Loader feature? Please confirm.

Please help ASAP.

Thanks

Anonymous
Not applicable

Thanks for the information, I will try to figure out more. Keep sharing such informative posts.

