Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Update code for a streaming job in Production

User16783853906
Contributor III

How do you update a streaming job in production with minimal or no downtime when there are significant code changes that may not be compatible with the existing checkpoint state, so the stream cannot simply resume from that checkpoint?

1 ACCEPTED SOLUTION

Accepted Solutions

Deepak_Bhutada
Contributor III
  1. First, check whether your code changes are compatible with the existing checkpoint; if they are not, you will need to start with a new checkpoint. More information on which types of changes are allowed: https://docs.databricks.com/spark/latest/structured-streaming/production.html#types-of-changes
  2. If you go with a new checkpoint and do not specify a starting point for the source, the framework will re-fetch all of the data from the source. In that case you need to be able to handle duplicates, or else duplicate records will be written to the sink. To handle them, you can use dropDuplicates, a MERGE, or row_number-based filtering that keeps only rank 1 (see the MERGE sketch below).
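
To illustrate the MERGE option, here is a minimal sketch, assuming a hypothetical event_id key column, placeholder paths, and a Databricks notebook where spark is already defined:

```python
# Minimal sketch: restart the stream against a new checkpoint and upsert each
# micro-batch into the existing Delta sink so re-read data does not create duplicates.
# Paths, the schema, and the event_id key column are assumptions for this example.
from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/mnt/sink/events")
    (target.alias("t")
           .merge(micro_batch_df.alias("s"), "t.event_id = s.event_id")
           .whenNotMatchedInsertAll()   # only insert rows the sink has not seen yet
           .execute())

(spark.readStream
      .format("parquet")
      .schema("event_id STRING, event_time TIMESTAMP")   # file sources need an explicit schema
      .load("/mnt/source/events")
      .writeStream
      .foreachBatch(upsert_to_delta)
      .option("checkpointLocation", "/mnt/checkpoints/events_v2")  # new checkpoint location
      .start())
```

With whenNotMatchedInsertAll, rows already written by the previous run are simply skipped; if the new code also changes existing rows, add a whenMatchedUpdateAll clause as well.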


5 REPLIES

aladda
Honored Contributor II

This will likely be use-case and situation dependent. Can you provide an example of your current streaming setup and the kind of changes you anticipate making that would need minimal downtime?

Deepak_Bhutada
Contributor III
  1. First, check whether your code changes are compatible with the existing checkpoint; if they are not, you will need to start with a new checkpoint. More information on which types of changes are allowed: https://docs.databricks.com/spark/latest/structured-streaming/production.html#types-of-changes
  2. If you go with a new checkpoint and do not specify a starting point for the source, the framework will re-fetch all of the data from the source. In that case you need to be able to handle duplicates, or else duplicate records will be written to the sink. To handle them, you can use dropDuplicates, a MERGE, or row_number-based filtering that keeps only rank 1 (a one-off row_number cleanup sketch follows below).
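
If duplicates have already landed in the sink after the full re-read, the row_number approach can be applied as a one-off batch cleanup. This is a sketch only, assuming hypothetical key and ordering columns and a placeholder sink path:

```python
# One-off batch cleanup of the Delta sink: keep a single row per key (row_number
# rank 1) after a full re-read has introduced duplicates. The column names and
# the sink path are assumptions for this example.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

sink_path = "/mnt/sink/events"
w = Window.partitionBy("event_id").orderBy(F.col("event_time").desc())

deduped = (spark.read.format("delta").load(sink_path)
                .withColumn("rn", F.row_number().over(w))
                .filter("rn = 1")
                .drop("rn"))

# Delta's snapshot isolation allows reading a table and overwriting it in the same job.
(deduped.write.format("delta")
        .mode("overwrite")
        .save(sink_path))
```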

Sandeep
Contributor III

Can you provide the source and sink types?

Himanshi
New Contributor III

I have the same scenario: my source type is Parquet and my sink type is Delta, in Azure Data Lake Storage Gen2. I need to change the checkpoint location; how can we exclude the files that have already been processed? Can we do that without using the Auto Loader feature? Please confirm.
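
To clarify, the Auto Loader behaviour I am referring to (skipping files that already exist in the source directory when a stream starts against a fresh checkpoint) would look roughly like this; the paths and schema below are placeholders:

```python
# Sketch for contrast only: Auto Loader can ignore files already present in the
# source directory when the stream first starts. Paths and schema are placeholders.
df = (spark.readStream
           .format("cloudFiles")
           .option("cloudFiles.format", "parquet")
           .option("cloudFiles.includeExistingFiles", "false")  # only pick up files arriving after the stream starts
           .schema("event_id STRING, event_time TIMESTAMP")
           .load("abfss://container@account.dfs.core.windows.net/source/events"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/checkpoints/events_v2")
   .start("abfss://container@account.dfs.core.windows.net/sink/events"))
```

As far as I know, the plain Parquet file source does not have an equivalent option, which is why I am asking how to exclude the existing files.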

Please help asap

Thanks

Anonymous
Not applicable

Thanks for the information, I will try to figure out more from here. Keep sharing such informative posts.

