Update code for a streaming job in Production

User16783853906
Contributor III

How do you update a streaming job in production with minimal or no downtime when there are significant code changes that may not be compatible with the existing checkpoint state, preventing the stream from resuming?

1 ACCEPTED SOLUTION


Deepak_Bhutada
Contributor III
  1. First, check whether your code changes are compatible with the existing checkpoint; if not, you will need to start with a new checkpoint. See the documentation for the types of changes that are allowed: https://docs.databricks.com/spark/latest/structured-streaming/production.html#types-of-changes
  2. If you start with a new checkpoint and do not specify a starting point for the source, the framework will fetch all of the data from the source again. In that case you must be prepared to handle the duplicates, or they will be written to the sink. To deduplicate, you can use dropDuplicates, a merge, or a row_number()-based filter that keeps only rank 1 (see the sketch after this list).
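
To make point 2 concrete, here is a minimal PySpark sketch, assuming a Parquet source and a Delta sink (as elsewhere in this thread), a hypothetical event_id key column, and the spark session that Databricks notebooks provide; it restarts the stream against a new checkpoint and uses foreachBatch with a Delta MERGE so that rows re-read after the checkpoint reset are upserted rather than appended as duplicates:

```python
from delta.tables import DeltaTable

# All paths and the event_id column are hypothetical -- adjust to your pipeline.
SOURCE_PATH = "/mnt/raw/events"                 # Parquet streaming source
TARGET_PATH = "/mnt/curated/events"             # Delta sink
NEW_CHECKPOINT = "/mnt/checkpoints/events_v2"   # fresh checkpoint after the code change

def upsert_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch on the key, then MERGE into the
    # target table so re-read rows update existing ones instead of
    # being appended as duplicates.
    deduped = batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forPath(batch_df.sparkSession, TARGET_PATH)
    (target.alias("t")
           .merge(deduped.alias("s"), "t.event_id = s.event_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .format("parquet")
      .schema(spark.read.parquet(SOURCE_PATH).schema)  # file sources need an explicit schema
      .load(SOURCE_PATH)
      .writeStream
      .foreachBatch(upsert_batch)
      .option("checkpointLocation", NEW_CHECKPOINT)
      .start())
```

A dropDuplicates with a watermark, or a row_number() window filter keeping only the first row per key, would serve the same purpose if a MERGE is not an option for your sink.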


5 REPLIES

Anand_Ladda
Honored Contributor II

This will likely be use case/situation dependent. Can you provide an example of your current streaming setup and what kind of changes you anticipate that you'd like to perform with minimal downtime?


Sandeep
Contributor III

Can you provide the source and sink type?

Himanshi
New Contributor III

I have the same scenario: the source type is Parquet and the sink type is Delta, in Azure Data Lake Gen2. I need to change the checkpoint location; how can we exclude the files that have already been processed? Can we do that without using the Auto Loader feature? Please confirm.

Please help ASAP.

Thanks

Anonymous
Not applicable

Thanks for the information, I will try to figure out more. Keep sharing such informative posts.

