06-23-2021 02:52 PM
How do you update a streaming job in production with minimal or no downtime when the code changes are significant enough that they may not be compatible with the existing checkpoint state, so the stream cannot simply resume from it?
- Labels: Code, Stream Processing
Accepted Solutions
09-16-2021 03:38 AM
- First confirm whether your code changes are compatible with the existing checkpoint; if they are not, you will need to start the query with a new checkpoint. More information on the types of changes allowed: https://docs.databricks.com/spark/latest/structured-streaming/production.html#types-of-changes
- If you go with a new checkpoint and do not specify a starting point for the source, the framework will fetch all of the data from the source again. In that case you must be prepared to handle duplicates, or they will be written to the sink. To handle them you can use dropDuplicates, a merge, or a row_number-based filter that keeps only rank 1 (see the sketch after this list).
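For illustration, here is a minimal sketch of the second bullet, assuming a Delta source and sink with hypothetical paths and column names (`event_id`, `event_ts`). It combines in-stream dropDuplicates with a merge in foreachBatch; a row_number() window filtered to rank 1 inside foreachBatch would work as well.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Restart the query against a new checkpoint. For a Delta source, the
# startingTimestamp option limits how far back the re-read goes; without
# a starting point the whole source is fetched again.
stream_df = (
    spark.readStream.format("delta")
    .option("startingTimestamp", "2021-09-01")  # hypothetical cut-off
    .load("/data/events_source")                # hypothetical source path
    .withWatermark("event_ts", "1 hour")
    .dropDuplicates(["event_id"])               # in-stream deduplication
)

# Merge each micro-batch into the sink so rows that are re-read after the
# checkpoint change update existing records instead of being appended twice.
def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/data/events_sink")  # hypothetical sink path
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

query = (
    stream_df.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/events_v2")  # new checkpoint location
    .start()
)
```

If the source supports a starting-point option (startingVersion/startingTimestamp for a Delta source, startingOffsets for Kafka), setting it avoids re-reading the entire source when you switch to a new checkpoint.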
06-23-2021 04:10 PM
This will likely depend on the use case. Can you provide an example of your current streaming setup and the kind of changes you anticipate making with minimal downtime?
11-10-2021 07:26 AM
Can you provide the source and sink type?
07-21-2022 03:33 AM
I have the same scenario: my source type is Parquet and the sink type is Delta in Azure Data Lake Gen2. I need to change the checkpoint location; how can we exclude the files that have already been processed? Can we do that without using the Auto Loader feature? Please confirm.
Please help ASAP.
Thanks
07-25-2022 01:51 AM
Thanks for the information, I will try to figure out the rest. Please keep sharing such informative posts.

