Administration & Architecture

Streaming job update

thibault
Contributor III

Hi! 

Using bundles, I want to update a running streaming job. Everything works until the new job is deployed, but then the job has to be stopped and started manually before the new assets are picked up. If that manual restart is missed, the job can keep running an old version.

How do you typically handle updates to streaming jobs automatically?

1 REPLY

mark_ott
Databricks Employee

To handle updates to streaming jobs automatically and ensure that new code or assets are picked up without requiring manual stops and restarts, you typically use one of the following approaches depending on your streaming framework and deployment environment:

Best Practice Approaches

  • Parallel Pipeline Deployment: Some managed platforms (like Google Dataflow) support "parallel pipeline updates," where a new version of the job is spun up in parallel with the old one, and the old job is drained after a set duration. This approach minimizes downtime and reduces manual steps, although it can temporarily duplicate data processing if not carefully managed. The new job must have a different name, and downstream consumers must handle duplicate or partial data that may result during the switchover.

  • Draining and Restart Automation: Where in-place updating or parallel replacement is not supported, automate the drain, stop, and start steps by using CI/CD automation or orchestrators (like Airflow, Jenkins, or built-in scheduler APIs of your cloud provider or streaming engine). These automation scripts or workflows can ensure that the current job is stopped safely after or while a new one is deployed, then started immediately, minimizing human error and latency.

  • Stateful Streaming Upgrades: Frameworks such as Apache Flink, Kafka Streams, and Spark Structured Streaming generally require stopping the existing pipeline and starting a new one with the updated assets. For zero-downtime, this process can be scripted. Some frameworks support "savepoints" or checkpoints that can be taken before shutdown, and then restored with the new job, limiting data loss or downtime.

  • In-flight Updates (where available): Some frameworks/platforms offer in-flight or rolling updates for streaming jobs, especially when only configuration or resource values are changed (not code or dependencies). For example, auto-scaling or light config updates may be safely applied on a running job, but code or asset changes usually require a job restart.
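On Databricks, the drain/restart step above can be scripted against the public Jobs 2.1 REST API after a bundle deploy. A minimal sketch (the host, token, and job ID are placeholders you would wire into your CI/CD pipeline):

```python
# Sketch: after `databricks bundle deploy`, cancel any active runs of the
# streaming job and trigger a fresh run so the new assets are picked up.
# HOST, TOKEN, and the job id are placeholders -- adapt to your workspace.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                         # placeholder

def _api(method, path, **kwargs):
    """Thin wrapper around the Databricks Jobs 2.1 REST API."""
    r = requests.request(method, f"{HOST}/api/2.1{path}",
                         headers={"Authorization": f"Bearer {TOKEN}"}, **kwargs)
    r.raise_for_status()
    return r.json()

def restart_streaming_job(job_id, api=_api):
    """Cancel all active runs of `job_id`, then trigger a fresh run."""
    runs = api("GET", "/jobs/runs/list",
               params={"job_id": job_id, "active_only": "true"}).get("runs", [])
    for run in runs:
        api("POST", "/jobs/runs/cancel", json={"run_id": run["run_id"]})
    return api("POST", "/jobs/run-now", json={"job_id": job_id})
```

Running this as the last step of the deploy pipeline removes the manual stop/start window the question describes; the structured-streaming checkpoint ensures the new run resumes where the old one left off.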

Tools and Automation Suggestions

  • Use CI/CD pipelines to automate deployment, draining, stopping, and starting of updated stream jobs.

  • Leverage job orchestration platforms with dependency/trigger management.

  • Where available, use cloud service APIs for jobs (such as Dataflowโ€™s parallel updates or AWS Glue Streaming Job update APIs) to script the update process.

  • Always ensure consumers and downstream systems are designed to handle duplicates or short gaps during transition windows.
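To make the last point concrete, a downstream consumer can tolerate a switchover window by deduplicating on a unique event key. A minimal sketch, assuming each event carries an `event_id` field (the names are illustrative, not from any specific API):

```python
# Sketch: consumer-side deduplication across a job switchover.
# Assumes every event carries a unique `event_id`; in practice the
# `seen` set would be bounded by a watermark or TTL, not grow forever.
def dedupe(events, seen=None):
    """Yield each event once, skipping ids that were already processed."""
    seen = set() if seen is None else seen
    for ev in events:
        if ev["event_id"] in seen:
            continue                      # duplicate from the overlap window
        seen.add(ev["event_id"])
        yield ev
```

Passing the same `seen` set across batches lets the consumer absorb replays from both the old and the new job during a parallel-update overlap.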

Additional Considerations

  • Be aware of data processing guarantees and possible duplicate/partial data during parallel runs or quick restarts, and plan your sinks/outputs accordingly (idempotent writes or deduplication logic).

  • Monitor lag, throughput, and state hydration to ensure the post-update service resumes smoothly.

  • For frameworks not supporting direct in-place updates, consider implementing blue/green deployment patterns for pipelines.
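The idempotent-write idea above can be sketched as a sink that upserts keyed on the event id, so micro-batches replayed after a restart from a checkpoint overwrite rather than duplicate. Here the "table" is a plain dict for illustration; a real sink would issue a MERGE/upsert against the target table:

```python
# Sketch: an idempotent sink. Re-delivered batches (e.g. replayed after a
# restart from a checkpoint) overwrite existing rows instead of appending
# duplicates. The dict stands in for a real table with a MERGE on event_id.
class IdempotentSink:
    def __init__(self):
        self.table = {}

    def write_batch(self, batch):
        for row in batch:
            self.table[row["event_id"]] = row   # upsert: last write wins
```

With a sink like this, the exact timing of the stop/start during an update stops mattering for correctness, only for latency.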

In summary, you should automate the deployment and (if needed) the stop/start or drain/restart phases as much as possible, and use any available managed features for rolling or parallel updates, to avoid manual intervention and reduce the risk of running outdated code.