Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cancelling a job run kills the parent process and does not wait for streaming queries to stop

Sadam97
New Contributor III

Hi,

We have created Databricks jobs, each with multiple tasks. Each task runs a 24/7 streaming query with checkpointing enabled, so state should survive a cancel-and-rerun. However, when we cancel a job run, it kills the parent process at the OS level and does not wait for the streaming queries in each task to stop. This is causing missing data between our staging and reporting layers: since we have to cancel and rerun the reporting job to deploy changes and additions, the abrupt termination appears to drop data. To recover, we have to recompute the whole reporting layer by dropping the checkpoint, which is a major bottleneck for us.

Is there a way to handle this, e.g. by prompting the Databricks job to wait for the streaming queries to terminate?

1 REPLY

Sidhant07
Databricks Employee

Hi @Sadam97 ,

This seems to be expected behaviour. 
If you are running the jobs on a job cluster:

On a job cluster, the Databricks job scheduler treats all streaming queries within a task as part of the same job execution context. If any query fails, the overall job is marked as failed and all queries are stopped; this avoids partial or inconsistent updates in automated workflows. For example, suppose you have three dependent streams (bronze → silver → gold). If the bronze stream fails with an error, the silver and gold streams have no new data to process, and the cluster would sit idle without doing any work, which is not what you want. So this is expected behaviour on a job cluster, by design.

If you are running the jobs on an interactive (all-purpose) cluster, queries are managed by individual notebook sessions, which enables isolated failures: restarting one query won't affect the others.
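The graceful pattern the question asks for — signal each stream to stop, then block until every one has drained — can be sketched in plain Python. Threads stand in for streaming queries here; in an actual Spark session you would iterate `spark.streams.active` and call `stop()` on each `StreamingQuery` before the process exits.

```python
import threading
import time

class StreamWorker:
    """Illustrative stand-in for a long-running streaming query (not a Spark API)."""

    def __init__(self, name):
        self.name = name
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Loop until asked to stop, simulating micro-batch processing.
        while not self._stop.is_set():
            time.sleep(0.05)

    def stop(self):
        # Request a stop, then block until the worker has fully terminated.
        self._stop.set()
        self._thread.join()

    @property
    def is_active(self):
        return self._thread.is_alive()

def shutdown_all(streams):
    """Stop every stream and wait for each to drain before returning."""
    for s in streams:
        s.stop()

streams = [StreamWorker(f"stream-{i}") for i in range(3)]
shutdown_all(streams)
print(all(not s.is_active for s in streams))  # prints True: all streams drained
```

The key point is that `shutdown_all` returns only after every worker has joined, which is exactly what a hard OS-level kill skips.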


Recommendation

  • For Production: If you need true isolation, schedule streaming queries as separate tasks within a job (workflow), or use distinct clusters, so that one failure does not cascade into the others.
  • For Development: Interactive clusters provide more flexibility for multi-query execution.
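As a rough illustration of the "separate tasks" recommendation, a Jobs API 2.1-style payload could look like the following. The job name, task keys, notebook paths, and cluster settings are all placeholders, not values from this thread:

```json
{
  "name": "streaming-pipeline",
  "tasks": [
    {
      "task_key": "bronze_stream",
      "notebook_task": { "notebook_path": "/Repos/pipeline/bronze" },
      "new_cluster": { "spark_version": "14.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 2 }
    },
    {
      "task_key": "silver_stream",
      "notebook_task": { "notebook_path": "/Repos/pipeline/silver" },
      "new_cluster": { "spark_version": "14.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 2 }
    }
  ]
}
```

Because each task runs on its own cluster, cancelling or failing one stream does not tear down the other.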