01-21-2026 10:36 PM
I am running a notebook through a workflow using an all-purpose cluster ("data_security_mode": "USER_ISOLATION"). I am seeing some strange behaviour with the cluster during the run. While the job is still running, the cluster gets terminated with the reason JOB_FINISHED (SUCCESS). This causes the running job to fail with the error "cluster was terminated". I am not able to find any details in the cluster event log or driver log.
Wednesday
Hi, the JOB_FINISHED (SUCCESS) termination reason is the key clue here. It means another job that was using the same all-purpose cluster finished, and its completion triggered the cluster termination, taking your still-running job down with it.
When multiple workflows share the same all-purpose cluster via existing_cluster_id, any one of those jobs finishing can trigger the cluster lifecycle to mark it as "job finished." If the cluster's context gets tied to the completing job, it terminates even though your job is still active. This is a known pitfall of running workflows on shared all-purpose clusters.
Switch to a job cluster (strongest fix):
Instead of pointing your workflow at an all-purpose cluster, configure the workflow to use a job cluster. Each workflow run gets its own dedicated cluster that only terminates when that specific workflow finishes. This completely eliminates the shared-cluster race condition.
In your workflow JSON config, replace:
"existing_cluster_id": "xxxx-xxxxxx-xxxxxxxx"
with:
"job_clusters": [{
  "job_cluster_key": "my_job_cluster",
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "your_instance_type",
    "num_workers": 2,
    "data_security_mode": "USER_ISOLATION"
  }
}]
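For context, each task in the workflow must opt into the job cluster by referencing its key. A minimal sketch of the full job definition (the job name, cluster key, node type, task key, and notebook path below are all placeholders; "job_clusters" and "job_cluster_key" are the standard Jobs API fields):

```json
{
  "name": "my_workflow",
  "job_clusters": [{
    "job_cluster_key": "my_job_cluster",
    "new_cluster": {
      "spark_version": "15.4.x-scala2.12",
      "node_type_id": "your_instance_type",
      "num_workers": 2,
      "data_security_mode": "USER_ISOLATION"
    }
  }],
  "tasks": [{
    "task_key": "main_task",
    "job_cluster_key": "my_job_cluster",
    "notebook_task": {
      "notebook_path": "/path/to/your/notebook"
    }
  }]
}
```

With this shape, the cluster is created when the run starts and torn down only when this run ends, so no other job's completion can affect it.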
Or in the UI: Edit workflow → Task → Cluster dropdown → select "New job cluster" instead of an existing all-purpose cluster.
Why this happens:
All-purpose clusters are designed for interactive, multi-user use. When workflows attach to them, the cluster lifecycle becomes unpredictable because multiple consumers (notebooks, workflows, SQL queries) compete for the same cluster context. Job clusters exist specifically to solve this โ they provide 1:1 isolation between a workflow run and its compute.
Hope this helps track it down!
01-22-2026 01:23 AM
@holychs - Well, this behaviour needs some troubleshooting, I imagine.
- What is the auto-termination value? Try increasing it to a much higher value and observe whether the behaviour is the same.
- Does your workflow have multiple notebook tasks? If Task A finishes while Task B is still running, a glitch in the job context can sometimes trigger a cluster teardown if the cluster was pinned to the job.
- Does your notebook contain conditional logic that calls dbutils.notebook.exit("Success")?
- Are you triggering this job manually while someone else is using the cluster?
Also, check the Runs on the cluster. Go to the Compute page -> Select your Cluster -> Runs tab. This will show you exactly which jobs/notebooks were attached to that cluster at the moment of termination.
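If the UI event log shows nothing useful, the same event history can be pulled programmatically from the Clusters API. A minimal Python sketch, assuming a workspace URL and personal access token (both placeholders here) and the standard POST /api/2.0/clusters/events endpoint, filtered to termination events:

```python
import json

def build_events_request(cluster_id, limit=10):
    """Build the request body for the Clusters API events endpoint,
    filtered to termination events, newest first."""
    return {
        "cluster_id": cluster_id,
        "event_types": ["TERMINATING"],
        "order": "DESC",
        "limit": limit,
    }

# Placeholder cluster ID, matching the redacted form used above.
payload = build_events_request("xxxx-xxxxxx-xxxxxxxx")
print(json.dumps(payload, indent=2))

# To actually send it (requires a real workspace URL and PAT token):
# import requests
# resp = requests.post(f"{WORKSPACE_URL}/api/2.0/clusters/events",
#                      headers={"Authorization": f"Bearer {TOKEN}"},
#                      json=payload)
# for event in resp.json().get("events", []):
#     print(event["timestamp"], event["details"].get("reason"))
```

Each returned event carries a details.reason structure, which should show which job's completion actually triggered the teardown.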