Hi, the JOB_FINISHED (SUCCESS) termination reason is the key clue here. It means another job that was using the same all-purpose cluster finished, and its completion triggered the cluster termination, taking your still-running job down with it.
Most Likely Cause
When multiple workflows share the same all-purpose cluster via `existing_cluster_id`, any one of those jobs finishing can trigger the cluster lifecycle to mark it as "job finished." If the cluster's execution context is tied to the completing job, the cluster terminates even though your job is still active. This is a known pitfall of running workflows on shared all-purpose clusters.
Troubleshooting Steps
- Check what else was running on the cluster: go to Compute → select your cluster → Runs tab. Look for any other job or notebook that completed around the exact time your cluster was terminated. That's likely the culprit.
- Check cluster event log timing: in the cluster's Event Log, correlate the termination event timestamp with any other job completions. Even if details are sparse, the timestamp match will confirm the root cause.
- Check for `dbutils.notebook.exit()`: if your notebook (or any notebook in the workflow) calls `dbutils.notebook.exit("Success")` in conditional logic, it can signal job completion prematurely.
- Check auto-termination settings: if set too aggressively, the cluster may interpret a brief idle gap between tasks as inactivity. Look at Compute → Edit → Auto Termination value.
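The timestamp correlation in the steps above can be sketched in a few lines. This assumes you have already pulled the cluster's termination timestamp and the completion times of other runs (for example from the cluster Event Log or the Jobs UI); `find_suspect_runs` is a hypothetical helper, not a Databricks API:

```python
from datetime import datetime, timedelta

def find_suspect_runs(termination_time, run_completions, window_minutes=2):
    """Return (run_id, completed_at) pairs whose completion falls
    within `window_minutes` of the cluster termination timestamp."""
    window = timedelta(minutes=window_minutes)
    return [
        (run_id, completed_at)
        for run_id, completed_at in run_completions
        if abs(completed_at - termination_time) <= window
    ]

# Example: cluster terminated at 14:03; one run finished 30s earlier.
terminated = datetime(2024, 5, 1, 14, 3, 0)
runs = [
    ("run-101", datetime(2024, 5, 1, 12, 30, 0)),
    ("run-202", datetime(2024, 5, 1, 14, 2, 30)),
]
print(find_suspect_runs(terminated, runs))  # run-202 is the likely culprit
```

Any run that lands inside the window is a candidate for the job whose completion tore the cluster down.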
Recommended Fix
Switch to a job cluster (strongest fix):
Instead of pointing your workflow at an all-purpose cluster, configure the workflow to use a job cluster. Each workflow run gets its own dedicated cluster that only terminates when that specific workflow finishes. This completely eliminates the shared-cluster race condition.
In your workflow JSON config, replace:

```json
"existing_cluster_id": "xxxx-xxxxxx-xxxxxxxx"
```

with:

```json
"job_clusters": [
  {
    "job_cluster_key": "my_job_cluster",
    "new_cluster": {
      "spark_version": "15.4.x-scala2.12",
      "node_type_id": "your_instance_type",
      "num_workers": 2,
      "data_security_mode": "USER_ISOLATION"
    }
  }
]
```
Or in the UI: Edit workflow → Task → Cluster dropdown → select "New job cluster" instead of an existing all-purpose cluster.
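Note that each task in the workflow must also reference the new cluster by its key. As a sketch (the task key and notebook path below are placeholders), the task entry would look something like:

```json
"tasks": [
  {
    "task_key": "my_task",
    "job_cluster_key": "my_job_cluster",
    "notebook_task": {
      "notebook_path": "/path/to/notebook"
    }
  }
]
```

Without the `job_cluster_key` reference, the task has no compute assigned after `existing_cluster_id` is removed.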
Other alternatives:
- Serverless compute: no cluster management at all, fully isolated per job
- Dedicated all-purpose cluster: if you must use all-purpose, ensure no other jobs or workflows are configured to use the same cluster
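To verify no other workflows point at the same cluster, you could list job settings (for example via the Jobs API list endpoint with task details expanded; fetching them is outside this sketch) and filter with a small helper. This assumes each job is a dict shaped like a Jobs API response:

```python
def jobs_using_cluster(jobs, cluster_id):
    """Return names of jobs with any task pinned to `cluster_id`
    via `existing_cluster_id`."""
    matches = []
    for job in jobs:
        settings = job.get("settings", {})
        tasks = settings.get("tasks", [])
        if any(t.get("existing_cluster_id") == cluster_id for t in tasks):
            matches.append(settings.get("name", "<unnamed>"))
    return matches

# Example with two job definitions, one still on a shared cluster:
jobs = [
    {"settings": {"name": "etl_daily",
                  "tasks": [{"existing_cluster_id": "0501-abc"}]}},
    {"settings": {"name": "ml_train",
                  "tasks": [{"job_cluster_key": "my_job_cluster"}]}},
]
print(jobs_using_cluster(jobs, "0501-abc"))  # ['etl_daily']
```

An empty result means the cluster is truly dedicated; anything else is a workflow that can still trigger the shared-termination problem.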
Why All-Purpose Clusters Are Risky for Workflows
All-purpose clusters are designed for interactive, multi-user use. When workflows attach to them, the cluster lifecycle becomes unpredictable because multiple consumers (notebooks, workflows, SQL queries) compete for the same cluster context. Job clusters exist specifically to solve this: they provide 1:1 isolation between a workflow run and its compute.
Docs:
Hope this helps track it down!
Anuj Lathi
Solutions Engineer @ Databricks