Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Run failed with error message Cluster was terminated. Reason: JOB_FINISHED (SUCCESS)

holychs
Databricks Partner

I am running a notebook through a workflow on an all-purpose cluster ("data_security_mode": "USER_ISOLATION"). I am seeing some strange behaviour during the run: while the job is still running, the cluster gets terminated with the reason JOB_FINISHED (SUCCESS). This causes the running job to fail with the error "cluster was terminated". I am not able to find any details in the cluster event log or the driver log.

1 ACCEPTED SOLUTION

Accepted Solutions

anuj_lathi
Databricks Employee

Hi! The JOB_FINISHED (SUCCESS) termination reason is the key clue here. It means another job that was using the same all-purpose cluster finished, and its completion triggered the cluster termination, taking your still-running job down with it.

Most Likely Cause

When multiple workflows share the same all-purpose cluster via existing_cluster_id, any one of those jobs finishing can trigger the cluster lifecycle to mark it as "job finished." If the cluster's context gets tied to the completing job, it terminates even though your job is still active. This is a known pitfall of running workflows on shared all-purpose clusters.

Troubleshooting Steps

  1. Check what else was running on the cluster: go to Compute -> select your cluster -> Runs tab. Look for any other job/notebook that completed around the exact time your cluster was terminated. That's likely the culprit.
  2. Check cluster event log timing: in the cluster's Event Log, correlate the termination event timestamp with any other job completions. Even if details are sparse, a timestamp match will confirm the root cause.
  3. Check for `dbutils.notebook.exit()`: if your notebook (or any notebook in the workflow) calls dbutils.notebook.exit("Success") in conditional logic, it can signal job completion prematurely.
  4. Check auto-termination settings: if set too aggressively, the cluster may interpret a brief idle gap between tasks as inactivity. Look at the Compute -> Edit -> Auto Termination value.
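The timestamp correlation in step 2 above can be sketched in Python. The run names and timestamps below are hypothetical placeholders; in practice you would read them off the cluster Event Log and the other jobs' run pages:

```python
from datetime import datetime

def find_culprits(termination_time, job_completions, window_seconds=60):
    """Return job runs that finished within `window_seconds` of the cluster
    termination -- the likely triggers of JOB_FINISHED (SUCCESS)."""
    return sorted(
        run for run, finished_at in job_completions.items()
        if abs((finished_at - termination_time).total_seconds()) <= window_seconds
    )

# Hypothetical timestamps copied from the Event Log and job run pages
terminated_at = datetime(2024, 6, 1, 10, 15, 30)
other_runs = {
    "nightly_etl": datetime(2024, 6, 1, 10, 15, 12),   # 18 s before termination
    "weekly_report": datetime(2024, 6, 1, 9, 0, 0),    # hours earlier, unrelated
}
print(find_culprits(terminated_at, other_runs))  # -> ['nightly_etl']
```

Any run that lands inside the window is worth checking against the termination event before concluding it is the trigger.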

Recommended Fix

Switch to a job cluster (strongest fix):

Instead of pointing your workflow at an all-purpose cluster, configure the workflow to use a job cluster. Each workflow run gets its own dedicated cluster that only terminates when that specific workflow finishes. This completely eliminates the shared-cluster race condition.

In your workflow JSON config, replace:

"existing_cluster_id": "xxxx-xxxxxx-xxxxxxxx"

with:

"job_clusters": [{
  "job_cluster_key": "my_job_cluster",
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "your_instance_type",
    "num_workers": 2,
    "data_security_mode": "USER_ISOLATION"
  }
}]

Or in the UI: Edit workflow -> Task -> Cluster dropdown -> select "New job cluster" instead of an existing all-purpose cluster.
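Note that swapping in job_clusters alone is not enough: each task in the workflow must also reference the cluster by its key. A minimal complete job settings sketch (Jobs API 2.1 style; the job name, notebook path, and instance type are placeholders):

```json
{
  "name": "my_workflow",
  "job_clusters": [{
    "job_cluster_key": "my_job_cluster",
    "new_cluster": {
      "spark_version": "15.4.x-scala2.12",
      "node_type_id": "your_instance_type",
      "num_workers": 2,
      "data_security_mode": "USER_ISOLATION"
    }
  }],
  "tasks": [{
    "task_key": "run_notebook",
    "job_cluster_key": "my_job_cluster",
    "notebook_task": { "notebook_path": "/Workspace/path/to/notebook" }
  }]
}
```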

Other alternatives:

  • Serverless compute: no cluster management at all, fully isolated per job
  • Dedicated all-purpose cluster: if you must use all-purpose, ensure no other jobs/workflows are configured to use the same cluster

Why All-Purpose Clusters Are Risky for Workflows

All-purpose clusters are designed for interactive, multi-user use. When workflows attach to them, the cluster lifecycle becomes unpredictable because multiple consumers (notebooks, workflows, SQL queries) compete for the same cluster context. Job clusters exist specifically to solve this: they provide 1:1 isolation between a workflow run and its compute.


Hope this helps track it down!

Anuj Lathi
Solutions Engineer @ Databricks



Raman_Unifeye
Honored Contributor III

@holychs - Well, this behaviour needs troubleshooting I imagine.

- What is the auto-termination value? Try increasing it to a much higher value and observe whether the behaviour is the same.

- Does your workflow have multiple notebook tasks? If Task A finishes while Task B is still running, a glitch in the job context can sometimes trigger a cluster teardown if the cluster was pinned to the job.

- Does your notebook contain conditional logic that calls dbutils.notebook.exit("Success")?

- Are you triggering this job manually while someone else is using the cluster?

Also, check the Runs on the cluster. Go to the Compute page -> Select your Cluster -> Runs tab. This will show you exactly which jobs/notebooks were attached to that cluster at the moment of termination.
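If you export the cluster's event history as JSON (via the UI or the cluster events API), a short script can pull the termination reason out directly. The event shape below mirrors the API response, but the values are hypothetical:

```python
import json

# Hypothetical export of the cluster events API response
events_json = """
[
  {"timestamp": 1717236930000, "type": "TERMINATING",
   "details": {"reason": {"code": "JOB_FINISHED", "type": "SUCCESS"}}},
  {"timestamp": 1717236000000, "type": "RUNNING", "details": {}}
]
"""

def termination_reasons(events):
    """Yield (timestamp_ms, reason_code) for every TERMINATING event."""
    for ev in events:
        if ev.get("type") == "TERMINATING":
            reason = ev.get("details", {}).get("reason", {})
            yield ev["timestamp"], reason.get("code")

events = json.loads(events_json)
print(list(termination_reasons(events)))  # -> [(1717236930000, 'JOB_FINISHED')]
```

The timestamp of each TERMINATING event is what you then compare against the completion times of the other runs attached to the cluster.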

 


RG #Driving Business Outcomes with Data Intelligence
