
Pausing a scheduled Azure Databricks job after failure

Dipesh
New Contributor II

Hi All,

I have a job/workflow scheduled in Databricks to run after every hour.

How can I configure my Job to pause whenever a job run fails? (Pause the job/workflow on first failure)

I would want to prevent triggering multiple runs due to the scheduled/un-paused state of the job after the first failure and resume the schedule after the issue is fixed.

Thank you.

4 REPLIES

shan_chandra
Databricks Employee

@Dipesh Yogi - Please refer to the current behavior:

When you configure task dependencies within a job (for example, task2 does not start until task1 completes), a failed task causes its downstream tasks in that run to be skipped, with the message below:

Task <Task-name> failed. This caused all downstream tasks to get skipped.

Reference: https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/jobs#--task-dependencies
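
For illustration, here is a minimal sketch of what such a task dependency looks like in the job settings, following the Jobs API 2.1 schema (the task keys and notebook paths are placeholders):

# Placeholder job settings: "task2" will not start until "task1"
# succeeds, and it is skipped if "task1" fails.
job_settings = {
    "tasks": [
        {
            "task_key": "task1",
            "notebook_task": {"notebook_path": "/Repos/example/task1"},
        },
        {
            "task_key": "task2",
            "depends_on": [{"task_key": "task1"}],
            "notebook_task": {"notebook_path": "/Repos/example/task2"},
        },
    ]
}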

The documentation below also explains the Repair and rerun feature of workflows, which addresses your specific scenario, but only at the individual-run level.

https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/how-to-fix-job-failures

Unfortunately, there is currently no mechanism to pause a workflow's schedule after the first failure. However, you can configure email alerts on failure and, upon receiving an alert, manually pause the schedule. We will raise this internally as a feature request to pause the schedule on failure, and it will be picked up based on prioritization. Thanks for bringing this up!
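
As a rough sketch of the alerting option (the workspace URL, token, job ID, and email address are placeholders), failure emails can be set on the job through the same Jobs API 2.1 update endpoint:

import requests

# Placeholders: fill in your workspace URL, token, job ID, and address.
resp = requests.post(
    "https://<databricks-instance>/api/2.1/jobs/update",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "job_id": 11223344,
        "new_settings": {
            "email_notifications": {"on_failure": ["you@example.com"]}
        },
    },
)
resp.raise_for_status()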

Dipesh
New Contributor II

@Shanmugavel Chandrakasu Thank you for your response. We have enabled Databricks alerts, but that would be a problem during weekends and holidays 🙂

Also, the data gets updated after each run, so repairing the run after we detect the failure would mean losing some data.

Looking forward to this new feature.

Thanks again. 🙏

Hubert-Dudek
Esteemed Contributor III

You can pause a job using the Jobs REST API. Just call it from the notebook when you catch the exception: https://<databricks-instance>/api/2.1/jobs/update

{
   "job_id":11223344,
   "new_settings":{
      "schedule":{
         "pause_status":"PAUSED"
      }
   }
}

More info here: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsUpdate
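
For example, a minimal sketch of that approach from inside the notebook (the job ID, workspace URL, and token are placeholders; in practice the token would come from a secret scope, and run_pipeline stands in for the notebook's real work):

import requests

JOB_ID = 11223344                       # placeholder job ID
HOST = "https://<databricks-instance>"  # placeholder workspace URL
TOKEN = "<personal-access-token>"       # e.g. via dbutils.secrets.get(...)

def pause_schedule():
    # Set the schedule's pause_status to PAUSED via Jobs API 2.1,
    # mirroring the payload above.
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/update",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "job_id": JOB_ID,
            "new_settings": {"schedule": {"pause_status": "PAUSED"}},
        },
    )
    resp.raise_for_status()

try:
    run_pipeline()    # placeholder for the actual job logic
except Exception:
    pause_schedule()  # stop future scheduled runs until the issue is fixed
    raise             # re-raise so this run is still marked as failed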

Dipesh
New Contributor II

Hi @Hubert Dudek, thank you for your suggestion.

I understand that we can use the Jobs API to change the pause_status of the job on errors, but we have observed that the workflow/job sometimes fails due to cluster issues (while the job cluster is being created), before any of our code gets executed. In such scenarios, I was wondering if there is any way to automatically pause the job.
