
Pausing a scheduled Azure Databricks job after failure

Dipesh
New Contributor II

Hi All,

I have a job/workflow scheduled in Databricks to run every hour.

How can I configure my job to pause whenever a job run fails? (Pause the job/workflow on the first failure.)

I want to prevent further runs from being triggered by the still-active (unpaused) schedule after the first failure, and then resume the schedule once the issue is fixed.

Thank you.

4 REPLIES

shan_chandra
Esteemed Contributor

@Dipesh Yogi​ - Please refer to the current behavior.

When you configure task dependencies in a workflow so that task2 does not start until task1 completes, a failure in task1 means the downstream tasks are not triggered, and you see the message below:

Task <Task-name> failed. This caused all downstream tasks to get skipped.

Reference - https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/jobs#--task-dependencies.

The documentation below also explains the Repair and rerun feature of workflows, which addresses your specific scenario, but only at the level of an individual run.

https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/how-to-fix-job-failures

Unfortunately, there is currently no mechanism to pause a workflow schedule after the first failure. However, you can set up email alerts on failure and, upon receiving an alert, manually pause the schedule. We will work internally on this new feature request to pause the schedule, and it will be picked up based on prioritization. Thanks for bringing this up!

Dipesh
New Contributor II

@Shanmugavel Chandrakasu Thank you for your response. We have enabled Databricks alerts, but relying on them would be a problem during weekends and holidays 🙂

Also, the data gets updated after each run, so repairing the run after we detect the failure would lead to us losing some data.

Looking forward to this new feature.

Thanks again.🙏

Hubert-Dudek
Esteemed Contributor III

You can pause a job using the Jobs REST API. Just call it from the notebook when you catch the exception: https://<databricks-instance>/api/2.1/jobs/update

{
   "job_id":11223344,
   "new_settings":{
      "schedule":{
         "pause_status":"PAUSED"
      }
   }
}

more info here https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsUpdate
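For what it's worth, here is a minimal Python sketch of how that call might look from inside a notebook. It is only an illustration under some assumptions: `HOST`, `TOKEN`, `JOB_ID`, `run_pipeline`, `build_pause_payload`, `call_jobs_api` and `pause_job` are hypothetical names, and the token is assumed to be a personal access token with permission to manage the job. As far as I know, `jobs/update` replaces top-level fields of `new_settings` as a whole, so the sketch first reads the existing schedule with `jobs/get` and sends it back with only `pause_status` changed, rather than sending `pause_status` alone.

```python
import json
import urllib.request


def build_pause_payload(job_id, schedule):
    """Build a jobs/update body that keeps the existing schedule but pauses it."""
    paused = dict(schedule)  # copy so the caller's dict is not modified
    paused["pause_status"] = "PAUSED"
    return {"job_id": job_id, "new_settings": {"schedule": paused}}


def call_jobs_api(host, token, method, path, body=None):
    """Send one request to the workspace REST API and return the parsed JSON."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        f"{host}{path}",
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        raw = response.read()
        return json.loads(raw) if raw else {}


def pause_job(host, token, job_id):
    """Read the job's current schedule, then write it back as PAUSED."""
    job = call_jobs_api(host, token, "GET", f"/api/2.1/jobs/get?job_id={job_id}")
    schedule = job["settings"]["schedule"]
    payload = build_pause_payload(job_id, schedule)
    call_jobs_api(host, token, "POST", "/api/2.1/jobs/update", payload)


# Hypothetical usage inside the notebook (HOST, TOKEN, JOB_ID and
# run_pipeline are placeholders, not part of any Databricks API):
#
# try:
#     run_pipeline()
# except Exception:
#     pause_job(HOST, TOKEN, JOB_ID)
#     raise  # re-raise so the run is still marked as failed
```

Re-raising after pausing keeps the run marked as failed, so failure alerts still fire. Resuming later would be the same update with `pause_status` set to `UNPAUSED`.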

Dipesh
New Contributor II

Hi @Hubert Dudek, thank you for your suggestion.

I understand that we can use the Jobs API to change the pause_status of the job on errors, but we have sometimes observed the workflow/job failing due to cluster issues (while the job clusters are being created), before any of our code gets executed. In such scenarios, I was wondering if there is any way to automatically pause the job.
