Pausing a scheduled Azure Databricks job after failure
01-31-2023 06:27 AM
Hi All,
I have a job/workflow scheduled in Databricks to run every hour.
How can I configure my job to pause whenever a run fails? (Pause the job/workflow on the first failure.)
I want to prevent multiple runs from being triggered by the still-active schedule after the first failure, and then resume the schedule once the issue is fixed.
Thank you.
- Labels: Azure, Azure Databricks, Job Run
01-31-2023 09:41 AM
@Dipesh Yogi - Please refer to the current behavior.
When you configure task dependencies within a workflow (for example, task2 is set to start only after task1 completes), a failed task causes all of its downstream tasks to be skipped, with the message:
Task <Task-name> failed. This caused all downstream tasks to get skipped.
Reference - https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/jobs#--task-dependencies.
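For reference, a rough sketch of what such a dependency looks like in the job's task settings, written here as a Python dict in the shape of the Jobs 2.1 payload; the task keys and notebook paths are placeholders:

# Sketch of a two-task job definition where task2 only runs after task1 succeeds.
# Task keys and notebook paths are placeholders.
tasks_fragment = {
    "tasks": [
        {
            "task_key": "task1",
            "notebook_task": {"notebook_path": "/Workspace/etl/step1"},
        },
        {
            "task_key": "task2",
            "notebook_task": {"notebook_path": "/Workspace/etl/step2"},
            # If task1 fails, task2 is skipped with the message quoted above.
            "depends_on": [{"task_key": "task1"}],
        },
    ]
}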
The documentation below also explains the Repair and rerun feature of workflows, which addresses your scenario, but only at the individual run level.
https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/how-to-fix-job-failures
Unfortunately, there is currently no mechanism to pause a workflow schedule after the first failure. However, you can configure email alerts on failure and, upon receiving an alert, manually pause the schedule. We will raise this internally as a feature request to pause the schedule automatically, and it will be picked up based on prioritization. Thanks for bringing this up!
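As a sketch, such failure alerts can be declared in the job settings (Python dict form of the Jobs 2.1 fragment; the recipient address is a placeholder):

# Fragment of job settings that sends an email on every failed run,
# so someone can pause the schedule manually. Recipient is a placeholder.
notification_fragment = {
    "email_notifications": {
        "on_failure": ["data-team@example.com"],
    }
}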
01-31-2023 08:50 PM
@Shanmugavel Chandrakasu Thank you for your response. We have enabled Databricks alerts, but responding to them would be a problem during weekends and holidays 🙂
Also, the data gets updated after each run, so repairing the run after we detect the failure would lead to us losing some data.
Looking forward to this new feature.
Thanks again.🙏
01-31-2023 11:16 AM
You can pause a job using the Jobs REST API. Just call it from the notebook when you catch the exception: POST https://<databricks-instance>/api/2.1/jobs/update
{
  "job_id": 11223344,
  "new_settings": {
    "schedule": {
      "pause_status": "PAUSED"
    }
  }
}
More info here: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsUpdate
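A minimal sketch of that approach from inside the notebook, assuming Python with the requests library; the workspace URL, secret scope/key, job ID, and the run_etl() helper are placeholders to adapt to your environment:

import requests

DATABRICKS_HOST = "https://<databricks-instance>"  # placeholder workspace URL
TOKEN = dbutils.secrets.get(scope="my-scope", key="jobs-api-token")  # placeholder secret
JOB_ID = 11223344  # placeholder job ID

def pause_job(job_id: int) -> None:
    # Set the job's schedule to PAUSED via the Jobs 2.1 update endpoint.
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/update",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": job_id, "new_settings": {"schedule": {"pause_status": "PAUSED"}}},
    )
    resp.raise_for_status()

try:
    run_etl()  # hypothetical helper standing in for the notebook's real logic
except Exception:
    pause_job(JOB_ID)  # stop further scheduled runs until the issue is fixed
    raise  # re-raise so this run is still recorded as failed

Once the issue is fixed, the same endpoint with "pause_status": "UNPAUSED" resumes the schedule.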
01-31-2023 08:53 PM
Hi @Hubert Dudek, thank you for your suggestion.
I understand that we can use the Jobs API to change the pause_status of the job on errors, but we have sometimes observed that the workflow/job fails due to cluster issues (while the job cluster is being created), before any of our code gets executed. In such scenarios, I was wondering if there is any way to automatically pause the job.