Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Continuous Job - How to set max_retries

Kirankumarbs
Contributor

Hello Community,

I have a couple of continuous workflows (jobs) running in production, and they’ve been working well so far. However, we’re seeing some transient failures that are causing the entire job to restart — which I’d prefer to avoid.

While we investigate and address the root cause, we'd like to buy some time by putting a temporary solution in place…

      continuous:
        pause_status: UNPAUSED
        task_retry_mode: ON_FAILURE

And I can see that it's configured with 4 retries by default.

[Screenshot: task retry configuration showing the default retry count]

My question is: is there a way to increase the maximum retries to 10 or set it to a custom number?

I also checked the documentation, but nothing specifically mentions setting a custom retry count!

1 ACCEPTED SOLUTION


szymon_dybczak
Esteemed Contributor III

Hi @Kirankumarbs ,

It's a limitation - when Task retry mode is set to On failure, failed tasks are retried with an exponentially increasing delay until the maximum number of allowed retries is reached (three for a single task job).

Run jobs continuously | Databricks on AWS


8 REPLIES

Kirankumarbs
Contributor

Yep, that's exactly what I mentioned in the post! I was wondering whether there is something I could do via the Jobs API, or get it done in some other way.

Thanks for the reply!

szymon_dybczak
Esteemed Contributor III

Hi @Kirankumarbs ,

No, unfortunately this is a limitation.

IM_01
Contributor

Hi 

If serverless auto-optimization is enabled, the retry count defaults to 3. Setting disable_auto_optimization: true for a task may help avoid those automatic retries, but it may add latency to compute startup and slow task execution.

Kirankumarbs
Contributor

Hi @IM_01,

Thanks for the reply!

Can you explain a little more? What do you mean by disable_auto_optimization: true?

As I mentioned in the post description, retries are fine by me; in fact, I would like to increase the retry count if possible!

I have a plan to fix the root cause, or the architectural issue in my design, but I'm just trying to buy some time with these retries!

IM_01
Contributor

@Kirankumarbs the option would remove the default retries; you could then add custom retry logic within your script and change the number of retries dynamically via a task parameter.
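For example, a minimal sketch of that idea in Python. In a Databricks notebook the count would typically come from a task parameter (e.g. via dbutils.widgets); here it is a plain argument so the sketch stays self-contained, and the names are illustrative:

```python
import time

def run_with_retries(operation, max_retries, delay_seconds=5.0):
    # In a Databricks notebook, max_retries could come from a task
    # parameter, e.g. int(dbutils.widgets.get("max_retries")); it is
    # passed as a plain argument here to keep the sketch self-contained.
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception as exc:  # narrow this to your transient error types
            last_error = exc
            if attempt < max_retries:
                time.sleep(delay_seconds)
    raise last_error
```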
hope this helps 🙂 

Kirankumarbs
Contributor

Ahh, I see!

We are using an on-demand job cluster, not serverless! As per the documentation:

disable_auto_optimization (boolean, default: false, example: true)
An option to disable auto optimization in serverless

 

SteveOstrowski
Databricks Employee

Hi @Kirankumarbs,

Thanks for posting this. I can see from the thread that the existing replies confirmed the limitation but did not offer much in the way of alternatives. Let me provide a fuller picture.


THE SHORT ANSWER

You cannot set a custom max_retries value on a continuous job. The documentation states: "You cannot use retry policies in a continuous job." The 3 task-level retries you see for a single-task continuous job are system-defined and not configurable.


HOW CONTINUOUS JOB RETRIES ACTUALLY WORK

Continuous jobs have a two-tier retry system that is fully managed by Databricks:

1. TASK-LEVEL RETRIES (first tier)
When "task_retry_mode" is set to "ON_FAILURE" (which you already have configured), a failed task is retried with an exponentially increasing delay. For a single-task job, the maximum is 3 retries. You cannot change this number.

2. JOB-LEVEL RESTARTS (second tier)
Once task-level retries are exhausted, the entire run is canceled and a brand new run is triggered automatically. If that new run also fails, the delay between restarts increases (exponential backoff). This continues indefinitely -- there is no limit on job-level restarts for continuous jobs. The backoff delay eventually caps at a system-defined maximum, and restarts continue at that interval until you pause the job.

3. RECOVERY AND RESET
If a run completes successfully, or if the run exceeds a threshold period without failure, the job is considered healthy and the backoff sequence resets back to short intervals.

The important takeaway: even though you only get 3 task-level retries per run, the job will never permanently stop. It will keep launching new runs with exponential backoff. So your continuous job will recover from transient failures -- the question is just how quickly.
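To make the backoff shape concrete, here is a toy Python illustration. The base delay, growth factor, and cap are invented numbers purely to show the pattern; Databricks does not document the exact system-defined values:

```python
def restart_delays(base_seconds=60, factor=2, cap_seconds=3600, restarts=8):
    # Toy model of job-level restart backoff: each delay doubles until it
    # hits a cap, then stays there. The real delays are system-defined.
    delays = []
    delay = base_seconds
    for _ in range(restarts):
        delays.append(min(delay, cap_seconds))
        delay *= factor
    return delays

# restart_delays() -> [60, 120, 240, 480, 960, 1920, 3600, 3600]
```

A healthy run would reset this sequence back to the start.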


WHAT YOU CAN DO TO IMPROVE RESILIENCE

Since you mentioned you want to buy time while you fix the root cause, here are some practical approaches:

1. ADD ERROR HANDLING DIRECTLY IN YOUR CODE
Build retry logic into your notebook or script for the specific operations that fail transiently. For example, wrap API calls or external connections in a retry loop with configurable count and delay. This gives you full control over retry behavior for specific failure modes, independent of the job-level retry system.
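A hypothetical sketch of that pattern as a Python decorator (the names and defaults here are illustrative, not a Databricks API):

```python
import functools
import time

def retry(max_attempts=5, base_delay=1.0, factor=2.0,
          exceptions=(ConnectionError, TimeoutError)):
    # Retry only the listed transient exception types, with an
    # exponentially growing delay between attempts.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise
                    time.sleep(delay)
                    delay *= factor
        return wrapper
    return decorator
```

Wrap only the calls that actually fail transiently (an external API request, a JDBC connection), rather than the whole notebook, so genuine bugs still fail fast.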

2. USE STRUCTURED STREAMING FAULT TOLERANCE
If your continuous job runs a Structured Streaming workload, it already has built-in checkpointing and fault tolerance. When the job restarts, the stream picks up from where it left off. The combination of streaming checkpoints and the continuous job's automatic restart behavior means transient failures are handled gracefully.
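In Structured Streaming the checkpointing itself is managed by Spark (you set .option("checkpointLocation", "<path>") on the writeStream). The resume-from-last-committed-offset principle it relies on can be sketched in plain Python as a toy model (this is not the Spark API):

```python
import json
import os

def process_with_checkpoint(records, checkpoint_path):
    # Toy model of checkpoint-based recovery: read the last committed
    # offset, process only newer records, and commit after each one.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["offset"]
    results = []
    for offset in range(start, len(records)):
        results.append(records[offset] * 2)  # the "work" for this record
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": offset + 1}, f)  # commit progress
    return results
```

If the process dies and restarts, already-committed records are skipped, which is what a restarted continuous streaming run does at a larger scale.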

3. SWITCH TO A TRIGGERED JOB ON A TIGHT SCHEDULE
If you truly need a configurable retry count, consider using a triggered job scheduled on a short interval (e.g., every 1 minute) instead of a continuous job. Triggered jobs support:
- max_retries (set to 10 or any number you want)
- min_retry_interval_millis (delay between retries)
- retry_on_timeout (true/false)

In your DABs YAML, that would look like:

      resources:
        jobs:
          my-job:
            name: my-job
            trigger:
              periodic:
                interval: 1
                unit: MINUTES
            tasks:
              - task_key: my-task
                max_retries: 10
                min_retry_interval_millis: 60000
                retry_on_timeout: false
                notebook_task:
                  notebook_path: ./my_notebook.py

The trade-off is that triggered jobs do not automatically restart after the run completes -- each run is independent. But with a 1-minute schedule you get near-continuous behavior with full retry control.

4. PROGRAMMATIC MONITORING AND CONTROL
You can use the Jobs API to monitor your continuous job and take action on repeated failures. For example, a lightweight monitoring job could check run history and send alerts or pause/unpause the job as needed.
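A sketch of the decision logic such a monitor might use. The threshold and the simplified run-state shape are assumptions; the real payload from the Jobs API run list (GET /api/2.1/jobs/runs/list) is richer, and pausing would be done via POST /api/2.1/jobs/update by setting the continuous block's pause_status to PAUSED:

```python
def should_pause(recent_states, max_consecutive_failures=5):
    # recent_states: newest-first result states for a job's runs, e.g.
    # ["FAILED", "FAILED", "SUCCESS", ...] -- a simplified view of what
    # the Jobs API run list returns. Pause once the newest runs form an
    # unbroken failure streak of the given length.
    streak = 0
    for state in recent_states:
        if state == "FAILED":
            streak += 1
        else:
            break
    return streak >= max_consecutive_failures
```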


WHY CONTINUOUS JOBS DO NOT EXPOSE max_retries

The design philosophy for continuous jobs is that they should always be running. Rather than failing permanently after N retries, the system uses exponential backoff to handle transient issues while avoiding tight retry loops that waste resources. After task-level retries are exhausted, a new run starts automatically, so the job is self-healing by design.


DOCUMENTATION REFERENCES

- Continuous jobs overview and failure handling:
https://docs.databricks.com/aws/en/jobs/continuous

- Task configuration (retry settings for triggered jobs):
https://docs.databricks.com/aws/en/jobs/configure-task

- Databricks Asset Bundles job configuration:
https://docs.databricks.com/aws/en/dev-tools/bundles/reference

Hope this helps clarify the behavior. The key takeaway is that your continuous job will never permanently stop retrying -- it just restarts with increasing delays between runs. If you need precise control over retry counts, a triggered job on a tight schedule is the way to go.

* This reply used an agent system I built to research and draft the response from the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.