Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to stop task retry

senkii
Databricks Partner

I would like to stop automatic retries, but the max retries configuration does not seem to work.
Could you please tell me how to disable retries? I would also like to understand why the task retries automatically.

I did not set any schedule; I created the transformation job manually.

My goal is to transform data in the bronze table (stored as strings) into the correct data types (e.g., decimal, int) and save the results into a silver table.
If some records cannot be converted to the target data type, I want a validation error to occur and for those records to be written to an error table.

The issue is that when a validation error occurs, the task in the job is automatically retried in most cases. This causes the transformation process to run twice and results in the same failure, which doubles the execution time unnecessarily.

I do not think this is related to data size. Even when I tested with a CSV file of only 7.89 KB, the job was still retried.

I want to disable this retry behavior, but the max retries configuration does not work.

[screenshots: senkii_0-1771320879821.png, senkii_1-1771320966643.png, senkii_2-1771321009007.png]

1 ACCEPTED SOLUTION

saurabh18cs
Honored Contributor III

Hi @senkii, do this:

[screenshot: saurabh18cs_0-1771326691624.png]

 

SteveOstrowski
Databricks Employee

Hi @senkii,

There are two separate retry mechanisms in Databricks that can cause tasks to run again, and distinguishing between them is important for your situation.

1. TASK-LEVEL RETRIES (Workflows setting)

This is the "Retries" setting you configure in the task UI. By default, retries are set to 0 for triggered (non-continuous) jobs, meaning no automatic task retry should occur. To confirm this is disabled:

- Open your job in the Workflows UI.
- Click on the task.
- In the task configuration panel, look for the "Retries" section.
- Make sure it either says "No retries" or shows 0 retries.
- If a retry policy exists, remove it by clicking the X next to it.

If you are using the Jobs API or Databricks Asset Bundles, confirm that max_retries is set to 0 in your task definition:

"retry_on_timeout": false,
"max_retries": 0
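
In a full Jobs API task definition these fields sit at the task level. A minimal sketch for orientation (the task key and notebook path below are placeholders, not taken from the original job):

```json
{
  "task_key": "transform_bronze_to_silver",
  "notebook_task": { "notebook_path": "/Workspace/path/to/transform" },
  "max_retries": 0,
  "retry_on_timeout": false
}
```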

Documentation: https://docs.databricks.com/en/jobs/configure-task.html

2. SPARK STAGE-LEVEL RETRIES (the likely culprit)

Even with task-level retries set to 0, Spark itself can retry failed work internally. When a task fails inside a stage (for example, because a data conversion error throws an exception on an executor), Spark retries that task up to spark.task.maxFailures attempts (the default is 4) before failing the stage. This is Spark-level behavior, not a Workflows-level retry.

This is the most common reason users see "retries" happening when they believe they have already disabled them. The Spark stage retry appears in the Spark UI as repeated stage attempts, and can make it look like the entire job is being retried.

To disable Spark stage retries, set this Spark configuration on your job cluster or task compute:

spark.task.maxFailures 1

You can set this in your job cluster's Spark Config section (Advanced Options > Spark > Spark Config), or via the API in the spark_conf field:

"spark_conf": {
  "spark.task.maxFailures": "1"
}

Setting this to 1 means Spark will not retry any failed task and the stage will fail immediately on the first error.

3. SERVERLESS COMPUTE AUTO-OPTIMIZED RETRIES

If your job runs on serverless compute, there is an additional "auto-optimized retries" feature that is enabled by default. This allows the system to automatically retry tasks that fail due to transient issues. You can disable this in the task configuration:

- In the task settings, look for the "Retries" section.
- If you see "Auto-optimized by Databricks" or similar wording, click to expand it.
- Toggle off the auto-optimized retries option, or explicitly set retries to 0.
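
If the job is defined through the Jobs API or Asset Bundles, my understanding is that the same behavior is controlled by a task-level disable_auto_optimization flag (please double-check this field name against the current Jobs API reference for your workspace):

```json
"disable_auto_optimization": true,
"max_retries": 0
```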

RECOMMENDATION FOR YOUR USE CASE

Since you are transforming bronze data to silver and expecting validation errors on bad records, consider using a try/except pattern that catches conversion errors within your notebook logic rather than letting them bubble up as task failures. This way:

- The task completes successfully even when some records fail validation.
- Bad records are written to your error table within the same run.
- No retries are triggered at any level.

For example, if you are using SQL:

INSERT INTO silver_table
SELECT
  TRY_CAST(col1 AS DECIMAL(10,2)) AS col1,
  TRY_CAST(col2 AS INT) AS col2
FROM bronze_table
WHERE TRY_CAST(col1 AS DECIMAL(10,2)) IS NOT NULL
  AND TRY_CAST(col2 AS INT) IS NOT NULL;

INSERT INTO error_table
SELECT *, 'type_conversion_error' AS error_reason
FROM bronze_table
WHERE TRY_CAST(col1 AS DECIMAL(10,2)) IS NULL
  OR TRY_CAST(col2 AS INT) IS NULL;

TRY_CAST returns NULL instead of raising an error when conversion fails, so your job will never fail due to bad data, and you get clean separation of valid and invalid records.
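
If you are doing the conversion in Python rather than SQL, the same idea — cast defensively and route failures to the error table instead of letting an exception fail the task — can be sketched in plain Python. The column names and sample bronze rows below are illustrative only:

```python
# A minimal pure-Python sketch of the catch-and-route pattern, for when
# the transform logic lives in a Python notebook cell instead of SQL.
# On Spark you would express this with DataFrame operations, but the
# control flow (convert, and route failures instead of raising) is the same.
from decimal import Decimal, InvalidOperation

def try_cast_row(row):
    """Return the converted row, or None when any cast fails."""
    try:
        return {"col1": Decimal(row["col1"]), "col2": int(row["col2"])}
    except (InvalidOperation, ValueError, TypeError):
        return None

def split_records(bronze_rows):
    """Split rows into (silver, errors) so bad data never raises."""
    silver, errors = [], []
    for row in bronze_rows:
        converted = try_cast_row(row)
        if converted is not None:
            silver.append(converted)
        else:
            errors.append({**row, "error_reason": "type_conversion_error"})
    return silver, errors

bronze = [
    {"col1": "12.50", "col2": "3"},         # casts cleanly -> silver
    {"col1": "not-a-number", "col2": "3"},  # fails the decimal cast -> errors
]
silver, errors = split_records(bronze)
```

In a real run, the silver and errors collections would be written to the silver and error tables respectively; the key point is that no exception escapes the task, so nothing triggers a retry at any level.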

Documentation on TRY_CAST: https://docs.databricks.com/en/sql/language-manual/functions/try_cast.html

* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation I have available and from previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the response when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.