Hi @senkii,
There are two separate retry mechanisms in Databricks that can cause tasks to run again, and distinguishing between them is important for your situation.
1. TASK-LEVEL RETRIES (Workflows setting)
This is the "Retries" setting you configure in the task UI. By default, retries are set to 0 for triggered (non-continuous) jobs, meaning no automatic task retry should occur. To confirm this is disabled:
- Open your job in the Workflows UI.
- Click on the task.
- In the task configuration panel, look for the "Retries" section.
- Make sure it either says "No retries" or shows 0 retries.
- If a retry policy exists, remove it by clicking the X next to it.
If you are using the Jobs API or Databricks Asset Bundles, confirm that max_retries is set to 0 in your task definition:
"retry_on_timeout": false,
"max_retries": 0
Documentation: https://docs.databricks.com/en/jobs/configure-task.html
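For context, a minimal task definition with retries fully disabled might look like this (the task_key and notebook_path values are placeholders, not from your job):

```json
{
  "task_key": "bronze_to_silver",
  "notebook_task": {
    "notebook_path": "/Workspace/etl/bronze_to_silver"
  },
  "max_retries": 0,
  "retry_on_timeout": false
}
```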
2. SPARK-INTERNAL TASK RETRIES (the likely culprit)
Even with Workflows retries set to 0, Spark itself retries failed tasks internally. When a task fails (for example, because a data conversion error is thrown on an executor), the Spark scheduler reschedules that task up to spark.task.maxFailures times (default is 4) before failing the stage. This is Spark scheduler behavior, not a Workflows-level retry.
This is the most common reason users see "retries" happening after they believe they have disabled them. The retries appear in the Spark UI as repeated task attempts within a stage and can make it look like the entire job is being re-run.
To disable these Spark-internal retries, set this Spark configuration on your job cluster or task compute:
spark.task.maxFailures 1
You can set this in your job cluster's Spark Config section (Advanced Options > Spark > Spark Config), or via the API in the spark_conf field:
"spark_conf": {
"spark.task.maxFailures": "1"
}
Setting this to 1 means Spark will not re-attempt any failed task; the stage, and therefore the job run, fails immediately on the first error. Be aware that this also removes retries for transient failures such as executor or spot-instance loss, so apply it only where fail-fast behavior is genuinely what you want.
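If you define the job cluster through the API as well, the setting sits inside the cluster spec rather than the task. A trimmed sketch (the Spark version, node type, and worker count are placeholder values; adjust to your environment):

```json
{
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "spark_conf": {
      "spark.task.maxFailures": "1"
    }
  }
}
```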
3. SERVERLESS COMPUTE AUTO-OPTIMIZED RETRIES
If your job runs on serverless compute, there is an additional "auto-optimized retries" feature that is enabled by default. This allows the system to automatically retry tasks that fail due to transient issues. You can disable this in the task configuration:
- In the task settings, look for the "Retries" section.
- If you see "Auto-optimized by Databricks" or similar wording, click to expand it.
- Toggle off the auto-optimized retries option, or explicitly set retries to 0.
RECOMMENDATION FOR YOUR USE CASE
Since you are transforming bronze data to silver and expecting validation errors on bad records, consider using a try/except pattern that catches conversion errors within your notebook logic rather than letting them bubble up as task failures. This way:
- The task completes successfully even when some records fail validation.
- Bad records are written to your error table within the same run.
- No retries are triggered at any level.
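As a minimal pure-Python sketch of that catch-and-route pattern (the record layout, column name col1, and helper names are hypothetical; in a real notebook you would write the two lists to your silver and error tables instead of keeping them in memory):

```python
from decimal import Decimal, InvalidOperation

def try_cast_decimal(value):
    """Mimic SQL TRY_CAST: return a Decimal, or None when conversion fails."""
    try:
        return Decimal(str(value))
    except (InvalidOperation, TypeError, ValueError):
        return None

def split_records(records):
    """Partition bronze records into (valid, errors) instead of raising.

    Bad records are tagged with an error_reason so they can be written
    to the error table within the same run; nothing bubbles up as a
    task failure, so no retry is ever triggered.
    """
    valid, errors = [], []
    for rec in records:
        casted = try_cast_decimal(rec.get("col1"))
        if casted is not None:
            valid.append({**rec, "col1": casted})
        else:
            errors.append({**rec, "error_reason": "type_conversion_error"})
    return valid, errors

bronze = [{"col1": "12.50"}, {"col1": "abc"}]
valid, errors = split_records(bronze)
# valid holds the parsed record; errors holds the bad one, tagged with a reason
```

The same separation is expressed more efficiently in SQL with TRY_CAST, as shown in the example that follows.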
For example, if you are using SQL:
INSERT INTO silver_table
SELECT
TRY_CAST(col1 AS DECIMAL(10,2)) AS col1,
TRY_CAST(col2 AS INT) AS col2
FROM bronze_table
WHERE TRY_CAST(col1 AS DECIMAL(10,2)) IS NOT NULL
AND TRY_CAST(col2 AS INT) IS NOT NULL;
INSERT INTO error_table
SELECT *, 'type_conversion_error' AS error_reason
FROM bronze_table
WHERE TRY_CAST(col1 AS DECIMAL(10,2)) IS NULL
OR TRY_CAST(col2 AS INT) IS NULL;
TRY_CAST returns NULL instead of raising an error when a conversion fails, so your job will never fail due to bad data, and you get a clean separation of valid and invalid records. One caveat: values that are already NULL in the bronze table will also satisfy the error predicate above; if NULLs are legitimate in your data, add an explicit col1 IS NOT NULL (and col2 IS NOT NULL) guard to distinguish genuine conversion failures from missing values.
Documentation on TRY_CAST: https://docs.databricks.com/en/sql/language-manual/functions/try_cast.html
* This reply was drafted with an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the answer when I detect any drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.