Data Engineering

Supporting file not recognized in DLT pipeline

Muralidharan_A
New Contributor

We have a DLT pipeline that creates several tables. The tables are built from transformations that are kept as functions in a separate file, and that file is pulled in with an import.
We deploy these changes to Databricks via Terraform. Normally the pipeline runs without any issue, but sometimes, after deploying a code change, it fails saying the supporting file does not exist. If we redeploy the same changes, the pipeline runs fine again.

However, in one of the DLT pipelines we use retry_on_failure. When a similar issue occurs there, the pipeline fails at first but eventually succeeds on the next run, which is triggered by that option.

Now my question: after the Terraform deploy, if the first run fails, we do a manual refresh, which should be similar to retry_on_failure, but it still fails. What could be the reason, and does retry_on_failure do something more than just a refresh?

1 REPLY

Ashwin_DSA
Databricks Employee

Hi @Muralidharan_A,

To your question about whether retry_on_failure does more than a manual refresh, the answer is yes!

retry_on_failure (along with pipelines.numUpdateRetryAttempts and pipelines.maxFlowRetryAttempts) performs classified, timed retries within the same update on the same cluster. A manual Refresh is a brand-new update with none of that handling.

Lakeflow Spark Declarative Pipelines only auto-retries errors it classifies as retryable (transient I/O, library resolution, file-system races). A manual Refresh reruns regardless of error type, so a deterministic failure will fail again. The automatic retry fires seconds later, by which time the supporting file has usually propagated; a manual Refresh triggered immediately after the failure re-enters the same race.

So, in your case: after a Terraform deploy there's a brief window where the pipeline definition is live but the imported Python file isn't yet fully visible to the DLT cluster. The first run fails with "supporting file does not exist." retry_on_failure waits and retries within the same update, by which point the file has propagated. A manual refresh starts a new update too quickly, hits the same problem, and keeps failing until you redeploy (which effectively gives the file system enough time to catch up).

The best fix would be to add depends_on in Terraform so the pipeline resource waits for the supporting files/wheels to exist before it is created or updated. You can also declare the helper code as a pipeline library (a wheel via libraries { whl = ... } or a direct notebook/file reference) instead of an ad-hoc import. That makes Spark Declarative Pipelines aware of the dependency at definition time rather than discovering it at import time.
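Here is a minimal Terraform sketch of that ordering (resource names and workspace paths are hypothetical, and the exact library block syntax can vary by provider version):

```hcl
# Upload the helper module first, then create/update the pipeline.
resource "databricks_workspace_file" "helpers" {
  source = "${path.module}/src/helpers.py"  # local helper file (hypothetical path)
  path   = "/Shared/etl/helpers.py"         # workspace destination (hypothetical path)
}

resource "databricks_pipeline" "etl" {
  name = "my-dlt-pipeline"

  # Main pipeline source notebook.
  library {
    notebook {
      path = "/Shared/etl/pipeline"
    }
  }

  # Declare the helper as a pipeline library so the dependency is known
  # at definition time instead of being discovered at import time.
  library {
    file {
      path = databricks_workspace_file.helpers.path
    }
  }

  # Explicit ordering: the pipeline is not created or updated until the
  # helper file exists in the workspace.
  depends_on = [databricks_workspace_file.helpers]
}
```

Note that referencing databricks_workspace_file.helpers.path already gives Terraform an implicit dependency; the explicit depends_on just makes the ordering unmistakable.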

Another tip is to set pipelines.numUpdateRetryAttempts and/or pipelines.maxFlowRetryAttempts in all pipeline configs so transient deploy-time races self-heal without manual intervention.
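For example, those keys go into the pipeline's configuration map as plain strings (the retry counts here are illustrative; tune them for your environment):

```hcl
resource "databricks_pipeline" "etl" {
  name = "my-dlt-pipeline"

  configuration = {
    # Retry the whole update when it fails with an error classified as retryable.
    "pipelines.numUpdateRetryAttempts" = "2"
    # Retry individual flows within an update before failing the update.
    "pipelines.maxFlowRetryAttempts"   = "2"
  }

  # ... library and cluster blocks as before
}
```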

If you keep import, consider %run for helper files to avoid Python's module cache. The wheel approach is cleaner for production.
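If you go the wheel route, the attachment might look like this (hypothetical wheel path, following the libraries { whl = ... } shape mentioned above; newer runtimes may prefer other packaging mechanisms, so check your provider and runtime docs):

```hcl
resource "databricks_pipeline" "etl" {
  name = "my-dlt-pipeline"

  # Helpers packaged as a wheel and attached as a pipeline library,
  # so they are resolved before any pipeline code runs.
  library {
    whl = "/Shared/etl/dist/helpers-0.1.0-py3-none-any.whl"
  }

  library {
    notebook {
      path = "/Shared/etl/pipeline"
    }
  }
}
```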

Hope this helps.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***