Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Job tasks failing with error "Failed to fetch SQL file" when file exists

eric-c
New Contributor II

I have a job with anywhere from 500-1000 SQL tasks. Each SQL task uses a SQL warehouse and runs a SQL script stored at a workspace path like /Workspace/folder/file.sql. A SQL task will fail with the error:

Run failed with error message 
Failed to fetch SQL file. Please make sure that a SQL file exists at '/Workspace/<path to file>'

The file exists (I know because clicking the link in the job task opens the file), and executing a "repair run" usually succeeds.

3 REPLIES

SteveOstrowski
Databricks Employee

Hi @eric-c,

This comes up fairly often, so let me break it down. The fact that the file exists (you can open it by clicking the link) and that repair runs succeed tells me this is a transient issue related to having a very large number of SQL tasks (500-1000) all trying to resolve their source files at approximately the same time, rather than a misconfiguration problem. Let me explain what is likely happening and how to address it.


MOST LIKELY CAUSE: RATE LIMITING ON FILE RESOLUTION

When a job with hundreds of SQL file tasks kicks off, the Databricks job scheduler needs to fetch and validate each SQL file before execution. If your SQL files are stored in a Git folder (which is accessed through the /Workspace/ path), the internal API calls to resolve those files are subject to rate limits.

Specifically, the Repos API has a combined rate limit of 10 requests per second across all /repos/* endpoints. With 500-1000 tasks all launching and trying to fetch their SQL files in a short window, it is very likely that some of those fetch requests are being throttled or timing out, resulting in the "Failed to fetch SQL file" error. The fact that repair runs succeed confirms this -- by the time you run a repair, only the failed tasks are retried, which is a much smaller number of concurrent file fetches.


RECOMMENDED SOLUTIONS

1. Use the "Git provider" source instead of "Workspace" source

Rather than pointing each SQL task to a /Workspace/ path inside a Git folder, configure the job to use the "Git provider" source option. This tells the job to pull the SQL files directly from your remote Git repository (GitHub, Azure DevOps, GitLab, etc.) at a specific branch, tag, or commit.

Benefits:
- The job fetches the repository contents once, rather than resolving each file individually through the workspace/repos API
- Avoids the per-file workspace API rate limits
- Gives you a clean, versioned reference to your SQL files
- Only one remote Git repository can be configured per job, but all tasks share it

To configure this:
a. Open your job in the Jobs UI
b. Click "Edit" on one of your SQL file tasks
c. Change the "Source" dropdown from "Workspace" to "Git provider"
d. Configure the repository URL, branch/tag, and enter the RELATIVE path to the SQL file (e.g., folder/file.sql -- do NOT start with / or ./)
e. Repeat for all SQL tasks, or use the Jobs API / Databricks Asset Bundles to update them programmatically

Documentation: https://docs.databricks.com/en/jobs/sql.html
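
If you want to flip many tasks at once rather than clicking through the UI, the change can also be scripted against the Jobs API 2.1. Here is a minimal sketch -- the host, token, job ID, repo URL, and the path-stripping rule are all placeholders you would adapt to your setup:

import os
import requests

# Placeholders -- substitute your workspace URL, token, job ID, and repo.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = os.environ["DATABRICKS_TOKEN"]
JOB_ID = 123456789
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the job's current settings.
resp = requests.get(f"{HOST}/api/2.1/jobs/get", headers=HEADERS,
                    params={"job_id": JOB_ID})
resp.raise_for_status()
settings = resp.json()["settings"]

# Point the job at the remote repository.
settings["git_source"] = {
    "git_url": "https://github.com/your-org/your-repo",  # placeholder
    "git_provider": "gitHub",
    "git_branch": "main",
}

# Rewrite each SQL file task to use the Git source with a RELATIVE path.
for task in settings.get("tasks", []):
    file_ref = task.get("sql_task", {}).get("file")
    if file_ref:
        file_ref["source"] = "GIT"
        # Illustrative transform: strip the Git-folder prefix so that
        # /Workspace/Repos/me/my-repo/folder/file.sql -> folder/file.sql
        file_ref["path"] = file_ref["path"].split("my-repo/", 1)[-1]

# jobs/reset replaces the job settings wholesale with the edited copy.
resp = requests.post(f"{HOST}/api/2.1/jobs/reset", headers=HEADERS,
                     json={"job_id": JOB_ID, "new_settings": settings})
resp.raise_for_status()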


2. Add retry policies to your tasks

If switching to Git provider source is not immediately feasible, configure automatic retries on your SQL file tasks. This way, transient fetch failures will be retried automatically without manual intervention.

To add retries:
a. Open your job and click on a task
b. Go to the task's advanced settings
c. Set "Maximum number of retries" to 1 or 2
d. Optionally set a "Minimum interval between retries" (e.g., 30 seconds) to give time for rate limits to reset

This is a pragmatic workaround that handles the intermittent nature of the issue. With a retry, the second attempt will likely succeed since there will be less contention.

Documentation: https://docs.databricks.com/en/jobs/configure-task.html
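
These retry settings are plain task-level fields in the Jobs API, so they can be applied in the same get-modify-reset loop sketched under option 1 (the field names are from Jobs API 2.1; the values are just examples):

# Reusing the settings dict fetched in the option 1 sketch:
for task in settings.get("tasks", []):
    if "sql_task" in task:
        task["max_retries"] = 2
        task["min_retry_interval_millis"] = 30_000  # 30s for limits to reset
        task["retry_on_timeout"] = True
# ...then push the change back with the same jobs/reset call as above.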


3. Stagger task execution with dependencies

If many of your 500-1000 tasks can run in smaller batches, consider organizing them into groups using task dependencies. For example, instead of launching all tasks simultaneously, create a dependency chain so that groups of 50-100 tasks start sequentially. This reduces the burst of concurrent file fetches.
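
One way to script this, again reusing the same settings dict: partition the task list into batches and make every task in a batch depend on the previous batch. This is purely illustrative -- with very large fan-in you may prefer inserting a small "barrier" task between batches instead:

# Chain batches of ~100 tasks so each batch starts only after the
# previous one finishes, limiting concurrent file fetches.
BATCH_SIZE = 100
tasks = settings.get("tasks", [])
batches = [tasks[i:i + BATCH_SIZE] for i in range(0, len(tasks), BATCH_SIZE)]

for prev_batch, batch in zip(batches, batches[1:]):
    # Every task in this batch waits on all tasks in the previous batch.
    deps = [{"task_key": t["task_key"]} for t in prev_batch]
    for task in batch:
        task["depends_on"] = deps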


4. Move SQL files out of the Git folder

If you do not need version control for these SQL files within a Git folder, consider storing them directly in a regular workspace folder (not inside a Git folder). Workspace file access outside of Git folders does not go through the Repos API and therefore is not subject to the same 10 requests/second rate limit.

You can move files to a shared workspace folder like:
/Workspace/Shared/sql-tasks/file.sql

This removes the Git folder layer from the file resolution path entirely.
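
If you want to script the copy, the Python SDK's workspace client can do it. A rough sketch, assuming a recent databricks-sdk where the workspace.download/upload convenience methods are available (both paths are placeholders):

import io
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ImportFormat, ObjectType

w = WorkspaceClient()  # reads host/token from env or ~/.databrickscfg

SRC = "/Workspace/Repos/me/my-repo/sql"  # placeholder Git folder path
DST = "/Workspace/Shared/sql-tasks"      # plain (non-Git) workspace folder

w.workspace.mkdirs(DST)
for obj in w.workspace.list(SRC):
    if obj.object_type == ObjectType.FILE and obj.path.endswith(".sql"):
        data = w.workspace.download(obj.path).read()
        name = obj.path.rsplit("/", 1)[-1]
        w.workspace.upload(f"{DST}/{name}", io.BytesIO(data),
                           format=ImportFormat.AUTO, overwrite=True)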


5. Use the Jobs API or Asset Bundles for bulk updates

Since you have 500-1000 tasks, making changes manually in the UI would be impractical. Consider using:

- Databricks REST API (Jobs API) to programmatically update task configurations: https://docs.databricks.com/api/workspace/jobs
- Databricks Asset Bundles (DABs) to define your job as code in YAML, which makes it easy to change source settings across all tasks at once: https://docs.databricks.com/en/dev-tools/bundles/index.html


WHY REPAIR RUNS SUCCEED

When you run a repair, Databricks only re-executes the failed tasks. If, say, 50 out of 1000 tasks failed due to rate limiting, the repair run only needs to fetch 50 SQL files instead of 1000. This much smaller number of concurrent fetches stays well within the rate limits, which is why repairs almost always succeed.


ADDITIONAL DOCUMENTATION

- SQL file task configuration: https://docs.databricks.com/en/jobs/sql.html
- Job task configuration and retries: https://docs.databricks.com/en/jobs/configure-task.html
- Git folders overview: https://docs.databricks.com/en/repos/
- Databricks resource limits: https://docs.databricks.com/en/resources/limits.html
- Databricks Asset Bundles: https://docs.databricks.com/en/dev-tools/bundles/index.html

I would recommend starting with option 2 (retries) as a quick fix, and then working toward option 1 (Git provider source) or option 4 (moving files out of the Git folder) as a longer-term solution.

Hope this helps!

* This reply was drafted with an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.

antgei

Solutions 1 and 2 (adding retries, and using Git instead of Workspace as the source) do not work. The SQL task will not retry when the file is not found. We tried using the Git repo instead of workspace sync, but it still fails (if anything, worse) on some tasks.

I think this should be considered a bug in Databricks: the platform should ensure task files are synced before attempting to run a task, regardless of API rate limiting. Our solution was to run the SQL-tasks job via a job_task with retries; if/when a task in the SQL job fails, the parent job will retry (at least it should 🙂).
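
For reference, the parent job is essentially just this (a Jobs API 2.1 payload expressed in Python; the child job ID is a placeholder):

parent_job = {
    "name": "sql-tasks-wrapper",
    "tasks": [{
        "task_key": "run_sql_job",
        "run_job_task": {"job_id": 123456789},  # the SQL-tasks job
        "max_retries": 2,
        "min_retry_interval_millis": 60_000,
    }],
}
# Create it once with POST /api/2.1/jobs/create, body = parent_job.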

SteveOstrowski
Databricks Employee

Hi @antgei ,

Thanks for sharing your experience and the workaround. You raise a valid point -- the platform should ensure task files are fully synced before attempting execution, regardless of API rate limiting on the backend. When a job has hundreds of SQL file tasks referencing files in a Git folder (accessed via the /Workspace/ path), the scheduler needs to fetch and validate each one, and under heavy load it can hit internal rate limits that cause transient "Failed to fetch SQL file" errors even though the files exist.

Your workaround of wrapping the SQL-tasks-job inside a parent job_task with retries is a reasonable approach for now. A few other things worth trying if you have not already:

  • Consolidate SQL files where possible to reduce the total number of file-fetch operations per job run
  • Stagger task execution using task dependencies rather than launching all SQL tasks in parallel, which can reduce the burst of file-fetch API calls
  • If using Git folders, consider copying the SQL files to a regular Workspace directory (non-Git-backed) before the job runs, since Git-backed workspace paths have additional sync overhead

That said, I agree this behavior warrants a product improvement. I will flag this internally to ensure the engineering team is aware of the pattern. If you have not already, I would also recommend opening a support ticket referencing this thread so it gets tracked formally.