<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Calls to databricks api taking more than 60 seconds to complete in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/calls-to-databricks-api-taking-more-than-60-seconds-to-complete/m-p/156364#M54408</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Since April 1st (2026) we have been having problems calling the Databricks&amp;nbsp;/api/2.2/jobs/runs/list and /api/2.0/sql/history/queries endpoints. Calls to these endpoints now sometimes take longer than 60 seconds via the Databricks Python SDK, which leads to our script being killed.&lt;/P&gt;&lt;P&gt;Specifically, we monitor our job runs and queries with Icinga2, looking for runtime issues. For the jobs, Icinga queries the Databricks Jobs API once per job (currently 48 jobs in one workspace) every 3 minutes. As far as I understand the&amp;nbsp;&lt;A href="https://docs.databricks.com/gcp/en/resources/limits#limits-api-rate-limits" target="_self"&gt;documentation&lt;/A&gt;, the rate limit for&amp;nbsp;&lt;STRONG&gt;/jobs/runs/list&lt;/STRONG&gt; is 30 calls per second per workspace, so my understanding is that we should not be running into rate limits.&lt;/P&gt;&lt;P&gt;Extending the Icinga timeout also shows that the Databricks API does not return any error; it is just that establishing a connection apparently takes a really long time sometimes. What I find curious is that the problems only started after the 1st of April, without any changes to our setup. So I was wondering whether some changes were made to the API?&lt;/P&gt;&lt;P&gt;Or could there be another explanation for this behaviour? We are running our workspaces on GCP in the region "europe-west1".&lt;/P&gt;</description>
    <pubDate>Thu, 07 May 2026 08:46:51 GMT</pubDate>
    <dc:creator>MiriamHundemer</dc:creator>
    <dc:date>2026-05-07T08:46:51Z</dc:date>
    <item>
      <title>Calls to databricks api taking more than 60 seconds to complete</title>
      <link>https://community.databricks.com/t5/data-engineering/calls-to-databricks-api-taking-more-than-60-seconds-to-complete/m-p/156364#M54408</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Since April 1st (2026) we have been having problems calling the Databricks&amp;nbsp;/api/2.2/jobs/runs/list and /api/2.0/sql/history/queries endpoints. Calls to these endpoints now sometimes take longer than 60 seconds via the Databricks Python SDK, which leads to our script being killed.&lt;/P&gt;&lt;P&gt;Specifically, we monitor our job runs and queries with Icinga2, looking for runtime issues. For the jobs, Icinga queries the Databricks Jobs API once per job (currently 48 jobs in one workspace) every 3 minutes. As far as I understand the&amp;nbsp;&lt;A href="https://docs.databricks.com/gcp/en/resources/limits#limits-api-rate-limits" target="_self"&gt;documentation&lt;/A&gt;, the rate limit for&amp;nbsp;&lt;STRONG&gt;/jobs/runs/list&lt;/STRONG&gt; is 30 calls per second per workspace, so my understanding is that we should not be running into rate limits.&lt;/P&gt;&lt;P&gt;Extending the Icinga timeout also shows that the Databricks API does not return any error; it is just that establishing a connection apparently takes a really long time sometimes. What I find curious is that the problems only started after the 1st of April, without any changes to our setup. So I was wondering whether some changes were made to the API?&lt;/P&gt;&lt;P&gt;Or could there be another explanation for this behaviour? We are running our workspaces on GCP in the region "europe-west1".&lt;/P&gt;</description>
      <pubDate>Thu, 07 May 2026 08:46:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/calls-to-databricks-api-taking-more-than-60-seconds-to-complete/m-p/156364#M54408</guid>
      <dc:creator>MiriamHundemer</dc:creator>
      <dc:date>2026-05-07T08:46:51Z</dc:date>
    </item>
    <item>
      <title>Re: Calls to databricks api taking more than 60 seconds to complete</title>
      <link>https://community.databricks.com/t5/data-engineering/calls-to-databricks-api-taking-more-than-60-seconds-to-complete/m-p/156366#M54409</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155481"&gt;@MiriamHundemer&lt;/a&gt;!&lt;/P&gt;&lt;P&gt;I don't think this is a rate-limit issue. The limit is indeed 30 req/sec per workspace, but that only tells you when throttling may kick in; it does not guarantee that every call returns within 60 seconds. Also keep in mind that the Python SDK has a default http_timeout_seconds of 60, which matches the failure boundary you are seeing, and it has retry behavior for retriable failures.&lt;/P&gt;&lt;P&gt;Why not avoid one API call per job if possible? /jobs/runs/list can list runs across all jobs when job_id is omitted, so for monitoring you can call it once per workspace with active_only=True (or with start_time_from covering the last monitoring window) and then filter by job ID.&lt;/P&gt;&lt;P&gt;The API also supports limit, page_token, active_only, completed_only, start_time_from&amp;nbsp;and start_time_to.&lt;/P&gt;&lt;P&gt;You can also set the timeout and rate limit explicitly in the SDK:&lt;/P&gt;&lt;PRE&gt;from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
    http_timeout_seconds=30,   # per-request HTTP timeout (SDK default is 60 s)
    retry_timeout_seconds=50,  # total time budget for retries of a failed call
    rate_limit=5               # client-side cap on requests per second
)&lt;/PRE&gt;&lt;P&gt;Then set the Icinga plugin timeout slightly above the SDK retry timeout; that way you avoid Icinga killing the process while the SDK is still waiting or retrying.&lt;BR /&gt;One last thing for monitoring: the system table system.lakeflow.job_run_timeline tracks job runs and their metadata, and system.query.history stores SQL warehouse and serverless query history. I think both would fit your case, but keep in mind that they are not real time, so I would still use the API for immediate active-run checks.&lt;/P&gt;</description>
      <pubDate>Thu, 07 May 2026 09:12:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/calls-to-databricks-api-taking-more-than-60-seconds-to-complete/m-p/156366#M54409</guid>
      <dc:creator>amirabedhiafi</dc:creator>
      <dc:date>2026-05-07T09:12:39Z</dc:date>
    </item>
  </channel>
</rss>

