Hello @MiriamHundemer !
I don't think this is a rate-limit issue: the limit is indeed 30 req/sec per workspace, but that only tells you when throttling may kick in; it doesn't guarantee that every call returns within 60 seconds. Also don't forget that the Python SDK has a default http_timeout_seconds of 60, which matches the failure boundary you are seeing, and it also retries failures it considers safe to retry.
Why don't you try to avoid one API call per job if possible? /jobs/runs/list can list runs across all jobs when job_id is omitted, so for monitoring you can call it once per workspace with active_only=True, or with start_time_from set to the last monitoring window, and then filter by job ID locally.
The API also supports limit, page_token, active_only, completed_only, start_time_from and start_time_to.
You can also set the timeout and rate limit explicitly in the SDK:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient(
    http_timeout_seconds=30,
    retry_timeout_seconds=50,
    rate_limit=5
)
Then set the Icinga plugin timeout slightly above the SDK retry timeout; that way you avoid Icinga killing the process while the SDK is still waiting or retrying.
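On the Icinga side, the check itself would follow the usual plugin contract (exit codes 0/1/2). A sketch, where the warning/critical thresholds on the active-run count are hypothetical examples:

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def plugin_state(active_runs: int, warn: int, crit: int) -> tuple[int, str]:
    """Map an active-run count to an Icinga state.

    Thresholds are example values; tune warn/crit to your jobs.
    """
    if active_runs >= crit:
        return CRITICAL, f"CRITICAL - {active_runs} active runs"
    if active_runs >= warn:
        return WARNING, f"WARNING - {active_runs} active runs"
    return OK, f"OK - {active_runs} active runs"
```

With retry_timeout_seconds=50 in the SDK, an Icinga check_timeout of around 60 seconds keeps the ordering right: the SDK, not Icinga, decides when to give up.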
One last thing on monitoring: system.lakeflow.job_run_timeline tracks job runs and their metadata, and system.query.history stores SQL warehouse and serverless query history. I think they fit your case well, but don't forget that they are not real time, so I would still use the API for immediate active-run checks.
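As an example of the system-table route, a sketch via the SQL Statement Execution API; the warehouse_id and the 1-hour window are placeholders you would adapt:

```python
# SQL against the jobs system table; adapt the window and filters to your case.
JOB_RUNS_SQL = """
SELECT workspace_id, job_id, run_id, result_state, period_start_time
FROM system.lakeflow.job_run_timeline
WHERE period_start_time > current_timestamp() - INTERVAL 1 HOUR
"""


def fetch_recent_job_runs(warehouse_id: str):
    """Run the system-table query on a SQL warehouse (sketch)."""
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()
    resp = w.statement_execution.execute_statement(
        warehouse_id=warehouse_id,  # placeholder: your SQL warehouse ID
        statement=JOB_RUNS_SQL,
    )
    return resp.result
```

Since the system tables lag behind real time, this works best for trend dashboards and after-the-fact auditing, with the /jobs/runs/list call covering the live active-run check.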
If this answer resolves your question, could you please mark it as "Accept as Solution"? It will help other users quickly find the correct fix.
Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP