Hello @MiriamHundemer !
I don't think this is a rate-limit issue: the limit is indeed 30 req/sec per workspace, but that only tells you when throttling may kick in; it doesn't guarantee that every call returns within 60 seconds. Also don't forget that the Python SDK has a default http_timeout_seconds of 60, which matches the failure boundary you are seeing, and it also retries failures it considers safe to retry.
Why don't you try to avoid one API call per job if possible? /jobs/runs/list can list runs across all jobs when job_id is omitted, so for monitoring you can call it once per workspace with active_only=True, or with start_time_from set to the last monitoring window, and then filter by job ID locally.
The API also supports limit, page_token, active_only, completed_only, start_time_from and start_time_to.
You can also set the timeout and rate limit explicitly in the SDK:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient(
    http_timeout_seconds=30,
    retry_timeout_seconds=50,
    rate_limit=5
)
Then set the Icinga plugin timeout slightly above the SDK retry timeout; that way you avoid Icinga killing the process while the SDK is still waiting or retrying.
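On the Icinga side, the check itself would follow the usual plugin contract (exit codes 0/1/2). A sketch, where the warning/critical thresholds on the active-run count are hypothetical examples:

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def plugin_state(active_runs: int, warn: int, crit: int) -> tuple[int, str]:
    """Map an active-run count to an Icinga state.

    Thresholds are example values; tune warn/crit to your jobs.
    """
    if active_runs >= crit:
        return CRITICAL, f"CRITICAL - {active_runs} active runs"
    if active_runs >= warn:
        return WARNING, f"WARNING - {active_runs} active runs"
    return OK, f"OK - {active_runs} active runs"
```

With retry_timeout_seconds=50 in the SDK, an Icinga check_timeout of around 60 seconds keeps the ordering right: the SDK, not Icinga, decides when to give up.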
One last thing on monitoring: system.lakeflow.job_run_timeline tracks job runs and their metadata, and system.query.history stores SQL warehouse and serverless query history. I think they fit your case well, but don't forget that they are not real time, so I would still use the API for immediate active-run checks.
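As an example of the system-table route, a sketch via the SQL Statement Execution API; the warehouse_id and the 1-hour window are placeholders you would adapt:

```python
# SQL against the jobs system table; adapt the window and filters to your case.
JOB_RUNS_SQL = """
SELECT workspace_id, job_id, run_id, result_state, period_start_time
FROM system.lakeflow.job_run_timeline
WHERE period_start_time > current_timestamp() - INTERVAL 1 HOUR
"""


def fetch_recent_job_runs(warehouse_id: str):
    """Run the system-table query on a SQL warehouse (sketch)."""
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()
    resp = w.statement_execution.execute_statement(
        warehouse_id=warehouse_id,  # placeholder: your SQL warehouse ID
        statement=JOB_RUNS_SQL,
    )
    return resp.result
```

Since the system tables lag behind real time, this works best for trend dashboards and after-the-fact auditing, with the /jobs/runs/list call covering the live active-run check.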
If this answer resolves your question, could you please mark it as "Accept as Solution"? It will help other users quickly find the correct fix.
Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP