topic Re: ai_query not affected by AI gateway's rate limits? in Generative AI

ai_query not affected by AI gateway's rate limits?

PiotrM — Wed, 08 Oct 2025 17:58:40 GMT

Hey,

We've been testing ai_query (on Azure Databricks) against preconfigured model serving endpoints such as databricks-meta-llama-3-3-70b-instruct, and the initial results look good.

I'm trying to limit the number of requests that could be sent to those endpoints, so the cloud spend won't spiral out of control.

AI Gateway can limit tokens or queries per minute, which is exactly what we're looking for. However, the limits don't seem to affect ai_query calls to the endpoint, even though they successfully limit requests sent through the REST API.

Is this the intended behavior? If so, are there other ways to limit ai_query usage, beyond monitoring it through system tables and logs?

Best regards,

Piotr

Re: ai_query not affected by AI gateway's rate limits?

BS_THE_ANALYST — Wed, 08 Oct 2025 19:59:26 GMT

Hey @PiotrM,

First, have you checked out the docs on managing model serving endpoints?
https://learn.microsoft.com/en-us/azure/databricks/machine-learning/model-serving/manage-serving-endpoints

I just had a read through. You can certainly set up budgets to monitor them, which can help prevent costs from spiralling 🙂. I appreciate you've already mentioned the system tables.

This article seems really promising: https://docs.databricks.com/aws/en/ai-gateway/configure-ai-gateway-endpoints 👀🙂... (I'm certain we've got to be onto a winner with this)

If that doesn't quite cut the mustard, perhaps we could also look at the actual token usage per user. Perhaps this can be throttled somehow 🤔.

All the best,
BS

Re: ai_query not affected by AI gateway's rate limits?

szymon_dybczak — Thu, 09 Oct 2025 06:02:50 GMT

Hi @PiotrM , @BS_THE_ANALYST ,

I guess that's the whole problem here. @PiotrM correctly identified and configured the right tool for his goal: AI Gateway.
My guess is that the ai_query function internally uses some shortcut to communicate with the endpoint. That could explain why the rate limit works when you call the endpoint directly, but doesn't when you use ai_query.

Re: ai_query not affected by AI gateway's rate limits?

PiotrM — Thu, 09 Oct 2025 06:57:15 GMT

Hey,

@BS_THE_ANALYST, before writing that post, I went through exactly the docs you posted. I couldn't find explicit confirmation (or denial) that this function is affected by the rate limits, so I figured it was worth a shot.

@szymon_dybczak Thank you. My guess exactly. On Azure it's still in Public Preview so maybe it'll be added in the future.

BR,

Piotr

Re: ai_query not affected by AI gateway's rate limits?

szymon_dybczak — Thu, 09 Oct 2025 07:04:09 GMT

Yep, let's wait for a Databricks employee to join the discussion. Maybe they will shed some light on why it's not working as expected. You did everything correctly on your side. If the endpoint accessed via ai_query is not subject to the API rate limit, it should be clearly stated in the documentation.

Re: ai_query not affected by AI gateway's rate limits?

jamesl — Thu, 09 Oct 2025 15:53:05 GMT

Hey guys,

@PiotrM AI Gateway does not currently enforce rate limiting on ai_query batch inference workloads; it only provides usage tracking for them. This is called out in the limitations section of the docs.

For cost control, you could restrict permissions on the endpoint and/or set up system table monitoring or SQL alerts with something like:
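```
-- Token usage per requester for one endpoint, today so far.
-- Column names follow the documented system.serving schemas; verify them in your workspace.
SELECT
  u.requester,
  e.endpoint_name,
  SUM(COALESCE(u.input_token_count, 0) + COALESCE(u.output_token_count, 0)) AS total_tokens,
  COUNT(*) AS total_requests,
  MIN(u.request_time) AS first_request,
  MAX(u.request_time) AS last_request
FROM system.serving.endpoint_usage AS u
JOIN (
  -- endpoint_usage carries no endpoint name, so resolve it via served_entities;
  -- DISTINCT guards against several rows per served entity
  SELECT DISTINCT served_entity_id, endpoint_name
  FROM system.serving.served_entities
) AS e
  ON u.served_entity_id = e.served_entity_id
WHERE e.endpoint_name = '<your_endpoint_name>'
  AND u.request_time >= CURRENT_DATE() -- adjust time window as needed
GROUP BY u.requester, e.endpoint_name
ORDER BY total_tokens DESC;
```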
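
If you'd rather act on those numbers in a notebook or job than in a SQL alert, the check itself is small. Below is a minimal Python sketch, assuming you have already fetched rows shaped like the per-user aggregation above (a user identifier, an endpoint name, and a token total). `UsageRow`, `over_budget`, and `daily_token_budget` are hypothetical names for illustration, not part of any Databricks API, and the budget value is whatever you choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRow:
    """One aggregated usage row (hypothetical shape, not a Databricks type)."""
    user: str
    endpoint_name: str
    total_tokens: int


def over_budget(rows, daily_token_budget):
    """Return (user, endpoint_name, overage) for rows above the budget.

    Results are sorted with the largest overage first.
    """
    flagged = [
        (r.user, r.endpoint_name, r.total_tokens - daily_token_budget)
        for r in rows
        if r.total_tokens > daily_token_budget
    ]
    return sorted(flagged, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # Made-up sample data, only to show the call shape.
    sample = [
        UsageRow("alice@example.com", "my-endpoint", 1_500),
        UsageRow("bob@example.com", "my-endpoint", 200),
    ]
    for user, endpoint, overage in over_budget(sample, daily_token_budget=1_000):
        print(f"{user} is {overage} tokens over budget on {endpoint}")
```

This only reports overages. Actually stopping the spend would still need a separate step, such as revoking the user's permission on the endpoint as mentioned above.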

I hope this helps. If this and the other replies resolve the issue for you, please use the "Accept as Solution" button to let us know!

-James

Re: ai_query not affected by AI gateway's rate limits?

szymon_dybczak — Thu, 09 Oct 2025 16:33:30 GMT

Hi @jamesl ,

Thanks for clarifying, that's exactly what we were looking for. Maybe it would be a good idea to add a short note about this to the AI Gateway documentation?

Re: ai_query not affected by AI gateway's rate limits?

PiotrM — Thu, 09 Oct 2025 19:48:52 GMT

Hi @jamesl,

Thank you very much. This resolves my question. That specific sentence in the AI Gateway docs may have gone over my head, but it's clear now.

BR,

Piotr