
ai_query not affected by AI gateway's rate limits?

PiotrM
New Contributor III

Hey, 

We've been testing ai_query (on Azure Databricks) against preconfigured model serving endpoints like databricks-meta-llama-3-3-70b-instruct, and the initial results look nice.
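For context, here's roughly the kind of call we've been testing (table and column names below are just illustrative, not our real data):

```
SELECT
  review_id,
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',
    'Classify the sentiment of this review as positive/negative/neutral: ' || review_text
  ) AS sentiment
FROM main.demo.reviews;
```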
 
I'm trying to limit the number of requests that can be sent to those endpoints, so that cloud spend won't spiral out of control.
 
The AI Gateway seems to have the capability to limit tokens/queries per minute, which would be exactly what we're looking for, but it doesn't seem to affect ai_query calls to the endpoint, even though it successfully limits requests made via the REST API.
 
Is this the intended behavior? If so, are there any other options to properly limit ai_query usage, apart from monitoring it through system tables/logs?
 
Best regards, 
Piotr

 

 


7 REPLIES

BS_THE_ANALYST
Esteemed Contributor II

Hey @PiotrM,

Firstly, have you checked the docs out for Managing Model Serving Endpoints? 
https://learn.microsoft.com/en-us/azure/databricks/machine-learning/model-serving/manage-serving-end... 

I just had a read through. You can certainly set up budgets to monitor them, which can help prevent costs spiralling! 🙂 I appreciate you've already mentioned the system tables.

This article seems really promising: https://docs.databricks.com/aws/en/ai-gateway/configure-ai-gateway-endpoints 👀🙂... (I'm certain we've got to be onto a winner with this)


If that doesn't quite cut the mustard, we could also look at the actual token usage per user. Perhaps this can be throttled somehow 🤔.
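Something like this could give you that per-user view (a rough sketch - the column names assume the system.serving.endpoint_usage schema, so double-check against your workspace):

```
-- Daily token consumption per requester across all serving endpoints
SELECT
  requester,
  DATE(request_time) AS usage_date,
  SUM(input_token_count + output_token_count) AS total_tokens
FROM system.serving.endpoint_usage
GROUP BY requester, DATE(request_time)
ORDER BY usage_date DESC, total_tokens DESC;
```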

All the best,
BS

szymon_dybczak
Esteemed Contributor III

Hi @PiotrM, @BS_THE_ANALYST,

I guess that's the whole problem here. @PiotrM correctly identified and configured the right tool for the job - AI Gateway.
My guess is that ai_query internally uses some shortcut to communicate with the endpoint. That could explain why the rate limit works when you call the endpoint directly, but doesn't when you use ai_query.

PiotrM
New Contributor III

Hey, 

@BS_THE_ANALYST, before writing that post, I went through exactly the docs you've posted. I wasn't able to find specific confirmation (or denial) that this function would be affected by the rate limits, which led me to believe it was worth a shot.

@szymon_dybczak Thank you. My guess exactly. On Azure it's still in Public Preview, so maybe it'll be added in the future.

BR, 

Piotr

szymon_dybczak
Esteemed Contributor III

Yep, let's wait for a Databricks employee to join the discussion. Maybe they will shed some light on why it's not working as expected. You did everything correctly on your side. If the endpoint accessed via ai_query is not subject to the API rate limit, it should be clearly stated in the documentation.

jamesl

Hey guys,

@PiotrM AI Gateway does not currently enforce rate limiting on ai_query batch inference workloads; it only provides usage tracking. This is called out in the limitations section of the docs.

For cost control, you could restrict permissions on the endpoint and/or set up system table monitoring or SQL alerts with something like:
```
-- Per-user usage for one endpoint; join to served_entities to resolve the endpoint name
SELECT
  eu.requester,
  se.endpoint_name,
  SUM(eu.input_token_count + eu.output_token_count) AS total_tokens,
  COUNT(*) AS total_requests,
  MIN(eu.request_time) AS first_request,
  MAX(eu.request_time) AS last_request
FROM system.serving.endpoint_usage AS eu
JOIN system.serving.served_entities AS se
  ON eu.served_entity_id = se.served_entity_id
WHERE se.endpoint_name = '<your_endpoint_name>'
  AND eu.request_time >= CURRENT_DATE() -- adjust time window as needed
GROUP BY eu.requester, se.endpoint_name
ORDER BY total_tokens DESC;
```
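If per-call spend is also a concern, you may be able to cap the output size in the request itself - a sketch, assuming your runtime's ai_query supports the optional modelParameters argument, with placeholder endpoint and table names:

```
-- Cap generated tokens per call to bound the cost of each request
SELECT ai_query(
  '<your_endpoint_name>',
  prompt,
  modelParameters => named_struct('max_tokens', 256, 'temperature', 0)
) AS response
FROM <your_catalog>.<your_schema>.<your_prompts_table>;
```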

I hope this helps. If this and the other replies resolve the issue for you, please use the "Accept as Solution" button to let us know!

-James

szymon_dybczak
Esteemed Contributor III

Hi @jamesl ,

Thanks for clarifying our doubts - that's exactly what we were looking for. Maybe it would be a good idea to add a small note about this to the AI Gateway documentation?

PiotrM
New Contributor III

Hi @jamesl,

Thank you very much. This resolves my question. That specific sentence in the AI Gateway docs may have gone over my head, but it's clear now.

BR, 

Piotr
