
ai_query not affected by AI gateway's rate limits?

PiotrM
New Contributor III

Hey, 

We've been testing ai_query (on Azure Databricks) against preconfigured model serving endpoints like databricks-meta-llama-3-3-70b-instruct, and the initial results look nice.
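For context, here's roughly the kind of call we've been testing (table and column names below are just illustrative, not our real data):

```
SELECT
  review_id,
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',
    'Classify the sentiment of this review as positive/negative/neutral: ' || review_text
  ) AS sentiment
FROM main.demo.reviews;
```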
 
I'm trying to limit the number of requests that can be sent to those endpoints, so that cloud spend won't spiral out of control.
 
The AI Gateway seems to have the capability to limit tokens/queries per minute, which would be exactly what we're looking for, but it doesn't seem to affect ai_query calls to the endpoint, even though it successfully limits requests made via the REST API.
 
Is this the intended behavior? If so, are there any other options to properly limit ai_query usage, apart from monitoring it through system tables/logs?
 
Best regards, 
Piotr

 

 


7 REPLIES

BS_THE_ANALYST
Esteemed Contributor II

Hey @PiotrM,

Firstly, have you checked the docs out for Managing Model Serving Endpoints? 
https://learn.microsoft.com/en-us/azure/databricks/machine-learning/model-serving/manage-serving-end... 

I just had a read through. You can certainly set up budgets to monitor them, which can help prevent costs spiralling! 🙂 I appreciate you've already mentioned the system tables.

This article seems really promising: https://docs.databricks.com/aws/en/ai-gateway/configure-ai-gateway-endpoints 👀🙂... (I'm certain we've got to be onto a winner with this)


If that doesn't quite cut the mustard, we could also look at the actual token usage per user. Perhaps this can be throttled somehow 🤔.
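Something like this could give you that per-user view (a rough sketch - the column names assume the system.serving.endpoint_usage schema, so double-check against your workspace):

```
-- Daily token consumption per requester across all serving endpoints
SELECT
  requester,
  DATE(request_time) AS usage_date,
  SUM(input_token_count + output_token_count) AS total_tokens
FROM system.serving.endpoint_usage
GROUP BY requester, DATE(request_time)
ORDER BY usage_date DESC, total_tokens DESC;
```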

All the best,
BS

szymon_dybczak
Esteemed Contributor III

Hi @PiotrM, @BS_THE_ANALYST,

I guess that's the whole problem here. @PiotrM correctly identified and configured the right tool for the job - AI Gateway.
My guess is that ai_query internally uses some shortcut to communicate with the endpoint. That could explain why the rate limit works when you call the endpoint directly, but doesn't when you use ai_query.

PiotrM
New Contributor III

Hey, 

@BS_THE_ANALYST, before writing that post, I went through exactly the docs you've posted. I wasn't able to find specific confirmation (or denial) that this function would be affected by the rate limits, which led me to believe it was worth a shot.

@szymon_dybczak Thank you. My guess exactly. On Azure it's still in Public Preview, so maybe it'll be added in the future.

BR, 

Piotr

szymon_dybczak
Esteemed Contributor III

Yep, let's wait for a Databricks employee to join the discussion. Maybe they will shed some light on why it's not working as expected. You did everything correctly on your side. If the endpoint accessed via ai_query is not subject to the API rate limit, it should be clearly stated in the documentation.

jamesl

Hey guys,

@PiotrM AI Gateway does not currently enforce rate limiting on ai_query batch inference workloads; it only provides usage tracking. This is called out in the limitations section of the docs.

For cost control, you could restrict permissions on the endpoint and/or set up system table monitoring or SQL alerts with something like:
```
-- Per-user usage for one endpoint; join to served_entities to resolve the endpoint name
SELECT
  eu.requester,
  se.endpoint_name,
  SUM(eu.input_token_count + eu.output_token_count) AS total_tokens,
  COUNT(*) AS total_requests,
  MIN(eu.request_time) AS first_request,
  MAX(eu.request_time) AS last_request
FROM system.serving.endpoint_usage AS eu
JOIN system.serving.served_entities AS se
  ON eu.served_entity_id = se.served_entity_id
WHERE se.endpoint_name = '<your_endpoint_name>'
  AND eu.request_time >= CURRENT_DATE() -- adjust time window as needed
GROUP BY eu.requester, se.endpoint_name
ORDER BY total_tokens DESC;
```
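If per-call spend is also a concern, you may be able to cap the output size in the request itself - a sketch, assuming your runtime's ai_query supports the optional modelParameters argument, with placeholder endpoint and table names:

```
-- Cap generated tokens per call to bound the cost of each request
SELECT ai_query(
  '<your_endpoint_name>',
  prompt,
  modelParameters => named_struct('max_tokens', 256, 'temperature', 0)
) AS response
FROM <your_catalog>.<your_schema>.<your_prompts_table>;
```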

I hope this helps. If this and the other replies resolve the issue for you, please use the "Accept as Solution" button to let us know!

-James

szymon_dybczak
Esteemed Contributor III

Hi @jamesl ,

Thanks for clarifying our doubts - that's exactly what we were looking for. Maybe it would be a good idea to add a small note about this to the AI Gateway documentation?

PiotrM
New Contributor III

Hi @jamesl,

Thank you very much. This resolves my question. That specific sentence in the AI Gateway docs may have gone over my head, but it's clear now.

BR, 

Piotr
