Greetings @itssb, I did some digging and here is what I found:
What you are seeing is a Databricks-imposed rate limit of 0, and that setting takes precedence over the endpoint- or user-level rate limits you configured in the UI. In other words, even if you set non-zero QPM or TPM values in Serving or AI Gateway, those settings will not override this restriction.
This is expected behavior for certain high-demand hosted models, including GPT-5.x and some Claude variants, when used from trial or Free Edition workspaces. In those cases, the workspace is often placed in a TRIAL_VERIFIED trust tier, which can block or heavily restrict access to premium models regardless of the limits shown in the UI.
The key point is this: the "rate limit of 0" error is not something that can be fixed by adjusting endpoint settings. It reflects a workspace-level access restriction for that model.
The path forward is one of the following:
- Upgrade the workspace: once it is moved to a PAYABLE_VERIFIED tier, this Databricks-set rate limit of 0 typically disappears, and the same endpoint will often begin working without any additional UI changes.
- In the meantime, the practical workaround is to use open-source or otherwise non-gated models, such as Llama, which are not subject to this specific Databricks-imposed 0-rate-limit restriction.
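If you want your application to degrade gracefully while the workspace is still in the restricted tier, one option is a simple fallback: try the gated model first, and on the 0-rate-limit error switch to an open model. The sketch below illustrates the pattern only; the endpoint names, the `query` function, and the error class are all hypothetical stand-ins, not actual Databricks APIs (in real code you would catch the 429 error raised by whichever client you use against your serving endpoint).

```python
# Hypothetical sketch of a fallback pattern. Endpoint names, the
# RateLimitZeroError class, and query() are illustrative stand-ins,
# not real Databricks APIs.

GATED = "databricks-gpt-5"  # assumed name of a gated endpoint
FALLBACK = "databricks-llama-endpoint"  # assumed non-gated open model

class RateLimitZeroError(Exception):
    """Stand-in for the rate-limit-of-0 error described above."""

def query(endpoint: str, prompt: str) -> str:
    # Placeholder for a real serving call (e.g. an OpenAI-compatible
    # client pointed at your workspace's /serving-endpoints URL).
    # Here we simulate the gated endpoint rejecting every request.
    if endpoint == GATED:
        raise RateLimitZeroError("rate limit of 0 for this model")
    return f"[{endpoint}] response to: {prompt}"

def query_with_fallback(prompt: str) -> str:
    try:
        return query(GATED, prompt)
    except RateLimitZeroError:
        # The restriction is workspace-level, so retrying the gated
        # endpoint will not help; switch to the open model instead.
        return query(FALLBACK, prompt)

print(query_with_fallback("hello"))
```

The point of the structure is that retrying or backing off is useless against a hard 0-rate limit, so the only runtime response that makes sense is routing to a model that is not gated.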
Hope this helps, Louis.