Provisioned Throughput is not supported for llama...
10-24-2025 07:10 AM
Hi,
This question is closely related to another discussion: Model deprecation issue while serving on Databrick... - Databricks Community - 131968
In a nutshell, I'm trying to serve a model that is based on the Llama architecture (deployed through MLflow transformers-model logging), but it isn't Llama itself (at least not with the same parameter count).
The following error is thrown while trying to create a provisioned throughput endpoint:
“Provisioned Throughput is not supported for llama with 7b parameters and 32768 context length- please reach out to support@databricks.com !”
Is there a way to mitigate this error?
It looks like the serving layer is making an incorrect assumption from the model's metadata.
model ref: speakleash/Bielik-4.5B-v3.0-Instruct · Hugging Face
Thanks in advance.
11-06-2025 10:46 AM
Greetings @c4ndy, thanks for sharing the context and links. I couldn't read the Databricks Community thread at the URL you provided; please verify the URL and access settings, as it may not yet be available to Glean or may require additional permissions. If it was recently created, it might not be indexed yet.
What’s happening
When you create a Provisioned Throughput (PT) endpoint, the serving layer reads the model's `config.json` (architecture, parameter count, `max_position_embeddings`) and maps the model onto a known supported configuration. Bielik-4.5B-v3.0-Instruct declares a Llama architecture with a 32768-token context window, so PT appears to classify it as "llama with 7b parameters and 32768 context length", exactly the combination named in the error.
Why PT rejects this configuration
PT supports only specific combinations of architecture, parameter count, and context length, and Llama at 7B with a 32k context is not in that supported matrix, so endpoint creation is rejected up front rather than failing at inference time.
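If you want to confirm what PT is inferring, a quick check of the published config shows the fields involved. This is a sketch; the exact values printed (e.g., `LlamaForCausalLM`) are what the error message implies, not something verified here.

```python
# Sketch: inspect the fields PT's metadata inference appears to read,
# based on the error quoted above. Values in comments are implied by
# the error message, not verified here.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("speakleash/Bielik-4.5B-v3.0-Instruct")
print(cfg.model_type)               # expected: "llama"
print(cfg.architectures)            # e.g. ["LlamaForCausalLM"]
print(cfg.max_position_embeddings)  # 32768 -> the rejected context length
```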
Mitigations you can try
- Reduce the configured context length in the model's HF/MLflow metadata to a PT-supported sequence length (for example, 16k or 8k), then re-log and re-deploy under PT; see the sketch after this list. In an internal case, lowering the context length allowed endpoint creation to succeed, and in Databricks' Qwen-as-Llama guidance they explicitly set `max_position_embeddings` to 16000 and adjusted `config.json` to Llama-compatible sequence lengths so PT would accept the model.
- Ensure your HF config and MLflow logging reflect a Llama-compatible architecture only to the extent necessary for PT acceptance. The Qwen guide shows the pattern: align the tokenizer and `config.json` fields with Llama-serving expectations (architectures, sequence length), and avoid settings that imply unsupported variants (like 7B + 32k) if you don't strictly need them for serving.
- If you must keep a 32k context, consider serving via a custom model endpoint (serverless GPU) by not logging the model as an LLM for PT (for example, remove PT-specific task metadata). Note, however, that this path is not recommended for LLMs due to performance limitations and the lack of PT optimizations, per internal engineering guidance; use it only if long context is a hard requirement and PT constraints cannot be relaxed.
- Alternatively, consider a PT-supported family that natively supports long contexts. Llama 3.1 models available through Databricks Foundation Model APIs have 128k context windows and are PT-optimized; for workloads needing long contexts and predictable performance, they're a good fit if switching models is feasible for your use case.
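As a concrete illustration of the first mitigation, here is a minimal sketch that loads the model's config with the Hugging Face `transformers` library, lowers `max_position_embeddings`, and saves an adjusted copy for re-logging. The target value of 16000 mirrors the Qwen-as-Llama guidance; whether PT accepts exactly that value for this model is an assumption to verify, and the local path is just an example.

```python
# Sketch: lower the declared context length before re-logging for PT.
# Assumes the model id from the question; the 16000 target follows the
# Qwen-as-Llama guidance and should be checked against PT's limits.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "speakleash/Bielik-4.5B-v3.0-Instruct"

config = AutoConfig.from_pretrained(model_id)
print("before:", config.model_type, config.max_position_embeddings)

# Lower the context length that PT reads from config.json.
config.max_position_embeddings = 16000

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save locally; config.json now advertises the reduced sequence length.
local_dir = "/tmp/bielik-pt-compatible"  # example path
model.save_pretrained(local_dir)
tokenizer.save_pretrained(local_dir)
```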
Practical next steps
- Inspect your HF config (`config.json`) for:
  - `architectures`: should be Llama-compatible only if you intend to use PT with the Llama serving runtime.
  - `max_position_embeddings`: set it to a supported value (e.g., 16000 or 8192) if PT acceptance is the goal. This reduces the maximum context but aligns with PT's supported bounds.
- Re-log with MLflow Transformers and re-serve (see the sketch after this list):
  - Keep standard logging, but avoid PT-only flags that suggest unsupported combinations.
  - If you choose custom GPU serving (non-PT), remove PT-specific LLM task metadata and serve as a custom model. Expect lower performance relative to PT.
- If long context is essential, validate whether moving to a Databricks-hosted foundation model with long context (e.g., Llama 3.1) would satisfy requirements while retaining PT's scalability and reliability.
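To make the re-logging step concrete, below is a minimal sketch using `mlflow.transformers.log_model` on the adjusted model saved by the earlier snippet. The `task` value, registered model name, and local path are illustrative assumptions, not settings confirmed in this thread; consult the PT documentation for the exact metadata your workspace expects.

```python
# Sketch: re-log the adjusted model with MLflow Transformers.
# Assumes the model saved at /tmp/bielik-pt-compatible by the earlier
# snippet; task and registered model name are illustrative only.
import mlflow
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

local_dir = "/tmp/bielik-pt-compatible"

text_gen = pipeline(
    "text-generation",
    model=AutoModelForCausalLM.from_pretrained(local_dir),
    tokenizer=AutoTokenizer.from_pretrained(local_dir),
)

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=text_gen,
        artifact_path="model",
        # Assumption: PT-style task metadata; omit or drop this field
        # if you instead serve as a custom (non-PT) GPU model.
        task="llm/v1/chat",
        registered_model_name="bielik_4_5b_pt_compatible",  # hypothetical
    )
```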