Greetings @c4ndy, thanks for sharing the context and links. I couldn't read the Databricks Community thread from the URL you provided; please verify the URL and access settings, as it may not yet be available to Glean or may require additional permissions. If it was recently created, it might not be indexed yet.
What's happening
The error string indicates the Provisioned Throughput validator is inferring your model as a Llama-family variant with a specific combination of parameters and context window that isn't currently supported: "llama with 7B and 32768 context length." This kind of rejection is consistent with known behavior when the model's metadata suggests an unsupported architecture/size/context combo for PT. A similar case documented internally showed that reducing the model's configured context length (e.g., to 4k) allowed serving to proceed, which confirms the validator's sensitivity to the context-length setting rather than to the actual parameter count of the model.
Your Bielik reference model is a 4.6–5B-parameter, LLaMA-like decoder-only model fine-tuned for Polish. The model card does not explicitly state a 32k context, but it is described as LLaMA/Mistral-like and HF lists it at roughly 5B parameters, which suggests the PT validator may be misclassifying the architecture/size based on metadata or the default config rather than your intended ~4.5B profile.
Why PT rejects this configuration
Provisioned Throughput (PT) is optimized for select foundation model families and for specific runtime configurations (size and context window) that Databricks has tuned and supports. When a custom model is logged as a "Llama"-type model for PT, the validator checks metadata such as the architecture and max position embeddings; if the combination falls outside supported ranges, endpoint creation fails. This is consistent with the current model-family support matrix and PT design (supported families include Llama, Mistral, Mixtral, and MPT; PT is recommended for production with performance guarantees).
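To see what the validator is likely reading, you can dump the relevant fields from the model's HF config. This is a minimal sketch; the local path is a placeholder, and the exact checks PT performs are internal to Databricks, so treat the printed fields as what your model advertises rather than as the validator itself:

```python
from transformers import AutoConfig

# Placeholder path: point this at your local checkout or HF repo id for the model.
model_path = "/local_disk0/bielik-4.5b-v3.0-instruct"

config = AutoConfig.from_pretrained(model_path)
print(config.architectures)                    # e.g. ["LlamaForCausalLM"] -> treated as the Llama family
print(config.max_position_embeddings)          # 32768 here would match the "32768 context length" in the error
print(getattr(config, "rope_scaling", None))   # long-context rope scaling can also raise the effective context
```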
Mitigations you can try
- Reduce the configured context length in the model's HF/MLflow metadata to a PT-supported sequence length (for example, 16k or 8k), then re-log and re-deploy under PT. In an internal case, lowering the context length allowed endpoint creation to succeed; in Databricks' Qwen-as-Llama guidance, they explicitly set max_position_embeddings to 16000 and adjusted config.json to Llama-compatible sequence lengths so PT would accept the model (see the sketch after this list).
- Ensure your HF config and MLflow logging reflect a Llama-compatible architecture only to the extent necessary for PT acceptance. The Qwen guide shows the pattern: align tokenizer and config.json fields to the Llama-serving expectations (architectures, sequence length), and avoid settings that imply unsupported variants (like 7B + 32k) if you don't strictly need them for serving.
- If you must keep a 32k context, consider serving via a custom model endpoint (serverless GPU) by not logging the model as an LLM for PT (for example, remove PT-specific task metadata). Note, however, that this path is not recommended for LLMs due to performance limitations and lack of PT optimizations, per internal engineering guidance; use it only if long context is a hard requirement and PT constraints cannot be relaxed.
- Alternatively, consider using a PT-supported family that natively supports long contexts. Llama 3.1 models supported by Databricks Foundation Model APIs have 128k context windows and are PT-optimized; for workloads needing long contexts and predictable performance, they're a good fit if switching models is feasible for your use case.
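For the first mitigation, a sketch of the config edit could look like the following. The local path and the target value (16000 vs. 8192) are assumptions; use whatever sequence length PT accepts in your workspace, and only set a Llama architecture if the checkpoint really is Llama-compatible:

```python
import json
from pathlib import Path

# Assumed local snapshot of the Bielik checkpoint; adjust the path to your setup.
model_dir = Path("/local_disk0/bielik-4.5b-v3.0-instruct")
config_path = model_dir / "config.json"

cfg = json.loads(config_path.read_text())
cfg["max_position_embeddings"] = 16000       # down from 32768, mirroring the Qwen-as-Llama guidance
cfg.pop("rope_scaling", None)                # drop long-context rope scaling if present
cfg["architectures"] = ["LlamaForCausalLM"]  # only if the weights are genuinely Llama-compatible

config_path.write_text(json.dumps(cfg, indent=2))
```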
Practical next steps
- Inspect your HF config (config.json) for:
  - architectures: should be Llama-compatible only if you intend to use PT with the Llama serving runtime.
  - max_position_embeddings: set to a supported value (e.g., 16000 or 8192) if PT acceptance is the goal. This reduces the maximum context but aligns with PT's supported bounds.
- Re-log with MLflow Transformers and re-serve (a logging sketch follows this list):
  - Keep standard logging, but avoid PT-only flags that suggest unsupported combinations.
  - If you choose custom GPU serving (non-PT), remove the PT-specific LLM task metadata and serve as a custom model. Expect lower performance relative to PT.
- If long context is essential, validate whether moving to a Databricks-hosted foundation model with long context (e.g., Llama 3.1) would satisfy requirements while retaining PT's scalability and reliability.
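If you re-log after the config change, a sketch along these lines keeps the LLM task for the PT path; the registry name, task value, and paths are assumptions rather than verified settings for your workspace, and dropping the PT-specific metadata corresponds to the custom-serving route mentioned above:

```python
import mlflow
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same patched snapshot as in the earlier sketch.
model_dir = "/local_disk0/bielik-4.5b-v3.0-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        task="llm/v1/chat",                # keep the LLM task if you want the PT path
        metadata={"task": "llm/v1/chat"},  # omit PT-specific metadata if you opt for custom GPU serving instead
        registered_model_name="main.default.bielik_4_5b_instruct",  # assumed Unity Catalog location
    )
```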
Notes on the Bielik model
The HF page lists Bielik-4.5B-v3.0-Instruct as a generative text model of around 4.6–5B parameters, LLaMA/Mistral-like, and using a ChatML prompt format. If your local config or quantization introduced a 32k rope scaling or max position embeddings, PT's validator could be inferring the unsupported 7B+32k profile from those fields even though the parameter count is ~4.5–5B.
Hope this helps, Louis.