Greetings @c4ndy, thanks for sharing the context and links. I couldn't read the Databricks Community thread from the URL you provided; please verify the URL and access settings, as it may not yet be available to Glean or may require additional permissions. If it was recently created, it might not be indexed yet.
What's happening
The error string indicates the Provisioned Throughput validator is inferring your model as a Llama-family variant with a specific combination of parameters and context window that isn't currently supported: "llama with 7B and 32768 context length." This kind of rejection is consistent with known behavior when the model's metadata suggests an unsupported architecture/size/context combo for PT. A similar case documented internally showed that reducing the model's configured context length (e.g., to 4k) allowed serving to proceed, which confirms the validator's sensitivity to the context-length setting rather than to the actual parameter count of the model.
Your Bielik reference model is a 4.6–5B-parameter, LLaMA-like decoder-only model fine-tuned for Polish. The model card does not explicitly state a 32k context, but it is described as LLaMA/Mistral-like and HF lists it at roughly 5B parameters, which suggests the PT validator may be misclassifying the architecture/size based on metadata or the default config rather than your intended ~4.5B profile.
Why PT rejects this configuration
Provisioned Throughput (PT) is optimized for select foundation model families and for specific runtime configurations (size and context window) that Databricks has tuned and supports. When a custom model is logged as a "Llama"-type model for PT, the validator checks metadata such as the architecture and max position embeddings; if the combination falls outside supported ranges, endpoint creation fails. This is consistent with the current model-family support matrix and PT design (supported families include Llama, Mistral, Mixtral, and MPT; PT is recommended for production with performance guarantees).
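To see what the validator is likely reading, you can dump the relevant fields from the model's HF config. This is a minimal sketch; the local path is a placeholder, and the exact checks PT performs are internal to Databricks, so treat the printed fields as what your model advertises rather than as the validator itself:

```python
from transformers import AutoConfig

# Placeholder path: point this at your local checkout or HF repo id for the model.
model_path = "/local_disk0/bielik-4.5b-v3.0-instruct"

config = AutoConfig.from_pretrained(model_path)
print(config.architectures)                    # e.g. ["LlamaForCausalLM"] -> treated as the Llama family
print(config.max_position_embeddings)          # 32768 here would match the "32768 context length" in the error
print(getattr(config, "rope_scaling", None))   # long-context rope scaling can also raise the effective context
```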
Mitigations you can try
- Reduce the configured context length in the model's HF/MLflow metadata to a PT-supported sequence length (for example, 16k or 8k), then re-log and re-deploy under PT. In an internal case, lowering the context length allowed endpoint creation to succeed; in Databricks' Qwen-as-Llama guidance, they explicitly set max_position_embeddings to 16000 and adjusted config.json to Llama-compatible sequence lengths so PT would accept the model (see the sketch after this list).
- Ensure your HF config and MLflow logging reflect a Llama-compatible architecture only to the extent necessary for PT acceptance. The Qwen guide shows the pattern: align tokenizer and config.json fields to the Llama-serving expectations (architectures, sequence length), and avoid settings that imply unsupported variants (like 7B + 32k) if you don't strictly need them for serving.
- If you must keep a 32k context, consider serving via a custom model endpoint (serverless GPU) by not logging the model as an LLM for PT (for example, remove PT-specific task metadata). Note, however, that this path is not recommended for LLMs due to performance limitations and lack of PT optimizations, per internal engineering guidance; use it only if long context is a hard requirement and PT constraints cannot be relaxed.
- Alternatively, consider using a PT-supported family that natively supports long contexts. Llama 3.1 models supported by Databricks Foundation Model APIs have 128k context windows and are PT-optimized; for workloads needing long contexts and predictable performance, they're a good fit if switching models is feasible for your use case.
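For the first mitigation, a sketch of the config edit could look like the following. The local path and the target value (16000 vs. 8192) are assumptions; use whatever sequence length PT accepts in your workspace, and only set a Llama architecture if the checkpoint really is Llama-compatible:

```python
import json
from pathlib import Path

# Assumed local snapshot of the Bielik checkpoint; adjust the path to your setup.
model_dir = Path("/local_disk0/bielik-4.5b-v3.0-instruct")
config_path = model_dir / "config.json"

cfg = json.loads(config_path.read_text())
cfg["max_position_embeddings"] = 16000       # down from 32768, mirroring the Qwen-as-Llama guidance
cfg.pop("rope_scaling", None)                # drop long-context rope scaling if present
cfg["architectures"] = ["LlamaForCausalLM"]  # only if the weights are genuinely Llama-compatible

config_path.write_text(json.dumps(cfg, indent=2))
```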
Practical next steps
- Inspect your HF config (config.json) for:
  - architectures: should be Llama-compatible only if you intend to use PT with the Llama serving runtime.
  - max_position_embeddings: set to a supported value (e.g., 16000 or 8192) if PT acceptance is the goal. This reduces the maximum context but aligns with PT's supported bounds.
- Re-log with MLflow Transformers and re-serve (a logging sketch follows this list):
  - Keep standard logging, but avoid PT-only flags that suggest unsupported combinations.
  - If you choose custom GPU serving (non-PT), remove the PT-specific LLM task metadata and serve as a custom model. Expect lower performance relative to PT.
- If long context is essential, validate whether moving to a Databricks-hosted foundation model with long context (e.g., Llama 3.1) would satisfy requirements while retaining PT's scalability and reliability.
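If you re-log after the config change, a sketch along these lines keeps the LLM task for the PT path; the registry name, task value, and paths are assumptions rather than verified settings for your workspace, and dropping the PT-specific metadata corresponds to the custom-serving route mentioned above:

```python
import mlflow
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same patched snapshot as in the earlier sketch.
model_dir = "/local_disk0/bielik-4.5b-v3.0-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        task="llm/v1/chat",                # keep the LLM task if you want the PT path
        metadata={"task": "llm/v1/chat"},  # omit PT-specific metadata if you opt for custom GPU serving instead
        registered_model_name="main.default.bielik_4_5b_instruct",  # assumed Unity Catalog location
    )
```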
Notes on the Bielik model
The HF page lists Bielik-4.5B-v3.0-Instruct as a generative text model of around 4.6–5B parameters, LLaMA/Mistral-like, and using a ChatML prompt format. If your local config or quantization introduced a 32k rope scaling or max position embeddings, PT's validator could be inferring the unsupported 7B+32k profile from those fields even though the parameter count is ~4.5–5B.
Hope this helps, Louis.