Hi @pfzoz -- the "Model architectures failed to be inspected" error you are hitting is a well-known compatibility issue between vLLM, the transformers library, and the Qwen2/2.5-VL model family. The root cause is that vLLM's model registry subprocess fails when it tries to import and validate the Qwen25VLForConditionalGeneration architecture, often due to a torch.compile conflict during inspection. Here is how to work around it depending on your use case.
———
Root Cause: The Version Triangle
The Qwen VL models require a specific alignment between three libraries:
| Library | Minimum Version for Qwen2.5-VL |
|---|---|
| vLLM | >= 0.7.2 (VL support landed via PR #12604) |
| transformers | >= 4.49 (qwen2_5_vl architecture registered) |
| flash-attn | 2.7.x (matching CUDA + PyTorch) |
The Databricks serverless GPU tutorial pins vllm==0.8.5.post1 and transformers<4.54.0, which should include Qwen2.5-VL support. However, the pinned flash-attn wheel and numpy versions can still cause import-time failures in the model registry subprocess.
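Before reinstalling anything, it can help to see which versions of the triangle you actually have. Here is a small sanity-check sketch using only the standard library (the package names are the PyPI distribution names from the table above):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return "not installed"

for pkg in ("vllm", "transformers", "flash-attn", "numpy"):
    print(f"{pkg}: {installed_version(pkg)}")
```

Comparing this output against the table tells you immediately which corner of the triangle is out of alignment.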
———
Fix 1: Correct Install Order on Serverless (Recommended First Try)
The install order matters because vLLM's setup can pull in conflicting transitive dependencies. Try this sequence in a fresh notebook:
```
# Step 1: Flash attention first (pre-built wheel for the serverless runtime)
%pip install --force-reinstall --no-cache-dir --no-deps \
  "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+c..."

# Step 2: Install vLLM at the pinned version
%pip install "vllm==0.8.5.post1"

# Step 3: Force transformers to a compatible version AFTER vLLM
# (vLLM may have pinned an older transformers as a dependency)
%pip install "transformers>=4.49.0,<4.54.0"

# Step 4: Qwen VL-specific dependencies
%pip install qwen-vl-utils accelerate

# Step 5: Other dependencies from the tutorial
%pip install "ray[data]>=2.47.1" "numpy==1.26.4" hf_transfer

%restart_python
```
Key difference from the tutorial: Install transformers after vLLM to prevent vLLM from overriding it with an older version that does not recognize the qwen2_5_vl model type.
After restart, verify the architecture resolves:
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", trust_remote_code=True)
print(config.architectures)
# Should output: ['Qwen2_5_VLForConditionalGeneration']
```
———
Fix 2: Install transformers from Source (Nuclear Option)
If the PyPI version of transformers still does not recognize the qwen2_5_vl model type, install from the HuggingFace main branch:
```
%pip install git+https://github.com/huggingface/transformers
%pip install qwen-vl-utils accelerate
%restart_python
```
This ensures you have the latest model registration code. The downside is that bleeding-edge transformers can introduce other breaking changes with vLLM, so pin to a known-good commit if needed:
```
%pip install git+https://github.com/huggingface/transformers@f3f6c86582611976e72be054675e2bf0abb5f775
```
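To confirm the new install actually exposes the class you need, a generic check like the following works without importing vLLM at all. The `has_architecture` helper is a hypothetical name for illustration; it just tests whether a class can be imported from a module:

```python
import importlib

def has_architecture(module_name: str, class_name: str) -> bool:
    """True if class_name is importable from module_name, False otherwise."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, class_name)

# After installing, this should report True:
# has_architecture("transformers", "Qwen2_5_VLForConditionalGeneration")
```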
———
Fix 3: Use the Qwen Docker Image (Non-Serverless)
If you are open to using a classic GPU cluster instead of serverless, you can use Qwen's pre-built Docker image which bundles all compatible dependencies:
```
docker pull qwenllm/qwenvl:2.5-cu121
```
On a Databricks cluster with a Docker-capable runtime, set this as the container image. This sidesteps all dependency conflicts.
———
Fix 4: Disable torch.compile During Model Inspection
The underlying crash in many cases is torch.compile being triggered during model registry inspection. You can work around it by setting:
```python
import os
os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "0"

# Then initialize vLLM
from vllm import LLM

model = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    dtype="bfloat16",
    trust_remote_code=True,
    enforce_eager=True,  # Disables CUDA graphs, which can also conflict
    limit_mm_per_prompt={"image": 5, "video": 2},
)
```
The enforce_eager=True flag is important for VL models as CUDA graph capture can fail on the dynamic shapes used by vision encoders.
———
Alternative: Skip vLLM Entirely for VL-Instruct Models
If your goal is inference (not high-throughput serving), you can avoid vLLM altogether and use the native transformers + Qwen pipeline, which has fewer dependency issues on serverless:
```
%pip install transformers accelerate qwen-vl-utils torch
%restart_python
```

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
output_text = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(output_text[0])
```
This approach works reliably on both serverless and classic GPU clusters. The tradeoff is lower throughput compared to vLLM's paged attention and continuous batching.
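If you later want multiple images per prompt, the `messages` structure extends naturally. This sketch wraps that pattern in a small helper (`build_image_messages` is a hypothetical name, not part of qwen-vl-utils); it is pure Python, so you can verify the structure without loading the model:

```python
def build_image_messages(image_urls, prompt):
    """Build a Qwen-VL chat 'messages' list: one or more image entries
    followed by a text instruction, in the same shape used above."""
    content = [{"type": "image", "image": url} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]

msgs = build_image_messages(
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "Compare these two images.",
)
```

The resulting list drops straight into `processor.apply_chat_template(...)` and `process_vision_info(...)` in the snippet above.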
———
Summary
| Approach | Complexity | Best For |
|---|---|---|
| Fix 1: Correct install order | Low | First thing to try on serverless |
| Fix 2: transformers from source | Medium | When PyPI transformers is too old |
| Fix 3: Docker image | Low | Classic GPU clusters (non-serverless) |
| Fix 4: Disable torch.compile | Low | Workaround for registry inspection crash |
| Alternative: Native transformers | Low | Single-request inference, fewer dependencies |
My recommendation: Start with Fix 1 (correct install order) combined with Fix 4 (enforce_eager + disable torch.compile). If you are doing batch inference rather than online serving, the native transformers approach is the path of least resistance on Databricks serverless today.
For text-only Qwen models (non-VL), the official Databricks tutorial should work out of the box since the standard Qwen architectures have been supported in vLLM for much longer.
Hope this helps! Let me know if you hit any other specific errors and I can help debug further.
———
Anuj Lathi
Solutions Engineer @ Databricks