<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Using Qwen with vLLM in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154102#M4605</link>
    <description>Databricks Community topic: Using Qwen with vLLM in Machine Learning</description>
    <pubDate>Sat, 11 Apr 2026 03:27:45 GMT</pubDate>
    <dc:creator>anuj_lathi</dc:creator>
    <dc:date>2026-04-11T03:27:45Z</dc:date>
    <item>
      <title>Using Qwen with vLLM</title>
      <link>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154081#M4603</link>
      <description>&lt;P&gt;There are many conflicts and dependency issues when trying to install vLLM and use the Qwen models (on serverless), even the v2 family.&lt;BR /&gt;I tried following this guide&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/sgc-examples/tutorials/sgc-raydata-vllm" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/machine-learning/sgc-examples/tutorials/sgc-raydata-vllm&lt;/A&gt;&lt;BR /&gt;It lists very specific module versions to install, but vLLM usually fails to inspect the Qwen model architecture. Is there an easier way to use these models (specifically the VL-Instruct ones)?&lt;/P&gt;</description>
      <pubDate>Fri, 10 Apr 2026 15:11:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154081#M4603</guid>
      <dc:creator>pfzoz</dc:creator>
      <dc:date>2026-04-10T15:11:29Z</dc:date>
    </item>
    <item>
      <title>Re: Using Qwen with vLLM</title>
      <link>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154102#M4605</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226619"&gt;@pfzoz&lt;/a&gt;&amp;nbsp;-- the "Model architectures failed to be inspected" error you are hitting is a well-known compatibility issue between vLLM, the transformers library, and the Qwen2/2.5-VL model family. The root cause is that vLLM's model registry subprocess fails when it tries to import and validate the Qwen2&lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;5&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt;VLForConditionalGeneration architecture, often due to a torch.compile conflict during inspection. Here is how to work around it depending on your use case.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Root Cause: The Version Triangle&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;The Qwen VL models require a specific alignment between &lt;/SPAN&gt;&lt;STRONG&gt;three&lt;/STRONG&gt;&lt;SPAN&gt; libraries:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Library&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Minimum Version for Qwen2.5-VL&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;vLLM&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;gt;= 0.7.2 (VL support landed via PR #12604)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;transformers&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;gt;= 4.49 (qwen2_5_vl architecture registered)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;flash-attn&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;2.7.x (matching CUDA + PyTorch)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;SPAN&gt;The Databricks serverless GPU tutorial pins &lt;/SPAN&gt;&lt;STRONG&gt;vllm==0.8.5.post1&lt;/STRONG&gt;&lt;SPAN&gt; and &lt;/SPAN&gt;&lt;STRONG&gt;transformers&amp;lt;4.54.0&lt;/STRONG&gt;&lt;SPAN&gt;, which should include Qwen2.5-VL support. However, the pinned flash-attn wheel and numpy versions can still cause import-time failures in the model registry subprocess.&lt;/SPAN&gt;&lt;/P&gt;
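&lt;P&gt;&lt;SPAN&gt;Before changing anything, it can help to confirm which versions are actually live in your session. A minimal check using only the standard library; importing each package exercises the same import path the registry subprocess uses, so import-time failures surface here too:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;import importlib&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;for mod in ("torch", "numpy", "transformers", "vllm", "flash_attn"):&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;try:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;m = importlib.import_module(mod)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print(mod, getattr(m, "__version__", "unknown"))&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;except Exception as e:&amp;nbsp; # flash_attn in particular can fail at import time&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print(mod, "import failed:", repr(e))&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;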
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Fix 1: Correct Install Order on Serverless (Recommended First Try)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;The install order matters because vLLM's setup can pull in conflicting transitive dependencies. Try this sequence in a fresh notebook:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 1: Flash attention first (pre-built wheel for the serverless runtime)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install --force-reinstall --no-cache-dir --no-deps \&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;"&lt;A href="https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl" target="_blank"&gt;https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl&lt;/A&gt;"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 2: Install vLLM with specific version&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install "vllm==0.8.5.post1"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 3: Force transformers to a compatible version AFTER vLLM&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# (vLLM may have pinned an older transformers as a dependency)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install "transformers&amp;gt;=4.49.0,&amp;lt;4.54.0"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 4: Qwen VL-specific dependencies&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install qwen-vl-utils accelerate&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 5: Other dependencies from the tutorial&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install "ray[data]&amp;gt;=2.47.1" "numpy==1.26.4" hf_transfer&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%restart_python&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key difference from the tutorial:&lt;/STRONG&gt;&lt;SPAN&gt; Install transformers &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;after&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt; vLLM to prevent vLLM from overriding it with an older version that does not recognize qwen2_5_vl.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;After restart, verify the architecture resolves:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from transformers import AutoConfig&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", trust_remote_code=True)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(config.architectures)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Should output: ['Qwen2_5_VLForConditionalGeneration']&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Fix 2: Install transformers from Source (Nuclear Option)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;If the PyPI version of transformers still does not recognize the qwen2_5_vl model type, install from the HuggingFace main branch:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install git+&lt;A href="https://github.com/huggingface/transformers" target="_blank"&gt;https://github.com/huggingface/transformers&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install qwen-vl-utils accelerate&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%restart_python&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This ensures you have the latest model registration code. The downside is that bleeding-edge transformers can introduce other breaking changes with vLLM, so pin to a known-good commit if needed:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install git+&lt;A href="https://github.com/huggingface/transformers@f3f6c86582611976e72be054675e2bf0abb5f775" target="_blank"&gt;https://github.com/huggingface/transformers@f3f6c86582611976e72be054675e2bf0abb5f775&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Fix 3: Use the Qwen Docker Image (Non-Serverless)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;If you are open to using a classic GPU cluster instead of serverless, you can use Qwen's pre-built Docker image which bundles all compatible dependencies:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;docker pull qwenllm/qwenvl:2.5-cu121&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;On a Databricks cluster with a Docker-capable runtime, set this as the container image. This sidesteps all dependency conflicts.&lt;/SPAN&gt;&lt;/P&gt;
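&lt;P&gt;&lt;SPAN&gt;One way to attach the image is through the Databricks Python SDK. The sketch below is illustrative only: the cluster name, node type, and runtime version are placeholders, Databricks Container Services must be enabled for the workspace, and custom containers are not supported on ML runtimes:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from databricks.sdk import WorkspaceClient&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from databricks.sdk.service.compute import DockerImage&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;w = WorkspaceClient()&amp;nbsp; # picks up notebook / CLI authentication&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;running = w.clusters.create(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cluster_name="qwen-vl-docker",&amp;nbsp; # placeholder name&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;spark_version="15.4.x-scala2.12",&amp;nbsp; # placeholder; must be a non-ML runtime&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;node_type_id="g5.4xlarge",&amp;nbsp; # placeholder GPU node type&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;num_workers=1,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;docker_image=DockerImage(url="qwenllm/qwenvl:2.5-cu121"),&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;).result()&amp;nbsp; # waits until the cluster reaches RUNNING&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(running.cluster_id)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;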
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Fix 4: Disable torch.compile During Model Inspection&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;In many cases the underlying crash is caused by torch.compile being triggered during model registry inspection. You can work around it by setting:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;import os&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "0"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Then initialize vLLM&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from vllm import LLM&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;model = LLM(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;model="Qwen/Qwen2.5-VL-7B-Instruct",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dtype="bfloat16",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;trust_remote_code=True,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;enforce_eager=True,&amp;nbsp; # Disables CUDA graphs which can also conflict&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;limit_mm_per_prompt={"image": 5, "video": 2},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The &lt;/SPAN&gt;&lt;STRONG&gt;enforce_eager=True&lt;/STRONG&gt;&lt;SPAN&gt; flag is important for VL models as CUDA graph capture can fail on the dynamic shapes used by vision encoders.&lt;/SPAN&gt;&lt;/P&gt;
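&lt;P&gt;&lt;SPAN&gt;Once the LLM object constructs, a single multimodal generation call looks roughly like the sketch below. It reuses the &lt;/SPAN&gt;&lt;STRONG&gt;model&lt;/STRONG&gt;&lt;SPAN&gt; object created above; the image URL is a placeholder, and the prompt is built with the processor's chat template rather than hand-written special tokens:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from transformers import AutoProcessor&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from vllm import SamplingParams&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from PIL import Image&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;import requests&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;messages = [{&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"role": "user",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"content": [&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"type": "image"},&amp;nbsp; # template inserts the vision placeholder tokens&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"type": "text", "text": "Describe this image."},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;}]&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Placeholder image URL -- swap in a real one&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;img = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;outputs = model.generate(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"prompt": prompt, "multi_modal_data": {"image": img}},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SamplingParams(max_tokens=256),&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(outputs[0].outputs[0].text)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;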
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Alternative: Skip vLLM Entirely for VL-Instruct Models&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;If your goal is inference (not high-throughput serving), you can avoid vLLM altogether and use the native transformers + Qwen pipeline, which has fewer dependency issues on serverless:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install transformers accelerate qwen-vl-utils torch&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%restart_python&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from qwen_vl_utils import process_vision_info&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;import torch&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;model = Qwen2_5_VLForConditionalGeneration.from_pretrained(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"Qwen/Qwen2.5-VL-7B-Instruct",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;torch_dtype=torch.bfloat16,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;device_map="auto",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;messages = [&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"role": "user",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"content": [&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"type": "image", "image": "&lt;A href="https://example.com/image.jpg" target="_blank"&gt;https://example.com/image.jpg&lt;/A&gt;"},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"type": "text", "text": "Describe this image."},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;]&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;image_inputs, video_inputs = process_vision_info(messages)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;inputs = processor(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;text=[text],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;images=image_inputs,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;videos=video_inputs,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;padding=True,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return_tensors="pt",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;).to(model.device)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;output_ids = model.generate(**inputs, max_new_tokens=512)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;output_text = processor.batch_decode(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;output_ids[:, inputs.input_ids.shape[1]:],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;skip_special_tokens=True,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(output_text[0])&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This approach works reliably on both serverless and classic GPU clusters. The tradeoff is lower throughput compared to vLLM's paged attention and continuous batching.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Summary&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Approach&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Complexity&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Best For&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Fix 1: Correct install order&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Low&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;First thing to try on serverless&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Fix 2: transformers from source&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Medium&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;When PyPI transformers is too old&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Fix 3: Docker image&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Low&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Classic GPU clusters (non-serverless)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Fix 4: Disable torch.compile&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Low&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Workaround for registry inspection crash&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Alternative: Native transformers&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Low&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Single-request inference, fewer dependencies&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;STRONG&gt;My recommendation:&lt;/STRONG&gt;&lt;SPAN&gt; Start with &lt;/SPAN&gt;&lt;STRONG&gt;Fix 1&lt;/STRONG&gt;&lt;SPAN&gt; (correct install order) combined with &lt;/SPAN&gt;&lt;STRONG&gt;Fix 4&lt;/STRONG&gt;&lt;SPAN&gt; (enforce_eager + disable torch.compile). If you are doing batch inference rather than online serving, the &lt;/SPAN&gt;&lt;STRONG&gt;native transformers approach&lt;/STRONG&gt;&lt;SPAN&gt; is the path of least resistance on Databricks serverless today.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;For text-only Qwen models (non-VL), the official Databricks tutorial should work out of the box since the standard Qwen architectures have been supported in vLLM for much longer.&lt;/SPAN&gt;&lt;/P&gt;
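&lt;P&gt;&lt;SPAN&gt;For example, a text-only Qwen model typically comes up with the defaults; the model name and prompt below are only illustrative:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from vllm import LLM, SamplingParams&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="bfloat16")&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;out = llm.chat(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SamplingParams(max_tokens=64),&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(out[0].outputs[0].text)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;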
&lt;P&gt;&lt;SPAN&gt;Hope this helps! Let me know if you hit any other specific errors and I can help debug further.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;References&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://github.com/vllm-project/vllm/issues/12932" target="_blank"&gt;&lt;SPAN&gt;vLLM Issue #12932 - Qwen2.5-VL architecture inspection failure (fixed)&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://github.com/vllm-project/vllm/issues/12697" target="_blank"&gt;&lt;SPAN&gt;vLLM Issue #12697 - Transformers compatibility with Qwen2.5-VL&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://github.com/huggingface/transformers/issues/36292" target="_blank"&gt;&lt;SPAN&gt;HuggingFace transformers Issue #36292 - qwen2_5_vl not recognized&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://github.com/vllm-project/vllm/pull/12604" target="_blank"&gt;&lt;SPAN&gt;vLLM PR #12604 - Qwen2.5-VL support added&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://deepwiki.com/QwenLM/Qwen2.5-VL/5.3-vllm-deployment" target="_blank"&gt;&lt;SPAN&gt;Qwen2.5-VL vLLM Deployment Guide (DeepWiki)&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/sgc-examples/tutorials/sgc-raydata-vllm" target="_blank"&gt;&lt;SPAN&gt;Databricks Tutorial: Distributed inference with Ray Data + vLLM&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Sat, 11 Apr 2026 03:27:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154102#M4605</guid>
      <dc:creator>anuj_lathi</dc:creator>
      <dc:date>2026-04-11T03:27:45Z</dc:date>
    </item>
  </channel>
</rss>

