Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Using Qwen with vLLM

pfzoz
New Contributor

There are many conflicting dependency issues when trying to install vLLM and use the Qwen models on serverless, even the v2 families.
I tried following this guide: https://docs.databricks.com/aws/en/machine-learning/sgc-examples/tutorials/sgc-raydata-vllm
It lists very specific module versions to install, but vLLM usually fails to inspect the Qwen model architecture. Is there an easier way to use these models (specifically the VL-Instruct ones)?

1 REPLY

anuj_lathi
Databricks Employee

Hi @pfzoz -- the "Model architectures failed to be inspected" error you are hitting is a well-known compatibility issue between vLLM, the transformers library, and the Qwen2/2.5-VL model family. The root cause is that vLLM's model registry subprocess fails when it tries to import and validate the Qwen25VLForConditionalGeneration architecture, often due to a torch.compile conflict during inspection. Here is how to work around it depending on your use case.

———

Root Cause: The Version Triangle

The Qwen VL models require a specific alignment between three libraries:

| Library | Minimum Version for Qwen2.5-VL |
| --- | --- |
| vLLM | >= 0.7.2 (VL support landed via PR #12604) |
| transformers | >= 4.49 (qwen2_5_vl architecture registered) |
| flash-attn | 2.7.x (matching CUDA + PyTorch) |

The Databricks serverless GPU tutorial pins vllm==0.8.5.post1 and transformers<4.54.0, which should include Qwen2.5-VL support. However, the pinned flash-attn wheel and numpy versions can still cause import-time failures in the model registry subprocess.
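Before changing anything, it helps to confirm which versions of the triangle actually ended up in your environment. A minimal stdlib sketch (the `report_versions` helper is my own, not part of any Databricks or vLLM tooling):

```python
# Quick diagnostic: report installed versions of the three libraries,
# using only the standard library (importlib.metadata).
from importlib.metadata import version, PackageNotFoundError

def report_versions(packages):
    """Return {package: installed version, or 'not installed'}."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = version(pkg)
        except PackageNotFoundError:
            out[pkg] = "not installed"
    return out

print(report_versions(["vllm", "transformers", "flash-attn"]))
```

Run this right after `%restart_python`; if any of the three shows an unexpected version, a later `%pip install` has silently overridden an earlier pin.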

———

Fix 1: Correct Install Order on Serverless (Recommended First Try)

The install order matters because vLLM's setup can pull in conflicting transitive dependencies. Try this sequence in a fresh notebook:

```
# Step 1: Flash attention first (pre-built wheel for the serverless runtime)
%pip install --force-reinstall --no-cache-dir --no-deps \
  "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+c..."

# Step 2: Install vLLM with a specific version
%pip install "vllm==0.8.5.post1"

# Step 3: Force transformers to a compatible version AFTER vLLM
# (vLLM may have pinned an older transformers as a dependency)
%pip install "transformers>=4.49.0,<4.54.0"

# Step 4: Qwen VL-specific dependencies
%pip install qwen-vl-utils accelerate

# Step 5: Other dependencies from the tutorial
%pip install "ray[data]>=2.47.1" "numpy==1.26.4" hf_transfer

%restart_python
```

 

Key difference from the tutorial: Install transformers after vLLM to prevent vLLM from overriding it with an older version that does not recognize qwen2_5_vl.
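After the restart you can also check that the transformers version landed inside the required window. A hedged stdlib sketch (`parse_version` here is a deliberately simplified stand-in for a real version parser; it handles only plain numeric versions):

```python
def parse_version(v):
    """Simplified version parser: keeps only leading numeric fields."""
    parts = []
    for field in v.split("."):
        digits = "".join(ch for ch in field if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def in_range(v, lo="4.49.0", hi="4.54.0"):
    """True if version v falls inside the [lo, hi) window from Step 3."""
    return parse_version(lo) <= parse_version(v) < parse_version(hi)

assert in_range("4.49.0") and not in_range("4.54.0")
# On a live cluster:
#   from importlib.metadata import version
#   assert in_range(version("transformers"))
```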

After restart, verify the architecture resolves:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", trust_remote_code=True)
print(config.architectures)
# Should output: ['Qwen2_5_VLForConditionalGeneration']
```

 

———

Fix 2: Install transformers from Source (Nuclear Option)

If the PyPI version of transformers still does not recognize the qwen2_5_vl model type, install from the HuggingFace main branch:

```
%pip install git+https://github.com/huggingface/transformers
%pip install qwen-vl-utils accelerate
%restart_python
```

 

This ensures you have the latest model registration code. The downside is that bleeding-edge transformers can introduce other breaking changes with vLLM, so pin to a known-good commit if needed:

```
%pip install git+https://github.com/huggingface/transformers@f3f6c86582611976e72be054675e2bf0abb5f775
```

 

———

Fix 3: Use the Qwen Docker Image (Non-Serverless)

If you are open to using a classic GPU cluster instead of serverless, you can use Qwen's pre-built Docker image which bundles all compatible dependencies:

```
docker pull qwenllm/qwenvl:2.5-cu121
```

 

On a Databricks cluster with a Docker-capable runtime, set this as the container image. This sidesteps all dependency conflicts.
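On classic clusters the image is supplied through Databricks Container Services in the cluster spec. A sketch of the relevant fragment of the Clusters API JSON (the cluster name and node type here are placeholders, not recommendations):

```json
{
  "cluster_name": "qwen-vl-gpu",
  "node_type_id": "g5.4xlarge",
  "docker_image": {
    "url": "qwenllm/qwenvl:2.5-cu121"
  }
}
```

Note that Container Services must be enabled on the workspace, and the chosen Databricks Runtime must support custom containers.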

———

Fix 4: Disable torch.compile During Model Inspection

The underlying crash in many cases is torch.compile being triggered during model registry inspection. You can work around it by setting:

```python
import os
os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "0"

# Then initialize vLLM
from vllm import LLM

model = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    dtype="bfloat16",
    trust_remote_code=True,
    enforce_eager=True,  # Disables CUDA graphs, which can also conflict
    limit_mm_per_prompt={"image": 5, "video": 2},
)
```

 

The enforce_eager=True flag is important for VL models as CUDA graph capture can fail on the dynamic shapes used by vision encoders.

———

Alternative: Skip vLLM Entirely for VL-Instruct Models

If your goal is inference (not high-throughput serving), you can avoid vLLM altogether and use the native transformers + Qwen pipeline, which has fewer dependency issues on serverless:

```
%pip install transformers accelerate qwen-vl-utils torch
%restart_python
```

 

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
output_text = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(output_text[0])
```

 

This approach works reliably on both serverless and classic GPU clusters. The tradeoff is lower throughput compared to vLLM's paged attention and continuous batching.
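You can claw back some of that throughput by batching several prompts per generate call instead of looping one at a time. A minimal, framework-agnostic sketch of just the chunking step (the `chunked` helper is my own, stdlib only):

```python
def chunked(items, batch_size):
    """Split a list into consecutive micro-batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

prompts = ["p1", "p2", "p3", "p4", "p5"]
for batch in chunked(prompts, 2):
    # On a live cluster: build `messages` for each prompt in the batch,
    # call `processor(...)` with padding=True on the whole batch, then
    # call `model.generate(...)` once per batch instead of once per prompt.
    print(batch)
```

Batching amortizes the per-call overhead, though it will not match vLLM's continuous batching for mixed-length workloads.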

———

Summary

 

| Approach | Complexity | Best For |
| --- | --- | --- |
| Fix 1: Correct install order | Low | First thing to try on serverless |
| Fix 2: transformers from source | Medium | When PyPI transformers is too old |
| Fix 3: Docker image | Low | Classic GPU clusters (non-serverless) |
| Fix 4: Disable torch.compile | Low | Workaround for registry inspection crash |
| Alternative: Native transformers | Low | Single-request inference, fewer dependencies |

My recommendation: Start with Fix 1 (correct install order) combined with Fix 4 (enforce_eager + disable torch.compile). If you are doing batch inference rather than online serving, the native transformers approach is the path of least resistance on Databricks serverless today.

For text-only Qwen models (non-VL), the official Databricks tutorial should work out of the box since the standard Qwen architectures have been supported in vLLM for much longer.

Hope this helps! Let me know if you hit any other specific errors and I can help debug further.

———


Anuj Lathi
Solutions Engineer @ Databricks