<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Using Qwen with vLLM in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154102#M4605</link>
    <description>Databricks Community topic: Using Qwen with vLLM in Machine Learning</description>
    <pubDate>Sat, 11 Apr 2026 03:27:45 GMT</pubDate>
    <dc:creator>anuj_lathi</dc:creator>
    <dc:date>2026-04-11T03:27:45Z</dc:date>
    <item>
      <title>Using Qwen with vLLM</title>
      <link>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154081#M4603</link>
      <description>&lt;P&gt;There are many conflicts and dependency issues when trying to install vLLM and use the Qwen models (on serverless), even the v2 family.&lt;BR /&gt;I tried following this guide&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/sgc-examples/tutorials/sgc-raydata-vllm" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/machine-learning/sgc-examples/tutorials/sgc-raydata-vllm&lt;/A&gt;&lt;BR /&gt;It lists very specific module versions to install, but vLLM usually fails to inspect the Qwen model architecture. Is there an easier way to use these models (specifically the VL-Instruct ones)?&lt;/P&gt;</description>
      <pubDate>Fri, 10 Apr 2026 15:11:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154081#M4603</guid>
      <dc:creator>pfzoz</dc:creator>
      <dc:date>2026-04-10T15:11:29Z</dc:date>
    </item>
    <item>
      <title>Re: Using Qwen with vLLM</title>
      <link>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154102#M4605</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226619"&gt;@pfzoz&lt;/a&gt;&amp;nbsp;-- the "Model architectures failed to be inspected" error you are hitting is a well-known compatibility issue between vLLM, the transformers library, and the Qwen2/2.5-VL model family. The root cause is that vLLM's model registry subprocess fails when it tries to import and validate the Qwen2&lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;5&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt;VLForConditionalGeneration architecture, often due to a torch.compile conflict during inspection. Here is how to work around it depending on your use case.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Root Cause: The Version Triangle&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;The Qwen VL models require a specific alignment between &lt;/SPAN&gt;&lt;STRONG&gt;three&lt;/STRONG&gt;&lt;SPAN&gt; libraries:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Library&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Minimum Version for Qwen2.5-VL&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;vLLM&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;gt;= 0.7.2 (VL support landed via PR #12604)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;transformers&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;gt;= 4.49 (qwen2_5_vl architecture registered)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;flash-attn&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;2.7.x (matching CUDA + PyTorch)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;SPAN&gt;The Databricks serverless GPU tutorial pins &lt;/SPAN&gt;&lt;STRONG&gt;vllm==0.8.5.post1&lt;/STRONG&gt;&lt;SPAN&gt; and &lt;/SPAN&gt;&lt;STRONG&gt;transformers&amp;lt;4.54.0&lt;/STRONG&gt;&lt;SPAN&gt;, which should include Qwen2.5-VL support. However, the pinned flash-attn wheel and numpy versions can still cause import-time failures in the model registry subprocess.&lt;/SPAN&gt;&lt;/P&gt;
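&lt;P&gt;&lt;SPAN&gt;Before changing anything, it can help to confirm which versions are actually live in your session. A minimal check using only the standard library; importing each package exercises the same import path the registry subprocess uses, so import-time failures surface here too:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;import importlib&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;for mod in ("torch", "numpy", "transformers", "vllm", "flash_attn"):&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;try:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;m = importlib.import_module(mod)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print(mod, getattr(m, "__version__", "unknown"))&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;except Exception as e:&amp;nbsp; # flash_attn in particular can fail at import time&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print(mod, "import failed:", repr(e))&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;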
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Fix 1: Correct Install Order on Serverless (Recommended First Try)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;The install order matters because vLLM's setup can pull in conflicting transitive dependencies. Try this sequence in a fresh notebook:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 1: Flash attention first (pre-built wheel for the serverless runtime)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install --force-reinstall --no-cache-dir --no-deps \&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;"&lt;A href="https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl" target="_blank"&gt;https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl&lt;/A&gt;"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 2: Install vLLM with specific version&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install "vllm==0.8.5.post1"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 3: Force transformers to a compatible version AFTER vLLM&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# (vLLM may have pinned an older transformers as a dependency)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install "transformers&amp;gt;=4.49.0,&amp;lt;4.54.0"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 4: Qwen VL-specific dependencies&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install qwen-vl-utils accelerate&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Step 5: Other dependencies from the tutorial&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install "ray[data]&amp;gt;=2.47.1" "numpy==1.26.4" hf_transfer&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%restart_python&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key difference from the tutorial:&lt;/STRONG&gt;&lt;SPAN&gt; Install transformers &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;after&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt; vLLM to prevent vLLM from overriding it with an older version that does not recognize qwen2_5_vl.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;After restart, verify the architecture resolves:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from transformers import AutoConfig&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", trust_remote_code=True)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(config.architectures)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Should output: ['Qwen2_5_VLForConditionalGeneration']&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Fix 2: Install transformers from Source (Nuclear Option)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;If the PyPI version of transformers still does not recognize the qwen2_5_vl model type, install from the HuggingFace main branch:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install git+&lt;A href="https://github.com/huggingface/transformers" target="_blank"&gt;https://github.com/huggingface/transformers&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install qwen-vl-utils accelerate&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%restart_python&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This ensures you have the latest model registration code. The downside is that bleeding-edge transformers can introduce other breaking changes with vLLM, so pin to a known-good commit if needed:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install git+&lt;A href="https://github.com/huggingface/transformers@f3f6c86582611976e72be054675e2bf0abb5f775" target="_blank"&gt;https://github.com/huggingface/transformers@f3f6c86582611976e72be054675e2bf0abb5f775&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Fix 3: Use the Qwen Docker Image (Non-Serverless)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;If you are open to using a classic GPU cluster instead of serverless, you can use Qwen's pre-built Docker image which bundles all compatible dependencies:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;docker pull qwenllm/qwenvl:2.5-cu121&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;On a Databricks cluster with a Docker-capable runtime, set this as the container image. This sidesteps all dependency conflicts.&lt;/SPAN&gt;&lt;/P&gt;
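&lt;P&gt;&lt;SPAN&gt;One way to attach the image is through the Databricks Python SDK. The sketch below is illustrative only: the cluster name, node type, and runtime version are placeholders, Databricks Container Services must be enabled for the workspace, and custom containers are not supported on ML runtimes:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from databricks.sdk import WorkspaceClient&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from databricks.sdk.service.compute import DockerImage&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;w = WorkspaceClient()&amp;nbsp; # picks up notebook / CLI authentication&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;running = w.clusters.create(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cluster_name="qwen-vl-docker",&amp;nbsp; # placeholder name&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;spark_version="15.4.x-scala2.12",&amp;nbsp; # placeholder; must be a non-ML runtime&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;node_type_id="g5.4xlarge",&amp;nbsp; # placeholder GPU node type&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;num_workers=1,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;docker_image=DockerImage(url="qwenllm/qwenvl:2.5-cu121"),&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;).result()&amp;nbsp; # waits until the cluster reaches RUNNING&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(running.cluster_id)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;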
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Fix 4: Disable torch.compile During Model Inspection&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;In many cases the underlying crash is caused by torch.compile being triggered during model registry inspection. You can work around it by setting:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;import os&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "0"&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Then initialize vLLM&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from vllm import LLM&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;model = LLM(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;model="Qwen/Qwen2.5-VL-7B-Instruct",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dtype="bfloat16",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;trust_remote_code=True,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;enforce_eager=True,&amp;nbsp; # Disables CUDA graphs which can also conflict&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;limit_mm_per_prompt={"image": 5, "video": 2},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The &lt;/SPAN&gt;&lt;STRONG&gt;enforce_eager=True&lt;/STRONG&gt;&lt;SPAN&gt; flag is important for VL models as CUDA graph capture can fail on the dynamic shapes used by vision encoders.&lt;/SPAN&gt;&lt;/P&gt;
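&lt;P&gt;&lt;SPAN&gt;Once the LLM object constructs, a single multimodal generation call looks roughly like the sketch below. It reuses the &lt;/SPAN&gt;&lt;STRONG&gt;model&lt;/STRONG&gt;&lt;SPAN&gt; object created above; the image URL is a placeholder, and the prompt is built with the processor's chat template rather than hand-written special tokens:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from transformers import AutoProcessor&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from vllm import SamplingParams&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from PIL import Image&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;import requests&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;messages = [{&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"role": "user",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"content": [&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"type": "image"},&amp;nbsp; # template inserts the vision placeholder tokens&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"type": "text", "text": "Describe this image."},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;}]&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;# Placeholder image URL -- swap in a real one&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;img = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;outputs = model.generate(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"prompt": prompt, "multi_modal_data": {"image": img}},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SamplingParams(max_tokens=256),&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(outputs[0].outputs[0].text)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;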
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Alternative: Skip vLLM Entirely for VL-Instruct Models&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;If your goal is inference (not high-throughput serving), you can avoid vLLM altogether and use the native transformers + Qwen pipeline, which has fewer dependency issues on serverless:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%pip install transformers accelerate qwen-vl-utils torch&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;%restart_python&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from qwen_vl_utils import process_vision_info&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;import torch&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;model = Qwen2_5_VLForConditionalGeneration.from_pretrained(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"Qwen/Qwen2.5-VL-7B-Instruct",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;torch_dtype=torch.bfloat16,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;device_map="auto",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;messages = [&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"role": "user",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"content": [&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"type": "image", "image": "&lt;A href="https://example.com/image.jpg" target="_blank"&gt;https://example.com/image.jpg&lt;/A&gt;"},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{"type": "text", "text": "Describe this image."},&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;]&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;image_inputs, video_inputs = process_vision_info(messages)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;inputs = processor(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;text=[text],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;images=image_inputs,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;videos=video_inputs,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;padding=True,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return_tensors="pt",&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;).to(model.device)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;output_ids = model.generate(**inputs, max_new_tokens=512)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;output_text = processor.batch_decode(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;output_ids[:, inputs.input_ids.shape[1]:],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;skip_special_tokens=True,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(output_text[0])&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This approach works reliably on both serverless and classic GPU clusters. The tradeoff is lower throughput compared to vLLM's paged attention and continuous batching.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Summary&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Approach&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Complexity&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Best For&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Fix 1: Correct install order&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Low&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;First thing to try on serverless&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Fix 2: transformers from source&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Medium&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;When PyPI transformers is too old&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Fix 3: Docker image&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Low&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Classic GPU clusters (non-serverless)&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Fix 4: Disable torch.compile&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Low&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Workaround for registry inspection crash&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Alternative: Native transformers&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Low&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Single-request inference, fewer dependencies&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;STRONG&gt;My recommendation:&lt;/STRONG&gt;&lt;SPAN&gt; Start with &lt;/SPAN&gt;&lt;STRONG&gt;Fix 1&lt;/STRONG&gt;&lt;SPAN&gt; (correct install order) combined with &lt;/SPAN&gt;&lt;STRONG&gt;Fix 4&lt;/STRONG&gt;&lt;SPAN&gt; (enforce_eager + disable torch.compile). If you are doing batch inference rather than online serving, the &lt;/SPAN&gt;&lt;STRONG&gt;native transformers approach&lt;/STRONG&gt;&lt;SPAN&gt; is the path of least resistance on Databricks serverless today.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;For text-only Qwen models (non-VL), the official Databricks tutorial should work out of the box since the standard Qwen architectures have been supported in vLLM for much longer.&lt;/SPAN&gt;&lt;/P&gt;
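&lt;P&gt;&lt;SPAN&gt;For example, a text-only Qwen model typically comes up with the defaults; the model name and prompt below are only illustrative:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;from vllm import LLM, SamplingParams&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="bfloat16")&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;out = llm.chat(&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SamplingParams(max_tokens=64),&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;print(out[0].outputs[0].text)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;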
&lt;P&gt;&lt;SPAN&gt;Hope this helps! Let me know if you hit any other specific errors and I can help debug further.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;———&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;References&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://github.com/vllm-project/vllm/issues/12932" target="_blank"&gt;&lt;SPAN&gt;vLLM Issue #12932 - Qwen2.5-VL architecture inspection failure (fixed)&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://github.com/vllm-project/vllm/issues/12697" target="_blank"&gt;&lt;SPAN&gt;vLLM Issue #12697 - Transformers compatibility with Qwen2.5-VL&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://github.com/huggingface/transformers/issues/36292" target="_blank"&gt;&lt;SPAN&gt;HuggingFace transformers Issue #36292 - qwen2_5_vl not recognized&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://github.com/vllm-project/vllm/pull/12604" target="_blank"&gt;&lt;SPAN&gt;vLLM PR #12604 - Qwen2.5-VL support added&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://deepwiki.com/QwenLM/Qwen2.5-VL/5.3-vllm-deployment" target="_blank"&gt;&lt;SPAN&gt;Qwen2.5-VL vLLM Deployment Guide (DeepWiki)&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/sgc-examples/tutorials/sgc-raydata-vllm" target="_blank"&gt;&lt;SPAN&gt;Databricks Tutorial: Distributed inference with Ray Data + vLLM&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Sat, 11 Apr 2026 03:27:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/using-qwen-with-vllm/m-p/154102#M4605</guid>
      <dc:creator>anuj_lathi</dc:creator>
      <dc:date>2026-04-11T03:27:45Z</dc:date>
    </item>
  </channel>
</rss>

