Re: Model Serving Endpoint: Cuda-OOM for Custom Mo...

sarahbhord · ‎05-07-2025

Here are some suggestions:

1. Update conda.yaml. Replace the current config with this pinned version:

channels:
  - conda-forge
dependencies:
  - python=3.10  # 3.12 may cause compatibility issues
  - pip
  - pip:
    - mlflow==2.21.3
    - torch==2.2.1  # Align with CUDA 12.1
    - transformers==4.40.0  # Latest stable for multi-GPU
    - accelerate==0.29.0  # Critical for device_map="auto"
    - bitsandbytes==0.43.0  # For 8/4-bit quantization
    - xformers==0.0.25  # Memory-efficient attention
name: chatts-env
variables:
  MLFLOW_HUGGINGFACE_DEVICE_MAP_STRATEGY: auto  # Not "sequential"
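
A mismatch between these pins and what actually gets installed in the serving container is easy to miss. Here is a minimal sketch for checking that, assuming you run it inside the built environment. PINS, installed_versions, and find_mismatches are names made up for this example, not part of MLflow or conda.

```python
from importlib import metadata

# Pins copied from the conda.yaml above.
PINS = {
    "mlflow": "2.21.3",
    "torch": "2.2.1",
    "transformers": "4.40.0",
    "accelerate": "0.29.0",
    "bitsandbytes": "0.43.0",
    "xformers": "0.0.25",
}


def installed_versions(names):
    """Return {package: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(pins, installed):
    """Return {package: (pinned, installed)} for every package that differs."""
    mismatches = {}
    for name, pinned in pins.items():
        actual = installed.get(name)
        # Drop a local build tag such as "+cu121" before comparing.
        base = actual.split("+")[0] if actual else None
        if base != pinned:
            mismatches[name] = (pinned, actual)
    return mismatches


if __name__ == "__main__":
    report = find_mismatches(PINS, installed_versions(PINS))
    for name, (pinned, actual) in report.items():
        print(f"{name}: pinned {pinned}, installed {actual}")
```

An empty report means every pinned package is installed at the expected version.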

2. Model loading fixes. In your MLflow model's inference script, load the model so that it is distributed across all available GPUs.

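from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def load_model(model_path):
    """Load the model in half precision, sharded across all visible GPUs."""
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",  # Let accelerate place layers across all visible GPUs
        torch_dtype=torch.float16,  # Half precision: 2 bytes per parameter
        trust_remote_code=True,  # Needed when the checkpoint ships custom modeling code
        low_cpu_mem_usage=True,  # Reduce CPU RAM pressure while loading
    )
    model.eval()  # Inference only: disables dropout
    return model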
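
To confirm the model really was spread across GPUs, it helps to look at the placement that device_map="auto" produced. Models loaded this way expose it as model.hf_device_map, a dict from module name to device. The following is a rough sketch: summarize_device_map and offloaded_modules are hypothetical helpers, and the sample map is made up for illustration.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules landed on each device.

    Devices are GPU indices (0, 1, ...) or the strings "cpu" / "disk".
    """
    return Counter(str(device) for device in device_map.values())


def offloaded_modules(device_map):
    """List modules that were placed on CPU or disk instead of a GPU."""
    return [
        name for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    ]


# Made-up example; with a real model, pass model.hf_device_map instead.
sample_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": "cpu",
}
summary = summarize_device_map(sample_map)
offloaded = offloaded_modules(sample_map)
print(dict(summary))
print(offloaded)
```

If everything sits on device 0, or modules show up under cpu or disk, the endpoint probably has fewer usable GPUs (or less GPU memory) than expected.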

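3. Serving endpoint configuration. Use a payload like the one below when creating the endpoint. The intent is to give the model enough GPUs for device_map="auto" to shard across; note that this places whole layers on different GPUs, which is not the same as tensor parallelism. GPU_LARGE is meant to select A100 GPUs (80GB each) and Large to provide 4 GPUs, but check both against the GPU workload types documented for your cloud. MAX_JOBS, as far as I know, only caps parallel build jobs for packages that compile native extensions, so do not expect it to speed up model loading. JSON does not allow comments, so these notes are kept out of the payload:

{
  "name": "qwen-chat-endpoint",
  "config": {
    "served_entities": [{
      "entity_name": "catalog.schema.model_name",
      "entity_version": "1",
      "workload_type": "GPU_LARGE",
      "workload_size": "Large",
      "task": "llm/v1/completions",
      "environment_vars": {
        "HF_HOME": "/dbfs/huggingface",
        "MAX_JOBS": "4"
      }
    }]
  }
}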
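
JSON has no comment syntax, so any # notes have to stay out of the request body itself. One way to guarantee that is to build the payload in Python and serialize it. This is only a sketch: build_endpoint_payload is a hypothetical helper, the values are the ones from the payload above, and actually sending the request to the workspace is left out.

```python
import json


def build_endpoint_payload(name, entity_name, entity_version,
                           workload_type="GPU_LARGE", workload_size="Large"):
    """Build the endpoint-creation payload as a plain dict."""
    return {
        "name": name,
        "config": {
            "served_entities": [{
                "entity_name": entity_name,
                "entity_version": entity_version,
                "workload_type": workload_type,
                "workload_size": workload_size,
                "task": "llm/v1/completions",
                "environment_vars": {
                    "HF_HOME": "/dbfs/huggingface",
                    "MAX_JOBS": "4",
                },
            }]
        },
    }


payload = build_endpoint_payload(
    "qwen-chat-endpoint", "catalog.schema.model_name", "1"
)
# json.dumps always emits strict JSON, so stray comments cannot slip in.
body = json.dumps(payload, indent=2)
print(body)
```

Keeping the annotations as Python comments next to the dict means they survive in your source without ever reaching the API.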

4. Other adjustments:

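Avoid T4 GPUs: they only have 16GB each. Use A100 instances (40GB or 80GB per GPU, depending on the instance type).
Quantize further: if 16-bit isn't enough, add load_in_8bit=True to your model loading code (newer transformers releases prefer quantization_config=BitsAndBytesConfig(load_in_8bit=True)).
Check layer splitting: if device_map="auto" fails, manually specify no_split_module_classes for Qwen's architecture. It is an argument to accelerate's infer_auto_device_map; pass the resulting map as device_map.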
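
A quick back-of-the-envelope check shows why bigger GPUs or lower precision help: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes a hypothetical 14-billion-parameter model (the thread does not state the model size) and ignores activations, KV cache, and CUDA context, so real usage will be higher. The 0.8 headroom factor is also just an assumption.

```python
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def fits_per_gpu(num_params, dtype, num_gpus, gpu_gib, headroom=0.8):
    """True if an even split of the weights fits in `headroom` of each GPU."""
    per_gpu = weight_memory_gib(num_params, dtype) / num_gpus
    return per_gpu <= gpu_gib * headroom


NUM_PARAMS = 14_000_000_000  # hypothetical size, for illustration only

for dtype in ("float16", "int8", "int4"):
    gib = weight_memory_gib(NUM_PARAMS, dtype)
    print(f"{dtype}: {gib:.1f} GiB of weights")
```

Under these assumptions, fits_per_gpu(NUM_PARAMS, "float16", 1, 16) is False for a single 16GB T4, while splitting the same weights across four GPUs or dropping to 4-bit brings the per-GPU share under the limit.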

If the error persists, share the full CUDA OOM log to debug layer-specific memory issues.