
Model Serving Endpoint: Cuda-OOM for Custom Model

DaPo — Wed, 07 May 2025 13:23:29 GMT

Hello all,

I have been tasked with evaluating a new LLM for several use cases. In particular, I need to build a POC for a chatbot based on that model. To that end, I want to create a custom Serving Endpoint for an LLM pulled from Hugging Face. The model itself is based on Qwen (here is the model I need to use). So far, I have logged the model to our MLflow tracking server (with the transformers model flavor) and registered it in our Unity Catalog. Now I want to create a model serving endpoint using a 4xGPU instance. To my understanding, the GPUs together have enough memory for the model, and the libraries in use should handle this setup by distributing the model across multiple GPUs (see below). However, creation of the model endpoint fails, with a CUDA OOM error showing up in the logs. (The model needs ~30GB of memory, which is more than a single GPU provides but far less than the memory available across all four GPUs.) Note that, for a lower memory footprint, I have already saved the model with 16-bit floating-point precision.

What can I do to avoid the error?

Some additional context:

I did load the model in a notebook on a similarly sized All-Purpose Compute cluster (based on a single-node g4dn.12xlarge instance, which has 4xGPUs of type T4).
Below, you see the content of the conda.yaml file in the mlflow artifacts.
Note that, based on my experiments in the notebook, the default device_map strategy did not work for some reason and produced CUDA OOM errors. I had to specify the strategy explicitly, so I added the corresponding variable to the conda env.
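
To make the difference between the placement strategies concrete, here is a toy sketch in plain Python. It does not call accelerate. The layer sizes, the 13 GB per-GPU weight budget (a T4's 16 GB minus some assumed headroom), and the helper names `place_sequential` and `place_balanced` are illustrative assumptions. Roughly, `sequential` fills GPU 0 before moving on to GPU 1, whereas the balanced strategy that `auto` currently maps to tries to spread the weights evenly.

```python
def place_sequential(layer_gb, gpu_budget_gb, num_gpus):
    """Fill GPU 0 up to its budget, then GPU 1, and so on."""
    used = [0.0] * num_gpus
    device = 0
    for size in layer_gb:
        # Move to the next GPU once the current one cannot take this layer.
        while device < num_gpus and used[device] + size > gpu_budget_gb:
            device += 1
        if device == num_gpus:
            raise MemoryError("model does not fit into the given budget")
        used[device] += size
    return used


def place_balanced(layer_gb, gpu_budget_gb, num_gpus):
    """Aim for an equal share of the total weights on every GPU."""
    target = sum(layer_gb) / num_gpus
    used = [0.0] * num_gpus
    device = 0
    for size in layer_gb:
        # Advance once this GPU has reached its share; the last GPU takes the rest.
        if used[device] + size > target and device < num_gpus - 1:
            device += 1
        used[device] += size
    if max(used) > gpu_budget_gb:
        raise MemoryError("model does not fit into the given budget")
    return used


# Hypothetical model: 60 equally sized layers of 0.5 GB each (~30 GB in total).
layers = [0.5] * 60
print("sequential:", place_sequential(layers, gpu_budget_gb=13, num_gpus=4))
print("balanced:  ", place_balanced(layers, gpu_budget_gb=13, num_gpus=4))
```

The numbers only cover weights. Activations, the KV cache, and the CUDA context need additional memory on every GPU that holds layers, which is why the budget is set below the physical 16 GB.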

channels:
  - conda-forge 
dependencies: 
  - python=3.12.3
  - pip
  - pip:
    - mlflow==2.21.3
    - accelerate==1.5.2
    - torch==2.6.0
    - torchvision==0.21.0
    - transformers==4.51.3
name: chatts-env
variables:
  MLFLOW_HUGGINGFACE_DEVICE_MAP_STRATEGY: sequential

Re: Model Serving Endpoint: Cuda-OOM for Custom Model

sarahbhord — Wed, 07 May 2025 16:52:42 GMT

Here are some suggestions:

1. Update conda.yaml. Replace the current config with this optimized version:

channels:
  - conda-forge
dependencies:
  - python=3.10  # 3.12 may cause compatibility issues
  - pip
  - pip:
    - mlflow==2.21.3
    - torch==2.2.1  # Align with CUDA 12.1
    - transformers==4.40.0  # Latest stable for multi-GPU
    - accelerate==0.29.0  # Critical for device_map="auto"
    - bitsandbytes==0.43.0  # For 8/4-bit quantization
    - xformers==0.0.25  # Memory-efficient attention
name: chatts-env
variables:
  MLFLOW_HUGGINGFACE_DEVICE_MAP_STRATEGY: auto  # Not "sequential"

2. Model loading fixes. In your MLflow model's inference script, enforce multi-GPU distribution.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def load_model(model_path):
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",          # Distribute across all GPUs
        torch_dtype=torch.float16,  # 16-bit
        trust_remote_code=True,
        low_cpu_mem_usage=True      # Reduce CPU RAM pressure
    )
    return model
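
If `device_map="auto"` still overcommits a single GPU, a per-device cap can be passed through the `max_memory` argument of `from_pretrained`, and the resulting placement can be inspected via `model.hf_device_map`. The sketch below shows one way to do that. The 13GiB and 30GiB caps and the helper names are assumptions for illustration, not values from this thread.

```python
from collections import Counter


def build_max_memory(num_gpus, per_gpu="13GiB", cpu="30GiB"):
    """Per-device caps in the format accepted by from_pretrained(max_memory=...)."""
    caps = {gpu: per_gpu for gpu in range(num_gpus)}
    caps["cpu"] = cpu  # modules that exceed the GPU caps are offloaded to CPU RAM
    return caps


def summarize_device_map(device_map):
    """Count how many modules were placed on each device."""
    return dict(Counter(str(device) for device in device_map.values()))


def load_model_capped(model_path, num_gpus=4):
    # Imported lazily so the helpers above can be used without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        max_memory=build_max_memory(num_gpus),
        torch_dtype=torch.float16,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
    )
    # hf_device_map records which device each module ended up on.
    print(summarize_device_map(model.hf_device_map))
    return model
```

If the summary shows modules on `cpu` or `disk`, the GPU caps were too small for the model and inference will be slow, even though loading succeeded.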

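3. Serving endpoint configuration. Use this JSON payload when creating the endpoint to ensure tensor parallelism. (The # annotations are explanatory only. JSON does not allow comments, so remove them before sending the payload.)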

{
  "name": "qwen-chat-endpoint",
  "config": {
    "served_entities": [{
      "entity_name": "catalog.schema.model_name",
      "entity_version": "1",
      "workload_type": "GPU_LARGE",  # Use A100 GPUs (80GB each)
      "workload_size": "Large",  # 4xGPUs
      "task": "llm/v1/completions",
      "environment_vars": {
        "HF_HOME": "/dbfs/huggingface",
        "MAX_JOBS": "4"  # Parallelize model loading
      }
    }]
  }
}
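
One way to avoid hand-editing JSON is to build the payload as a Python dict and serialize it. The sketch below is a minimal example that covers only a subset of the fields shown above. The helper name `build_endpoint_payload` is hypothetical, and the workload values are placeholders: which type and size combinations are accepted depends on the workspace and cloud, so they are kept as parameters.

```python
import json


def build_endpoint_payload(name, entity_name, entity_version,
                           workload_type, workload_size, environment_vars=None):
    """Assemble an endpoint-creation payload as a plain dict."""
    entity = {
        "entity_name": entity_name,
        "entity_version": entity_version,
        "workload_type": workload_type,
        "workload_size": workload_size,
    }
    if environment_vars:
        entity["environment_vars"] = dict(environment_vars)
    return {"name": name, "config": {"served_entities": [entity]}}


payload = build_endpoint_payload(
    name="qwen-chat-endpoint",
    entity_name="catalog.schema.model_name",
    entity_version="1",
    workload_type="GPU_LARGE",  # placeholder; must match what the workspace supports
    workload_size="Large",      # placeholder
    environment_vars={"HF_HOME": "/dbfs/huggingface"},
)

# json.dumps yields valid JSON, with no comments to strip.
print(json.dumps(payload, indent=2))
```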

4. Other adjustments:

Avoid T4 GPUs: They only have 16GB each. Use A100 instances with 40GB/GPU.
Quantize further: Add load_in_8bit=True to your model loading code if 16-bit isn't enough.
Check layer splitting: If device_map="auto" fails, manually specify no_split_module_classes for Qwen's architecture.
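
As a rough sketch of the quantization suggestion: in recent transformers versions, 8-bit or 4-bit loading is usually requested through a `BitsAndBytesConfig` passed as `quantization_config`. The helper names below are hypothetical, and the footprint estimate covers weights only, assuming size scales linearly with bits per parameter (quantization overhead is ignored).

```python
def quantized_weights_gb(fp16_gb, bits):
    """Weights-only estimate: size scales with the bits stored per parameter."""
    return fp16_gb * bits / 16


def load_quantized(model_path, bits=8):
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    # Lazy imports: this path needs transformers, accelerate and bitsandbytes.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        load_in_8bit=(bits == 8),
        load_in_4bit=(bits == 4),
    )
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        quantization_config=config,
        trust_remote_code=True,
    )


# Estimated weight size for the ~30GB (16-bit) model from the question.
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{quantized_weights_gb(30, bits)} GB")
```

Quantizing at load time still reads the full 16-bit checkpoint first, so it reduces GPU memory but not necessarily load time.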

If the error persists, share the full CUDA OOM log to debug layer-specific memory issues.

Re: Model Serving Endpoint: Cuda-OOM for Custom Model

DaPo — Thu, 15 May 2025 11:37:05 GMT

Hi @sarahbhord ,

thanks for the feedback. So far I was able to test a few of your suggestions. Unfortunately, no success yet.
1. Thanks for the hints. I essentially used the default conda.yaml file generated by mlflow. Unfortunately, using your suggested version did not work either.
2. Did not have the time to test that version yet. But will definitely try it next week or so 🙂
3. Tried that, using the databricks-sdk for Python. Unfortunately, I got an error: databricks.sdk.errors.platform.InvalidParameterValue: Workload type 'GPU_LARGE' with size 'Large' is not supported. Please choose a node type from

4. I am aware of the T4's memory size. To clarify, my point was this: in a notebook, I could run the model on 4xT4 GPUs, so I assumed it should also work on the 4xMedium-size GPU instances provided for model serving endpoints.

Currently, I am trying 8-bit quantization, so far with little success (running into timeouts on endpoint creation instead of the OOM, though).

Anyway, thanks for the effort.
Greetings, Daniel