<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Model Serving Endpoint: Cuda-OOM for Custom Model in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/model-serving-endpoint-cuda-oom-for-custom-model/m-p/118191#M4066</link>
    <description>&lt;P&gt;Hello all,&lt;/P&gt;
&lt;P&gt;I have been tasked with evaluating a new LLM for some use cases. In particular, I need to build a POC for a chatbot based on that model. To that end, I want to create a custom serving endpoint for an LLM pulled from Hugging Face. The model itself is based on Qwen (&lt;A href="https://huggingface.co/bytedance-research/ChatTS-14B" target="_blank" rel="noopener"&gt;here&lt;/A&gt;&amp;nbsp;is the model I need to use). So far, I have logged the model to our MLflow tracking (with the &lt;FONT face="andale mono,times"&gt;transformers&lt;/FONT&gt; model flavor) and registered it in our Unity Catalog. Now I want to create a model serving endpoint using a 4xGPU instance. To my understanding, the GPUs together have enough memory for the model, and the libraries in use should handle this setup by distributing the model across multiple GPUs (see below).&amp;nbsp;However, creation of the model endpoint fails, with a &lt;FONT face="terminal,monaco" color="#FF0000"&gt;CUDA-OOM&lt;/FONT&gt; error showing up in the logs. (The model needs ~30GB of memory, which is more than a single GPU provides but far less than the memory available across all four GPUs.) Note that, for a lower memory footprint, I have already saved the model with 16-bit floating-point precision.&lt;/P&gt;
&lt;P&gt;What can I do to avoid the error?&lt;/P&gt;
&lt;P&gt;Some additional context:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;I was able to load the model in a notebook on a similarly sized all-purpose compute cluster (a single-node &lt;EM&gt;&lt;FONT face="terminal,monaco"&gt;g4dn.12xlarge&lt;/FONT&gt;&lt;/EM&gt;&amp;nbsp;instance, which has 4 T4 GPUs).&lt;/LI&gt;
&lt;LI&gt;Below is the content of the conda.yaml file in the MLflow artifacts.&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;&lt;/EM&gt;&lt;SPAN&gt;, based on my experiments in the notebook: for some reason, the default &lt;/SPAN&gt;&lt;FONT face="andale mono,times"&gt;device_map&lt;/FONT&gt;&lt;SPAN&gt; strategy did not work and produced CUDA-OOM errors, so I had to specify the strategy explicitly. I added the &lt;/SPAN&gt;&lt;A href="https://mlflow.org/docs/latest/api_reference/python_api/mlflow.environment_variables.html#mlflow.environment_variables.MLFLOW_HUGGINGFACE_DEVICE_MAP_STRATEGY" target="_blank" rel="noopener"&gt;appropriate variable&lt;/A&gt;&lt;SPAN&gt; to the conda env.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE&gt;channels:
  - conda-forge
dependencies:
  - python=3.12.3
  - pip
  - pip:
    - mlflow==2.21.3
    - accelerate==1.5.2
    - torch==2.6.0
    - torchvision==0.21.0
    - transformers==4.51.3
name: chatts-env
variables:
  MLFLOW_HUGGINGFACE_DEVICE_MAP_STRATEGY: sequential
&lt;/PRE&gt;</description>
    <pubDate>Wed, 07 May 2025 13:23:29 GMT</pubDate>
    <dc:creator>DaPo</dc:creator>
    <dc:date>2025-05-07T13:23:29Z</dc:date>
    <item>
      <title>Model Serving Endpoint: Cuda-OOM for Custom Model</title>
      <link>https://community.databricks.com/t5/machine-learning/model-serving-endpoint-cuda-oom-for-custom-model/m-p/118191#M4066</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;
&lt;P&gt;I have been tasked with evaluating a new LLM for some use cases. In particular, I need to build a POC for a chatbot based on that model. To that end, I want to create a custom serving endpoint for an LLM pulled from Hugging Face. The model itself is based on Qwen (&lt;A href="https://huggingface.co/bytedance-research/ChatTS-14B" target="_blank" rel="noopener"&gt;here&lt;/A&gt;&amp;nbsp;is the model I need to use). So far, I have logged the model to our MLflow tracking (with the &lt;FONT face="andale mono,times"&gt;transformers&lt;/FONT&gt; model flavor) and registered it in our Unity Catalog. Now I want to create a model serving endpoint using a 4xGPU instance. To my understanding, the GPUs together have enough memory for the model, and the libraries in use should handle this setup by distributing the model across multiple GPUs (see below).&amp;nbsp;However, creation of the model endpoint fails, with a &lt;FONT face="terminal,monaco" color="#FF0000"&gt;CUDA-OOM&lt;/FONT&gt; error showing up in the logs. (The model needs ~30GB of memory, which is more than a single GPU provides but far less than the memory available across all four GPUs.) Note that, for a lower memory footprint, I have already saved the model with 16-bit floating-point precision.&lt;/P&gt;
&lt;P&gt;What can I do to avoid the error?&lt;/P&gt;
&lt;P&gt;Some additional context:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;I was able to load the model in a notebook on a similarly sized all-purpose compute cluster (a single-node &lt;EM&gt;&lt;FONT face="terminal,monaco"&gt;g4dn.12xlarge&lt;/FONT&gt;&lt;/EM&gt;&amp;nbsp;instance, which has 4 T4 GPUs).&lt;/LI&gt;
&lt;LI&gt;Below is the content of the conda.yaml file in the MLflow artifacts.&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;&lt;/EM&gt;&lt;SPAN&gt;, based on my experiments in the notebook: for some reason, the default &lt;/SPAN&gt;&lt;FONT face="andale mono,times"&gt;device_map&lt;/FONT&gt;&lt;SPAN&gt; strategy did not work and produced CUDA-OOM errors, so I had to specify the strategy explicitly. I added the &lt;/SPAN&gt;&lt;A href="https://mlflow.org/docs/latest/api_reference/python_api/mlflow.environment_variables.html#mlflow.environment_variables.MLFLOW_HUGGINGFACE_DEVICE_MAP_STRATEGY" target="_blank" rel="noopener"&gt;appropriate variable&lt;/A&gt;&lt;SPAN&gt; to the conda env.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE&gt;channels:
  - conda-forge
dependencies:
  - python=3.12.3
  - pip
  - pip:
    - mlflow==2.21.3
    - accelerate==1.5.2
    - torch==2.6.0
    - torchvision==0.21.0
    - transformers==4.51.3
name: chatts-env
variables:
  MLFLOW_HUGGINGFACE_DEVICE_MAP_STRATEGY: sequential
&lt;/PRE&gt;</description>
      <pubDate>Wed, 07 May 2025 13:23:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/model-serving-endpoint-cuda-oom-for-custom-model/m-p/118191#M4066</guid>
      <dc:creator>DaPo</dc:creator>
      <dc:date>2025-05-07T13:23:29Z</dc:date>
    </item>
    <item>
      <title>Re: Model Serving Endpoint: Cuda-OOM for Custom Model</title>
      <link>https://community.databricks.com/t5/machine-learning/model-serving-endpoint-cuda-oom-for-custom-model/m-p/118265#M4067</link>
      <description>&lt;P&gt;Here are some suggestions:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. Update conda.yaml. Replace the current config with this optimized version:&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;channels:
  - conda-forge
dependencies:
  - python=3.10  # 3.12 may cause compatibility issues
  - pip
  - pip:
    - mlflow==2.21.3
    - torch==2.2.1  # Align with CUDA 12.1
    - transformers==4.40.0  # Latest stable for multi-GPU
    - accelerate==0.29.0  # Critical for device_map="auto"
    - bitsandbytes==0.43.0  # For 8/4-bit quantization
    - xformers==0.0.25  # Memory-efficient attention
name: chatts-env
variables:
  MLFLOW_HUGGINGFACE_DEVICE_MAP_STRATEGY: auto  # Not "sequential"
&lt;/LI-CODE&gt;
&lt;P&gt;2. Model loading fixes. In your MLflow model's inference script, enforce multi-GPU distribution.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;from transformers import AutoModelForCausalLM
import torch

def load_model(model_path):
    """Load the model in 16-bit precision, sharded across all visible GPUs."""
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",  # Let accelerate place layers across all GPUs
        torch_dtype=torch.float16,  # 16-bit weights
        trust_remote_code=True,  # Allow custom model code from the repository
        low_cpu_mem_usage=True  # Reduce CPU RAM pressure while loading
    )
    return model
&lt;/LI-CODE&gt;
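&lt;P&gt;If device_map="auto" still exhausts a single device, one option is to cap the memory that may be used per GPU through the max_memory argument of from_pretrained, which leaves headroom for activations and the KV cache. The following is only a sketch building on the loading code above: the helper names build_max_memory and load_model_with_caps are hypothetical, and the 16 GiB GPU size and 3 GiB headroom are assumed values, not measurements from this thread.&lt;/P&gt;

```python
def build_max_memory(num_gpus: int, gpu_gib: int, headroom_gib: int, cpu_gib: int = 0) -> dict:
    """Return a max_memory mapping that caps each GPU below its physical size."""
    if headroom_gib >= gpu_gib:
        raise ValueError("headroom must be smaller than the GPU memory")
    # Keys are GPU indices, values are size strings such as "13GiB".
    limits = {gpu: f"{gpu_gib - headroom_gib}GiB" for gpu in range(num_gpus)}
    if cpu_gib > 0:
        limits["cpu"] = f"{cpu_gib}GiB"  # Optional budget for CPU offloading
    return limits


def load_model_with_caps(model_path: str, max_memory: dict):
    """Load the model sharded across GPUs while respecting the per-device caps."""
    # Imported lazily so build_max_memory can be used without a GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        max_memory=max_memory,
        torch_dtype=torch.float16,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
    )


if __name__ == "__main__":
    # Assumed hardware: 4 GPUs with 16 GiB each, keeping 3 GiB free per GPU.
    print(build_max_memory(num_gpus=4, gpu_gib=16, headroom_gib=3))
```

&lt;P&gt;The headroom value is a tuning knob: too little and inference may still run out of memory, too much and the weights may no longer fit across the GPUs.&lt;/P&gt;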
&lt;P&gt;3. Serving endpoint configuration. Use a JSON payload like the following when creating the endpoint. JSON does not allow comments, so the notes are given here instead: workload_type selects the GPU hardware (GPU_LARGE is intended to request A100 GPUs), workload_size sets the concurrency range the endpoint scales between, and MAX_JOBS is intended to parallelize model loading. Supported workload types vary by cloud and region, so pick one that your workspace offers.&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;{
  "name": "qwen-chat-endpoint",
  "config": {
    "served_entities": [{
      "entity_name": "catalog.schema.model_name",
      "entity_version": "1",
      "workload_type": "GPU_LARGE",
      "workload_size": "Large",
      "task": "llm/v1/completions",
      "environment_vars": {
        "HF_HOME": "/dbfs/huggingface",
        "MAX_JOBS": "4"
      }
    }]
  }
}
&lt;/LI-CODE&gt;
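&lt;P&gt;To avoid hand-editing JSON, a payload like the one above could also be assembled in Python and serialized with the standard json module, which guarantees syntactically valid output. This is a minimal sketch: the function name build_endpoint_payload is hypothetical, the field names are simply copied from the payload in this reply, and the workload type and size are placeholders that should be replaced with values your workspace supports.&lt;/P&gt;

```python
import json


def build_endpoint_payload(endpoint_name, entity_name, entity_version,
                           workload_type, workload_size, environment_vars=None):
    """Assemble the endpoint creation payload as a plain dictionary."""
    served_entity = {
        "entity_name": entity_name,
        "entity_version": str(entity_version),  # The payload uses a string version
        "workload_type": workload_type,
        "workload_size": workload_size,
    }
    if environment_vars:
        served_entity["environment_vars"] = dict(environment_vars)
    return {"name": endpoint_name, "config": {"served_entities": [served_entity]}}


if __name__ == "__main__":
    payload = build_endpoint_payload(
        endpoint_name="qwen-chat-endpoint",
        entity_name="catalog.schema.model_name",
        entity_version=1,
        workload_type="GPU_LARGE",  # Placeholder, check what your workspace offers
        workload_size="Large",      # Placeholder
        environment_vars={"HF_HOME": "/dbfs/huggingface"},
    )
    # json.dumps always emits valid JSON, so no stray comments can slip in.
    print(json.dumps(payload, indent=2))
```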
&lt;P&gt;4. Other adjustments:&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Avoid T4 GPUs: they only have 16GB each. Use A100 instances with 40GB or more per GPU.&lt;/LI&gt;
&lt;LI&gt;Quantize further: add load_in_8bit=True to your model loading code if 16-bit isn't enough (in recent transformers versions, pass it via quantization_config=BitsAndBytesConfig(load_in_8bit=True)).&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;Check layer splitting: if device_map="auto" fails, manually specify no_split_module_classes (an argument of accelerate's infer_auto_device_map) for Qwen's architecture.&lt;/LI&gt;
&lt;/UL&gt;
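&lt;P&gt;As a rough sanity check for the GPU and quantization choices above, the weight memory can be estimated as the number of parameters times the bytes per parameter. The sketch below assumes 14 billion parameters (inferred from the model name ChatTS-14B) and counts weights only; activations, the KV cache, and CUDA overhead come on top, so real usage is higher. The function names are hypothetical.&lt;/P&gt;

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


def weights_fit_on_one_gpu(num_params: float, dtype: str, gpu_gib: float) -> bool:
    """Check whether the weights alone fit on a single GPU of the given size."""
    return gpu_gib > weight_memory_gib(num_params, dtype)


if __name__ == "__main__":
    num_params = 14e9  # Assumed from the model name
    for dtype in ("float16", "int8", "int4"):
        needed = weight_memory_gib(num_params, dtype)
        fits = weights_fit_on_one_gpu(num_params, dtype, gpu_gib=16)
        print(f"{dtype}: {needed:.1f} GiB of weights, fits on one 16 GiB GPU: {fits}")
```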
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the error persists, share the &lt;STRONG&gt;full CUDA OOM log&lt;/STRONG&gt; to debug layer-specific memory issues.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 16:52:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/model-serving-endpoint-cuda-oom-for-custom-model/m-p/118265#M4067</guid>
      <dc:creator>sarahbhord</dc:creator>
      <dc:date>2025-05-07T16:52:42Z</dc:date>
    </item>
    <item>
      <title>Re: Model Serving Endpoint: Cuda-OOM for Custom Model</title>
      <link>https://community.databricks.com/t5/machine-learning/model-serving-endpoint-cuda-oom-for-custom-model/m-p/119310#M4079</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/143567"&gt;@sarahbhord&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thanks for the feedback. So far I have been able to test a few of your suggestions. Unfortunately, no success yet.&lt;BR /&gt;1. Thanks for the hints. I had essentially used the default conda.yaml file generated by MLflow. Unfortunately, using your suggested version did not work either.&lt;BR /&gt;2. I have not had time to test that version yet, but I will definitely try it next week or so &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;3. I tried that using the databricks-sdk for Python. Unfortunately, I got an error:&amp;nbsp;databricks.sdk.errors.platform.InvalidParameterValue:&lt;FONT face="andale mono,times" color="#993300"&gt; Workload type 'GPU_LARGE' with size 'Large' is not supported. Please choose a node type from&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;4. I am aware of the T4's memory size. To clarify, my point was this: in a notebook, I could run the model on 4xT4 GPUs, so I assumed it should also work on the 4x medium-size GPU instances provided for model serving endpoints.&lt;/P&gt;&lt;P&gt;Currently, I am trying 8-bit quantization, so far with little success (I am now running into timeouts on endpoint creation instead of the OOM, though).&lt;/P&gt;&lt;P&gt;Anyway, thanks for the effort.&lt;BR /&gt;Greetings, Daniel&lt;/P&gt;</description>
      <pubDate>Thu, 15 May 2025 11:37:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/model-serving-endpoint-cuda-oom-for-custom-model/m-p/119310#M4079</guid>
      <dc:creator>DaPo</dc:creator>
      <dc:date>2025-05-15T11:37:05Z</dc:date>
    </item>
  </channel>
</rss>

