<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to utilize clustered gpu for large hf models in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/how-to-utilize-clustered-gpu-for-large-hf-models/m-p/133104#M1181</link>
    <description>&lt;P&gt;1. Are you using a model-parallel library such as &lt;A href="https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html" target="_self"&gt;FSDP&lt;/A&gt; or &lt;A href="https://github.com/deepspeedai/DeepSpeed" target="_self"&gt;DeepSpeed&lt;/A&gt;? If not, every GPU will load the entire model's weights.&lt;/P&gt;
&lt;P&gt;2. If yes to 1, Unity Catalog Volumes are exposed on every node at &lt;SPAN class="s1"&gt;/Volumes/&amp;lt;catalog&amp;gt;/&amp;lt;schema&amp;gt;/&amp;lt;volume&amp;gt;/...&lt;/SPAN&gt;, so workers can open files directly without going through the driver.&lt;/P&gt;
&lt;P&gt;Example code would look like the following:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import os, torch
local_rank = int(os.environ.get("LOCAL_RANK", 0))
ckpt_dir = "/Volumes/&amp;lt;catalog&amp;gt;/&amp;lt;schema&amp;gt;/&amp;lt;volume&amp;gt;/checkpoints/epoch-10"

# Example DeepSpeed ZeRO-3 shard name pattern; adjust to your framework.
fname = f"mp_rank_{local_rank:02}_model_states.pt"

# Always deserialize to CPU first to avoid big transient spikes in driver/GPU
state = torch.load(os.path.join(ckpt_dir, fname), map_location="cpu")
# then load into the module on this worker
model.load_state_dict(state["module"], strict=False)&lt;/LI-CODE&gt;
&lt;P&gt;Please let me know if this solves your problem. Thanks!&lt;/P&gt;</description>
    <pubDate>Fri, 26 Sep 2025 18:16:32 GMT</pubDate>
    <dc:creator>lin-yuan</dc:creator>
    <dc:date>2025-09-26T18:16:32Z</dc:date>
    <item>
      <title>How to utilize clustered gpu for large hf models</title>
      <link>https://community.databricks.com/t5/generative-ai/how-to-utilize-clustered-gpu-for-large-hf-models/m-p/120789#M925</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am using a GPU cluster (driver: 1 GPU; workers: 3 GPUs) and caching model data in Unity Catalog, but when loading model checkpoint shards it always uses driver memory and fails due to insufficient memory.&lt;/P&gt;&lt;P&gt;How can I use the whole cluster's GPUs when loading HF models?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 03 Jun 2025 07:26:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/how-to-utilize-clustered-gpu-for-large-hf-models/m-p/120789#M925</guid>
      <dc:creator>dk_g</dc:creator>
      <dc:date>2025-06-03T07:26:08Z</dc:date>
    </item>
    <item>
      <title>Re: How to utilize clustered gpu for large hf models</title>
      <link>https://community.databricks.com/t5/generative-ai/how-to-utilize-clustered-gpu-for-large-hf-models/m-p/133104#M1181</link>
      <description>&lt;P&gt;1. Are you using a model-parallel library such as &lt;A href="https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html" target="_self"&gt;FSDP&lt;/A&gt; or &lt;A href="https://github.com/deepspeedai/DeepSpeed" target="_self"&gt;DeepSpeed&lt;/A&gt;? If not, every GPU will load the entire model's weights.&lt;/P&gt;
&lt;P&gt;2. If yes to 1, Unity Catalog Volumes are exposed on every node at &lt;SPAN class="s1"&gt;/Volumes/&amp;lt;catalog&amp;gt;/&amp;lt;schema&amp;gt;/&amp;lt;volume&amp;gt;/...&lt;/SPAN&gt;, so workers can open files directly without going through the driver.&lt;/P&gt;
&lt;P&gt;Example code would look like the following:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import os, torch
local_rank = int(os.environ.get("LOCAL_RANK", 0))
ckpt_dir = "/Volumes/&amp;lt;catalog&amp;gt;/&amp;lt;schema&amp;gt;/&amp;lt;volume&amp;gt;/checkpoints/epoch-10"

# Example DeepSpeed ZeRO-3 shard name pattern; adjust to your framework.
fname = f"mp_rank_{local_rank:02}_model_states.pt"

# Always deserialize to CPU first to avoid big transient spikes in driver/GPU
state = torch.load(os.path.join(ckpt_dir, fname), map_location="cpu")
# then load into the module on this worker
model.load_state_dict(state["module"], strict=False)&lt;/LI-CODE&gt;
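&lt;P&gt;To actually run that per-rank loading on every GPU in the cluster, here is a minimal sketch. It assumes Databricks Runtime ML (where TorchDistributor is available) and uses a hypothetical Volume path (main/default/models) in place of your own catalog, schema, and volume; the shard-name pattern is likewise just the DeepSpeed ZeRO-style example from above and should be adjusted to your framework:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;

```python
import os

# Hypothetical Volume path; substitute your own catalog/schema/volume.
CKPT_DIR = "/Volumes/main/default/models/checkpoints/epoch-10"

def shard_path(ckpt_dir, local_rank):
    # DeepSpeed ZeRO-style shard name; adjust the pattern to your framework.
    return os.path.join(ckpt_dir, f"mp_rank_{local_rank:02}_model_states.pt")

def load_on_this_gpu():
    # Runs once per GPU process; the launcher sets LOCAL_RANK for each one.
    import torch
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    # CPU-first deserialization, exactly as in the snippet above.
    state = torch.load(shard_path(CKPT_DIR, local_rank), map_location="cpu")
    # ... load state into the model on this worker, then train or infer ...

# On Databricks Runtime ML, TorchDistributor launches one process per GPU
# across the cluster, so each worker reads its own shard straight from the
# Volume path instead of routing the bytes through the driver. Shown
# commented out because it needs a live Spark session on the cluster:
#
#   from pyspark.ml.torch.distributor import TorchDistributor
#   TorchDistributor(num_processes=4, local_mode=False, use_gpu=True).run(load_on_this_gpu)

print(shard_path(CKPT_DIR, 3))  # prints .../checkpoints/epoch-10/mp_rank_03_model_states.pt
```

&lt;/LI-CODE&gt;
&lt;P&gt;With 4 GPUs total (1 driver + 3 workers, as in your setup), num_processes=4 gives each GPU its own process and its own shard, so no single node has to hold the full checkpoint.&lt;/P&gt;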
&lt;P&gt;Please let me know if this solves your problem. Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 26 Sep 2025 18:16:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/how-to-utilize-clustered-gpu-for-large-hf-models/m-p/133104#M1181</guid>
      <dc:creator>lin-yuan</dc:creator>
      <dc:date>2025-09-26T18:16:32Z</dc:date>
    </item>
  </channel>
</rss>

