<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic OutOfMemoryError: CUDA out of memory on LLM Finetuning in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/outofmemoryerror-cuda-out-of-memory-on-llm-finetuning/m-p/61255#M3040</link>
    <description>&lt;DIV&gt;I am trying to finetune the &lt;STRONG&gt;llama2_lora&lt;/STRONG&gt; model using the &lt;STRONG&gt;xTuring&lt;/STRONG&gt; library, but I am facing this error (batch size is 1). I am working on a cluster with &lt;STRONG&gt;1 Worker (28 GB Memory, 4 Cores)&lt;/STRONG&gt; and &lt;STRONG&gt;1 Driver (110 GB Memory, 16 Cores)&lt;/STRONG&gt;.&lt;/DIV&gt;&lt;DIV&gt;The error is: OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 15.57 GiB total capacity; 8.02 GiB already allocated; 57.44 MiB free; 8.02 GiB reserved in total by PyTorch) If reserved memory is &amp;gt;&amp;gt; allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.&lt;/DIV&gt;&lt;DIV&gt;It says that the total capacity is 15.57 GiB. Does this memory correspond to any of the worker or driver memory? If so, should it be more than 15.57 GiB? Is the current implementation unable to utilize the available memory?&lt;/DIV&gt;</description>
    <pubDate>Tue, 20 Feb 2024 12:40:24 GMT</pubDate>
    <dc:creator>hv129</dc:creator>
    <dc:date>2024-02-20T12:40:24Z</dc:date>
    <item>
      <title>OutOfMemoryError: CUDA out of memory on LLM Finetuning</title>
      <link>https://community.databricks.com/t5/machine-learning/outofmemoryerror-cuda-out-of-memory-on-llm-finetuning/m-p/61255#M3040</link>
      <description>&lt;DIV&gt;I am trying to finetune the &lt;STRONG&gt;llama2_lora&lt;/STRONG&gt; model using the &lt;STRONG&gt;xTuring&lt;/STRONG&gt; library, but I am facing this error (batch size is 1). I am working on a cluster with &lt;STRONG&gt;1 Worker (28 GB Memory, 4 Cores)&lt;/STRONG&gt; and &lt;STRONG&gt;1 Driver (110 GB Memory, 16 Cores)&lt;/STRONG&gt;.&lt;/DIV&gt;&lt;DIV&gt;The error is: OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 15.57 GiB total capacity; 8.02 GiB already allocated; 57.44 MiB free; 8.02 GiB reserved in total by PyTorch) If reserved memory is &amp;gt;&amp;gt; allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.&lt;/DIV&gt;&lt;DIV&gt;It says that the total capacity is 15.57 GiB. Does this memory correspond to any of the worker or driver memory? If so, should it be more than 15.57 GiB? Is the current implementation unable to utilize the available memory?&lt;/DIV&gt;</description>
      <pubDate>Tue, 20 Feb 2024 12:40:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/outofmemoryerror-cuda-out-of-memory-on-llm-finetuning/m-p/61255#M3040</guid>
      <dc:creator>hv129</dc:creator>
      <dc:date>2024-02-20T12:40:24Z</dc:date>
    </item>
  </channel>
</rss>
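
The error text in the post reports a 15.57 GiB "total capacity" and points at max_split_size_mb / PYTORCH_CUDA_ALLOC_CONF. Below is a minimal sketch of how one might inspect what that number refers to and apply the allocator hint from the error message. It assumes a single-GPU notebook session and plain PyTorch; it is not the poster's actual xTuring fine-tuning code, and the device index 0 and the 128 MiB split size are illustrative choices.

```python
# Sketch: inspect GPU memory and apply the allocator hint from the error.
import os

# The error message suggests max_split_size_mb; PYTORCH_CUDA_ALLOC_CONF must be
# set before the CUDA caching allocator is first used (ideally before any CUDA work).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    allocated_gib = torch.cuda.memory_allocated(0) / 1024**3
    reserved_gib = torch.cuda.memory_reserved(0) / 1024**3

    # total_memory is the GPU's own VRAM (roughly the 15.57 GiB in the error),
    # which is separate from the worker's 28 GB and the driver's 110 GB host RAM.
    print(f"GPU: {props.name}")
    print(f"Total VRAM:           {total_gib:.2f} GiB")
    print(f"Allocated by tensors: {allocated_gib:.2f} GiB")
    print(f"Reserved by PyTorch:  {reserved_gib:.2f} GiB")

    # Releasing cached (reserved-but-unused) blocks can help between runs,
    # though it does not free memory still held by live tensors.
    torch.cuda.empty_cache()
```

Running a check like this before and after loading the model makes it easier to see how much of the GPU's VRAM, as opposed to cluster host memory, is actually consumed by the fine-tuning job.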

