Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

CUDA out of memory

gary7135
New Contributor II

I am trying out the new Meta Llama 2 model.

Following the databricks provided notebook example: https://github.com/databricks/databricks-ml-examples/blob/master/llm-models/llamav2/llamav2-13b/01_l...


I keep getting CUDA out of memory. My GPU cluster runtime is 

13.2 ML (includes Apache Spark 3.4.0, GPU, Scala 2.12), with 256GB memory and 1 GPU


Error message:

CUDA out of memory. Tried to allocate 314.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 313.75 MiB free; 13.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

What would be a good way to solve this issue?
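One avenue the error message itself points at is the allocator setting max_split_size_mb, which can reduce fragmentation when reserved memory far exceeds allocated memory. A minimal sketch (the value 128 is an illustrative starting point, not a recommendation from this thread; the variable must be set before PyTorch initializes CUDA, e.g. in the first notebook cell):

```python
import os

# Tune the CUDA caching allocator to limit fragmentation, as suggested
# by the OOM message. Must run before torch touches the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```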

6 REPLIES

Kumaran
Databricks Employee

Hi @gary7135,

Thank you for posting the question in the Databricks community.

Kindly update the configuration by setting fp16=True instead of its current value of False. For further information regarding this CUDA error, please refer to this documentation.
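As a rough back-of-the-envelope check of why precision matters here (my own arithmetic, not from the thread): each fp32 weight takes 4 bytes, each fp16 weight only 2, so half precision halves the memory needed just for the weights of a 13B-parameter model.

```python
params = 13e9  # approximate Llama 2 13B parameter count

bytes_fp32 = params * 4  # 4 bytes per fp32 weight
bytes_fp16 = params * 2  # 2 bytes per fp16 weight

gib = 1024 ** 3
print(f"fp32 weights: {bytes_fp32 / gib:.1f} GiB")  # ~48.4 GiB
print(f"fp16 weights: {bytes_fp16 / gib:.1f} GiB")  # ~24.2 GiB
```

Note that even in fp16, 13B weights alone exceed the 14.76 GiB of GPU capacity reported in the error, which may be why a later reply links to the 7B notebook instead.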

gary7135
New Contributor II

Thank you. Can you provide example of how to set this argument in notebooks?

Kumaran
Databricks Employee

Hello @gary7135,

Thank you for the response.

According to the GitHub repo you shared above, you should have a configuration file where you need to make this setting. Please refer to the image below for more details:

Kumaran_0-1689964296082.png
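Since the attached image may not render here, below is a minimal sketch of the two common places a half-precision setting appears, assuming the Hugging Face transformers API used in the linked notebook (the model name, function names, and output directory are illustrative, not taken from the thread; loading the gated Llama 2 weights requires access approval and a GPU, so the functions are only defined, not called):

```python
def load_llama2_fp16(model_name="meta-llama/Llama-2-13b-chat-hf"):
    """Illustrative sketch: load model weights in half precision at
    inference time, instead of the fp32 default."""
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # load weights as fp16 instead of fp32
        device_map="auto",          # place layers on the available GPU
    )


def training_args_fp16(output_dir="/tmp/llama2-finetune"):
    """Illustrative sketch: a literal fp16=True flag appears in
    TrainingArguments when fine-tuning with mixed precision."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, fp16=True)
```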

gary7135
New Contributor II

Thank you. I am running this Python file directly in a Databricks notebook: https://github.com/databricks/databricks-ml-examples/blob/master/llm-models/llamav2/llamav2-7b/01_lo...

The file does not seem to reference the config JSON file?

Kumaran
Databricks Employee

Hi @gary7135,

If you are following the same code as the GitHub example, you will need to work out where to pass the fp16=True configuration in your file.

Anonymous
Not applicable

Hi @Kumaran 

Hope you are well. Just wanted to see if you were able to find an answer to your question, and if so, would you like to mark an answer as best? It would be really helpful for the other members too.

Cheers!
