Machine Learning

OutOfMemoryError: CUDA out of memory on LLM fine-tuning

hv129
New Contributor
I am trying to fine-tune the llama2_lora model using the xTuring library (batch size is 1). I am working on a cluster with 1 worker (28 GB memory, 4 cores) and 1 driver (110 GB memory, 16 cores).
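
For reference, the fine-tuning code looks roughly like this (a minimal sketch following the xTuring quickstart; the dataset path is a placeholder):

from xturing.datasets import InstructionDataset
from xturing.models import BaseModel

# Placeholder path: an instruction dataset saved in xTuring's expected format
dataset = InstructionDataset("./my_instruction_data")

# "llama2_lora" trains LoRA adapters on Llama 2 rather than the full weights
model = BaseModel.create("llama2_lora")

# Training runs on GPU 0 of whichever node the code executes on
model.finetune(dataset=dataset)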
 
I am facing this error: OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 15.57 GiB total capacity; 8.02 GiB already allocated; 57.44 MiB free; 8.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
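
The max_split_size_mb setting the message refers to can be supplied through the PYTORCH_CUDA_ALLOC_CONF environment variable; a minimal sketch (the 128 MB value is only an assumed starting point):

import os

# Must be set before PyTorch makes its first CUDA allocation
# (e.g. at the top of the notebook, before the training code runs).
# 128 MB is an assumed starting value; tune it for your workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"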

It says that the total capacity is 15.57 GiB. Does this memory correspond to the worker or the driver memory? If yes, shouldn't it be more than 15.57 GiB? Is the current implementation not able to utilize the available memory?
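
A minimal check (assuming PyTorch is available on the GPU node) that prints what device 0 actually reports; this should reveal whether the 15.57 GiB figure is the GPU's own memory rather than the worker or driver RAM:

import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                                   # GPU model on this node
print(props.total_memory / 1024**3, "GiB total")    # should match the ~15.57 GiB in the error
print(torch.cuda.memory_allocated(0) / 1024**3, "GiB allocated by tensors")
print(torch.cuda.memory_reserved(0) / 1024**3, "GiB reserved by PyTorch")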