OutOfMemoryError: CUDA out of memory on LLM Finetuning

hv129
New Contributor
I am trying to fine-tune the llama2_lora model using the xTuring library with a batch size of 1. I am working on a cluster with 1 Worker (28 GB memory, 4 cores) and 1 Driver (110 GB memory, 16 cores).
 
I am facing this error: OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 15.57 GiB total capacity; 8.02 GiB already allocated; 57.44 MiB free; 8.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

It says that the total capacity is 15.57 GiB. Does this memory correspond to either the worker or the driver memory? If so, shouldn't it be more than 15.57 GiB? Is the current implementation unable to utilize the available memory?
1 REPLY

Kaniz
Community Manager

Hi @hv129,

The error message you’re encountering indicates that the GPU is running out of memory while PyTorch tries to allocate additional memory for your model.

Let’s break down the details:

  1. Total Capacity: The 15.57 GiB mentioned in the error message is the total memory of the GPU device itself (GPU 0). It covers everything resident on that card: the memory PyTorch has reserved, the small amount still free, and anything held by the CUDA context or other processes.

  2. Allocated Memory: 8.02 GiB is already allocated by PyTorch in your training process, typically for the model weights, gradients, optimizer state, and activations.

  3. Free Memory: Only 57.44 MiB of GPU memory is still free, which is less than the 86.00 MiB PyTorch is trying to allocate; that shortfall is what triggers the error.

  4. Reserved Memory: PyTorch has reserved 8.02 GiB in total. Reserved memory is the allocated memory plus any cached blocks kept for reuse; since reserved equals allocated here, cache fragmentation is not the main problem. (You can print these figures yourself with the short snippet below.)
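
If you want to confirm these numbers yourself, here is a minimal sketch using standard torch.cuda calls (nothing xTuring-specific) that prints the same figures the error message reports:

```python
import torch

device = torch.device("cuda:0")
gib = 1024 ** 3

# Total capacity of GPU 0, as reported by the driver.
total = torch.cuda.get_device_properties(device).total_memory
# Memory currently occupied by live tensors in this process.
allocated = torch.cuda.memory_allocated(device)
# Memory held by PyTorch's caching allocator (allocated + cached blocks).
reserved = torch.cuda.memory_reserved(device)

print(f"Total capacity : {total / gib:.2f} GiB")
print(f"Allocated      : {allocated / gib:.2f} GiB")
print(f"Reserved       : {reserved / gib:.2f} GiB")

# For a detailed per-block breakdown:
print(torch.cuda.memory_summary(device))
```

Running this right before the failing step shows how much headroom is actually left on GPU 0, independent of the worker's and driver's CPU RAM.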

Now, let’s address your questions:

  • Does this memory represent any of the worker or driver memory?

    • No. The 28 GB and 110 GB figures are CPU RAM on your worker and driver nodes, while the 15.57 GiB is the memory of the GPU device itself (GPU 0). The GPU memory is specific to the card you’re using for computation and is not pooled with node RAM.
  • Should it be more than 15.57 GiB?

    • Ideally, you want to ensure that your model’s memory requirements (including intermediate tensors during forward and backward passes) do not exceed the available GPU memory. If your model requires more memory than what’s available, you’ll encounter CUDA out-of-memory errors.
    • In your case, 15.57 GiB corresponds to a 16 GB GPU. Whether that is enough depends on the model size, precision, sequence length, and batch size: a 7B-parameter Llama 2 model in fp16 already needs roughly 7B × 2 bytes ≈ 13 GiB for the frozen weights alone, so LoRA finetuning on this card usually requires loading the base model in 8-bit or 4-bit, or offloading part of it to CPU.
  • Is the current implementation not able to utilize the available memory?

    • It seems that the current implementation is using a significant portion of the available memory, leaving only a small amount of free memory.
    • To address this issue, consider the following steps:
      • Reduce Batch Size / Sequence Length: Your batch size is already 1, so it cannot go any lower; reduce the maximum sequence length instead and enable gradient checkpointing to shrink the activations kept for the backward pass.
      • Check Model Size: Verify if the llama2_lora model itself is too large for the available memory. If so, consider using a smaller model or a more memory-efficient variant.
      • Memory Management: Explore PyTorch’s memory management options, such as setting max_split_size_mb via the PYTORCH_CUDA_ALLOC_CONF environment variable to reduce fragmentation (see the sketch after this list).
      • Offload to CPU: If GPU memory remains insufficient, consider offloading some computations to the CPU (if feasible).
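
To make the last few suggestions concrete, here is a rough sketch of a memory-leaner setup: it sets max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF, loads the base model in fp16 with gradient checkpointing, and trains only a LoRA adapter. Note that this uses Hugging Face transformers + peft rather than xTuring’s own API, and the checkpoint name is only an assumption, so treat it as an illustration of the idea rather than a drop-in replacement:

```python
import os

# Must be set before the first CUDA allocation (e.g., at the top of the
# training script or in the cluster's environment configuration).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; adjust to your model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp16 weights: ~2 bytes per parameter
    device_map="auto",          # lets accelerate spill layers to CPU if the GPU fills up
)
model.gradient_checkpointing_enable()  # trade extra compute for much smaller activations

# Train only a small LoRA adapter instead of the full 7B parameters.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

If the xTuring library offers an int8 or int4 LoRA variant of the llama2 model, switching to that is the most direct way to get the same memory savings without leaving its API.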