
Databricks GPU utilization not to full extent

shanmukhasai96
New Contributor

Hi Everyone, 

I have been running the code below. However, I'm getting a CUDA out of memory error even though I have 4 GPUs in the cluster, which should ideally give 64 GB of GPU memory, but the code is failing at 16 GB. I assume the code is not utilizing all 4 GPUs. How do I enable that and run on all 4 GPUs?

CUDA out of memory. Tried to allocate 980.00 MiB (GPU 0; 15.77 GiB total capacity; 10.43 GiB already allocated; 713.12 MiB free; 14.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

 

model_checkpoint = "bigscience/bloomz-560m"
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    fp16= True
)
trainer = Trainer(
   model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer
)
1 REPLY

Jisong
New Contributor II

Your code loads the full model onto a single GPU, so having multiple GPUs does not prevent out-of-memory errors. By default, transformers models only use DDP (distributed data parallel): each GPU holds a complete copy of your model to speed up training. The maximum VRAM any one copy can use is therefore the memory of a single GPU, i.e. 16 GB here. The moment OOM happens on one GPU, it happens on all the others.
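
If you want to confirm this on your cluster, a quick check like the following (a sketch, assuming a reasonably recent PyTorch) prints the usage per GPU; with plain data parallelism no single replica can ever use more than one card's ~16 GB:

import torch

# Print how much memory each visible GPU is actually using.
# Under data parallelism every GPU holds a full model replica,
# so each replica is capped at its own ~16 GB regardless of how many GPUs you have.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 1e9:.1f} GB used of {total / 1e9:.1f} GB")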

To train a single model across 4 GPUs, you need a different kind of parallelism: one that splits the model into multiple shards, gives each GPU one shard to train, and has the GPUs communicate with each other to combine the results into a single training loop.

ZeRO DDP or Fully Sharded Data Parallel is what you are looking for.
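
For example, with a recent transformers release you can ask the Trainer to shard the model with FSDP straight from TrainingArguments. This is only a sketch: the layer class name "BloomBlock" and the reuse of your other arguments are assumptions, and the script still has to be launched with one process per GPU (e.g. via torchrun or accelerate launch) for the sharding to span all 4 GPUs.

from transformers import TrainingArguments

# Sketch: same arguments as before, plus FSDP so parameters, gradients and
# optimizer state are sharded across the 4 GPUs instead of replicated on each.
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    fp16=True,
    fsdp="full_shard",                                # shard model, gradients and optimizer state
    fsdp_transformer_layer_cls_to_wrap="BloomBlock",  # assumed: Bloom's transformer block class
)

TrainingArguments also accepts a deepspeed= config if you prefer ZeRO stage 3 over FSDP.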
