Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Databricks GPU utilization not to full extent

shanmukhasai96
New Contributor

Hi Everyone,

I have been running the code below. However, I'm getting a CUDA out-of-memory error even though I have 4 GPUs in the cluster, which should ideally give 64 GB of GPU memory in total; the code is failing at 16 GB. I assume the code is not utilizing all 4 GPUs. How do I enable it to run on all 4 GPUs?

CUDA out of memory. Tried to allocate 980.00 MiB (GPU 0; 15.77 GiB total capacity; 10.43 GiB already allocated; 713.12 MiB free; 14.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
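The error text itself suggests one mitigation: when reserved memory far exceeds allocated memory, allocator fragmentation may be the problem, and PyTorch's documented `PYTORCH_CUDA_ALLOC_CONF` environment variable can cap the split block size. This is a tuning knob, not a fix for a genuine capacity shortfall; the value 128 below is an example, not a recommendation.

```shell
# Cap the allocator's split block size (in MiB) to reduce fragmentation.
# Set this before launching the training process.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```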

 

from transformers import (
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
)

model_checkpoint = "bigscience/bloomz-560m"
model_name = model_checkpoint.split("/")[-1]
batch_size = 16  # assumed value; not defined in the original snippet

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    fp16=True,  # mixed precision to reduce memory use
)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
1 REPLY

Jisong
New Contributor II

Your code is loading the full model onto a single GPU, so having multiple GPUs does not prevent out-of-memory errors. By default, transformer models only use DDP (distributed data parallel): each GPU holds a full copy of your model to speed up training. Thus the maximum VRAM you can use is that of a single GPU, i.e. 16 GB here. The moment OOM happens on one GPU, it will happen on all the others.
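Given that per-GPU cap, one common workaround before reaching for sharding is to lower `per_device_train_batch_size` and compensate with `gradient_accumulation_steps` (both standard `TrainingArguments` parameters): activation memory shrinks with the per-step batch, while the optimizer still sees the product of the two. A minimal sketch of the arithmetic, assuming an original batch size of 16:

```python
# Trade per-step memory for more accumulation steps:
# the effective batch size the optimizer sees is unchanged.
per_device_train_batch_size = 4   # smaller per-step batch -> less activation memory
gradient_accumulation_steps = 4   # accumulate gradients over 4 steps

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch)  # 16, same as the original batch size
```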

To train a single model across all 4 GPUs, you need a different kind of parallelism: one that splits the model into multiple shards, assigns each shard to a GPU, and has the GPUs communicate with each other to combine the results into a single training loop.

ZeRO-style sharded data parallelism (e.g. DeepSpeed ZeRO) or Fully Sharded Data Parallel (FSDP) is what you are looking for.
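In recent transformers versions, FSDP can be enabled directly through `TrainingArguments` via its `fsdp` option, with the script launched as one process per GPU. A configuration sketch, assuming 4 GPUs and that the rest of the script above stays unchanged:

```python
# Sketch: enable Fully Sharded Data Parallel via the HF Trainer.
# Launch with one process per GPU, e.g.:
#   torchrun --nproc_per_node=4 train.py
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    per_device_train_batch_size=batch_size,
    fp16=True,
    fsdp="full_shard auto_wrap",  # shard params, grads, and optimizer state across GPUs
)
```

With full sharding, no single GPU ever holds the whole model, so the 4 × 16 GB of the cluster is usable in aggregate rather than being capped at one card's 16 GB.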
