Databricks GPU utilization not to full extent

shanmukhasai96 — Wed, 15 Nov 2023 01:11:58 GMT

Hi Everyone,

I have been running the code below. However, I'm getting a CUDA out of memory error even though I have 4 GPUs in the cluster, which should give 64 GB of GPU memory in total, yet the code fails at 16 GB. I assume the code is not utilizing all 4 GPUs. How do I enable this and run on all 4 GPUs?

CUDA out of memory. Tried to allocate 980.00 MiB (GPU 0; 15.77 GiB total capacity; 10.43 GiB already allocated; 713.12 MiB free; 14.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

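from transformers import AutoModelForQuestionAnswering, Trainer, TrainingArguments

model_checkpoint = "bigscience/bloomz-560m"
# model_name is not shown in the post; assumed to be the checkpoint's short name
model_name = model_checkpoint.split("/")[-1]

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# batch_size, tokenized_datasets, data_collator and tokenizer are defined earlier (not shown)
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    fp16=True,
)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)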

Re: Databricks GPU utilization not to full extent

Jisong — Fri, 16 Feb 2024 00:15:35 GMT

Your code needs the full model to fit on a single GPU, so having multiple GPUs does not prevent out-of-memory errors. By default, the Hugging Face Trainer only uses data parallelism (DataParallel, or DDP, distributed data parallel, when run under a distributed launcher), so each GPU holds a complete copy of your model and processes a different part of each batch to speed up training. Thus the maximum VRAM available to the model is that of a single GPU, 16 GB, not the 64 GB total. The moment one GPU runs out of memory, the others will too.

To train a single model across 4 GPUs, you need a different type of parallelism: one that splits the model into shards, has each GPU hold and train one shard, and has the GPUs communicate with each other to combine the results within a single training loop.
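A rough back-of-the-envelope sketch of the difference, not a measurement. It assumes about 560 million parameters (taken from the checkpoint name), fp32 weights, fp32 gradients, and two fp32 Adam moments (16 bytes per parameter), and it ignores activations and allocator overhead. The function names are made up for illustration.

```python
GIB = 2**30

def training_state_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Weights + gradients + Adam moments, all assumed fp32 (4 + 4 + 8 bytes)."""
    return num_params * bytes_per_param / GIB

def per_gpu_replicated(num_params: int, n_gpus: int) -> float:
    # Data parallel: every GPU keeps the full training state,
    # so the number of GPUs does not change the per-GPU footprint.
    return training_state_gib(num_params)

def per_gpu_sharded(num_params: int, n_gpus: int) -> float:
    # Idealized full sharding: the training state is split evenly across GPUs.
    return training_state_gib(num_params) / n_gpus

NUM_PARAMS = 560_000_000  # assumption based on the name "bloomz-560m"

for n in (1, 4):
    print(
        f"{n} GPU(s): replicated {per_gpu_replicated(NUM_PARAMS, n):.2f} GiB, "
        f"sharded {per_gpu_sharded(NUM_PARAMS, n):.2f} GiB per GPU"
    )
```

Under these assumptions the replicated state alone is a bit over 8 GiB per GPU before any activations, which is roughly in line with the 10.43 GiB reported as allocated in the error message. Real usage depends on batch size and sequence length as well.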

DeepSpeed ZeRO or PyTorch Fully Sharded Data Parallel (FSDP) is what you are looking for.
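A minimal sketch of how either option might be switched on through TrainingArguments, which accepts an `fsdp` option string and a `deepspeed` config (path or dict). The helper names, the output directory, and the batch size are hypothetical, and the settings are a starting point rather than tuned values. Sharding only takes effect when training is launched with one process per GPU (for example with torchrun, accelerate, or TorchDistributor on Databricks), not from a plain single-process notebook cell.

```python
def fsdp_training_kwargs(output_dir: str, batch_size: int) -> dict:
    """Keyword arguments for TrainingArguments with FSDP full sharding enabled."""
    return {
        "output_dir": output_dir,
        "per_device_train_batch_size": batch_size,
        "per_device_eval_batch_size": batch_size,
        "fp16": True,
        # Shard parameters, gradients and optimizer state; wrap layers automatically.
        "fsdp": "full_shard auto_wrap",
    }

def zero3_config() -> dict:
    """Minimal DeepSpeed ZeRO stage 3 config; "auto" lets the Trainer fill in values."""
    return {
        "zero_optimization": {"stage": 3},
        "fp16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

# Hypothetical values for illustration only.
kwargs = fsdp_training_kwargs("bloomz-560m-finetuned-squad", batch_size=8)

# With transformers installed, inside a script launched with one process per GPU:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**kwargs)                      # FSDP variant
#   args = TrainingArguments(output_dir="...", fp16=True,
#                            deepspeed=zero3_config())      # ZeRO variant (needs deepspeed)
```

Use one of the two, not both at once. The rest of the Trainer setup from the original post stays the same.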
