Hi Everyone,
I have been running the code below, but I'm getting a CUDA out-of-memory error even though I have 4 GPUs in the cluster, which should give me 64 GB of GPU memory in total; instead the run fails at 16 GB. I assume the code is not utilizing all 4 GPUs. How do I enable that so it runs on all 4 GPUs?
CUDA out of memory. Tried to allocate 980.00 MiB (GPU 0; 15.77 GiB total capacity; 10.43 GiB already allocated; 713.12 MiB free; 14.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
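The error message itself suggests trying max_split_size_mb. If I understand the docs correctly, the allocator config can be set via an environment variable before the first CUDA allocation; a minimal sketch of what I mean (the 128 MiB value is just a guess, not a recommendation):

import os

# Must be set before the first CUDA allocation in the process;
# 128 is an arbitrary value picked for illustration.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

That only reduces fragmentation at best, though; my main question is still about using all 4 GPUs.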
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

model_checkpoint = "bigscience/bloomz-560m"
model_name = model_checkpoint.split("/")[-1]  # used to name the output directory
batch_size = 16  # per-device batch size

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
data_collator = default_data_collator
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    fp16=True,
)
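As a side note: would halving the per-device batch and compensating with gradient accumulation be a reasonable way to cut peak memory? A hypothetical variant of the arguments above, not wired into the Trainer below (the factor of 2 is arbitrary):

args_low_mem = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size // 2,  # halves per-step memory (assumes batch_size is even)
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=2,  # keeps the effective train batch size unchanged
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    fp16=True,
)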
trainer = Trainer(
    model,
    args,
    # tokenized_datasets comes from the SQuAD preprocessing step (not shown here)
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
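To rule out the obvious, I assume I can check how many GPUs the process actually sees with something like this (train.py is a hypothetical name for the script above):

import torch

# If this prints 1, the cluster scheduler or CUDA_VISIBLE_DEVICES is only
# exposing one GPU to the process, so Trainer cannot use the other three.
print(torch.cuda.device_count())

# My understanding is that for DistributedDataParallel training the script
# has to be started through a launcher, e.g.:
#   torchrun --nproc_per_node=4 train.py
# with no changes needed in the Trainer code itself.

If all four GPUs are visible, is launching with torchrun the recommended route, or should the default DataParallel behavior already be spreading the load?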