How to allocate more memory to GPU when training through Databricks notebook

varun-adi · 09-22-2023

I am trying to train a HuBERT model, specifically facebook/hubert-base-ls960, on a custom speech dataset.

The training parameters are below:

from transformers import TrainingArguments

trainer_config = {
    "OUTPUT_DIR": "results",
    "TRAIN_EPOCHS": 6,
    "TRAIN_BATCH_SIZE": 2,
    "EVAL_BATCH_SIZE": 2,
    "GRADIENT_ACCUMULATION_STEPS": 4,
    "WARMUP_STEPS": 500,
    "DECAY": 0.01,
    "INITIAL_LOGGING_STEPS": 10,  # Smaller value for initial logging
    "LOGGING_STEPS": 100,  # Larger value for subsequent logging
    "MODEL_DIR": "/dbfs/FileStore/wav-files/personalityDataset/Augmented-HubertModel7Epochs",
    "SAVE_STEPS": 100,
}

training_args = TrainingArguments(
    output_dir=trainer_config["OUTPUT_DIR"],
    gradient_accumulation_steps=trainer_config["GRADIENT_ACCUMULATION_STEPS"],
    num_train_epochs=trainer_config["TRAIN_EPOCHS"],
    per_device_train_batch_size=trainer_config["TRAIN_BATCH_SIZE"],
    per_device_eval_batch_size=trainer_config["EVAL_BATCH_SIZE"],
    warmup_steps=trainer_config["WARMUP_STEPS"],
    save_steps=trainer_config["SAVE_STEPS"],
    weight_decay=trainer_config["DECAY"],
    evaluation_strategy="epoch",  # Report metrics at the end of each epoch
    logging_steps=trainer_config["INITIAL_LOGGING_STEPS"],  # Initial logging frequency
    fp16=True,  # Enable mixed-precision training
)
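
For reference, the effective batch size implied by these settings can be worked out as below. This is only a minimal sketch: the single-GPU assumption and the `num_devices` name are mine, not part of the original configuration.

```python
# Effective batch size implied by the training settings above.
# Assumption: training runs on a single GPU (num_devices = 1).
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1  # hypothetical; adjust for a multi-GPU setup

# Gradients are accumulated over several small forward/backward passes
# before each optimizer step, so the optimizer sees this many samples per step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 8
```

Note that only the per-device batch size (2 here) is processed in a single forward/backward pass, so that value, rather than the accumulation steps, is what mainly drives peak activation memory per step.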

Running the nvidia-smi command yields the output below:

[Screenshot: nvidia-smi output]

We tried expanding the cluster memory to 32 GB, and the current cluster configuration is:

Workers: 1-2 (32-64 GB memory, 8-16 cores)
Driver: 1 (32 GB memory, 8 cores)
Runtime: 13.1.x-gpu-ml-scala2.12

However, the memory allocated to the GPU is still only ~16 GB.
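
One way to double-check the GPU's own memory from inside the notebook is to query nvidia-smi programmatically. The sketch below assumes nvidia-smi is on the PATH of the driver node; the helper names and the sample row are hypothetical and only illustrate the parsing. As far as I understand, the figure reported here is the memory on the GPU card itself, which is separate from the 32 GB of cluster RAM.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'total, used' rows (in MiB) from nvidia-smi CSV output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        total_mib, used_mib = (int(field.strip()) for field in line.split(","))
        gpus.append({"total_mib": total_mib, "used_mib": used_mib})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory totals (requires a GPU node)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.total,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


# Hypothetical sample row, used here only to show the parsed structure.
sample_output = "15360, 13500\n"
print(parse_gpu_memory(sample_output))
```

On a GPU node, calling `query_gpu_memory()` would return one entry per visible GPU.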

Because of this, training fails with the error below:
OutOfMemoryError: CUDA out of memory. Tried to allocate 530.00 MiB (GPU 0; 14.76 GiB total capacity; 12.87 GiB already allocated; 411.75 MiB free; 13.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
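
To make the numbers in this message easier to read, here is a rough sketch that parses them and computes the gap between reserved and allocated memory, which is what the fragmentation hint in the error refers to. The helper names are hypothetical, and the `max_split_size_mb:128` value at the end is only an illustrative assumption, not a recommended setting.

```python
import os
import re

error_message = (
    "CUDA out of memory. Tried to allocate 530.00 MiB (GPU 0; 14.76 GiB total "
    "capacity; 12.87 GiB already allocated; 411.75 MiB free; 13.26 GiB reserved "
    "in total by PyTorch)"
)


def to_gib(value, unit):
    """Convert a value in MiB or GiB to GiB."""
    return float(value) if unit == "GiB" else float(value) / 1024


def parse_oom(message):
    """Extract the memory figures from a PyTorch CUDA OOM message, in GiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    return {
        name: to_gib(*re.search(pattern, message).groups())
        for name, pattern in patterns.items()
    }


stats = parse_oom(error_message)
# Memory PyTorch has reserved but not handed out to tensors.
slack_gib = stats["reserved"] - stats["allocated"]
print(round(slack_gib, 2))  # 0.39

# The allocator option named in the error is read from this environment
# variable; it has to be set before PyTorch initializes CUDA in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

Here reserved memory is only about 0.39 GiB above allocated memory, so fragmentation may not be the main cause in this case; nearly all of the 14.76 GiB appears to be genuinely in use.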

I also tried reducing the batch size to 1, but the same error persists.
How can I ensure that more memory is available to CUDA and the process when training through a notebook?
