<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to allocate more memory to GPU when training through databricks notebook in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/how-to-allocate-more-memory-to-gpu-when-training-through/m-p/45614#M2341</link>
    <description>&lt;P&gt;Training &lt;STRONG&gt;facebook/hubert-base-ls960&lt;/STRONG&gt; on a custom speech dataset in a Databricks notebook fails with a CUDA out-of-memory error on a Tesla T4; expanding the cluster memory to 32&amp;nbsp;GB did not increase the memory available to the GPU. How can more memory be made available to CUDA when training through a notebook?&lt;/P&gt;</description>
    <pubDate>Fri, 22 Sep 2023 07:21:13 GMT</pubDate>
    <dc:creator>varun-adi</dc:creator>
    <dc:date>2023-09-22T07:21:13Z</dc:date>
    <item>
      <title>How to allocate more memory to GPU when training through databricks notebook</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-allocate-more-memory-to-gpu-when-training-through/m-p/45614#M2341</link>
      <description>&lt;P&gt;I am trying to train a HuBERT model, specifically&amp;nbsp;&lt;STRONG&gt;facebook/hubert-base-ls960&lt;/STRONG&gt;, on a custom speech dataset.&lt;/P&gt;&lt;P&gt;The training parameters are:&lt;/P&gt;&lt;PRE&gt;trainer_config = {
  "OUTPUT_DIR": "results",
  "TRAIN_EPOCHS": 6,
  "TRAIN_BATCH_SIZE": 2,
  "EVAL_BATCH_SIZE": 2,
  "GRADIENT_ACCUMULATION_STEPS": 4,
  "WARMUP_STEPS": 500,
  "DECAY": 0.01,
  "INITIAL_LOGGING_STEPS": 10,  # Smaller value for initial logging
  "LOGGING_STEPS": 100,  # Larger value for subsequent logging
  "MODEL_DIR": "/dbfs/FileStore/wav-files/personalityDataset/Augmented-HubertModel7Epochs",
  "SAVE_STEPS": 100
}

training_args = TrainingArguments(
    output_dir=trainer_config["OUTPUT_DIR"],
    gradient_accumulation_steps=trainer_config["GRADIENT_ACCUMULATION_STEPS"],
    num_train_epochs=trainer_config["TRAIN_EPOCHS"],
    per_device_train_batch_size=trainer_config["TRAIN_BATCH_SIZE"],
    per_device_eval_batch_size=trainer_config["EVAL_BATCH_SIZE"],
    warmup_steps=trainer_config["WARMUP_STEPS"],
    save_steps=trainer_config["SAVE_STEPS"],
    weight_decay=trainer_config["DECAY"],
    evaluation_strategy="epoch",  # Report metrics at the end of each epoch
    logging_steps=trainer_config["INITIAL_LOGGING_STEPS"],  # Initial logging frequency
    fp16=True  # Enable mixed-precision training
)&lt;/PRE&gt;&lt;P&gt;Running the&amp;nbsp;nvidia-smi command yields the output below:&lt;/P&gt;&lt;PRE&gt;+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01    Driver Version: 470.103.01    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+&lt;/PRE&gt;&lt;P&gt;We tried expanding the cluster memory to 32&amp;nbsp;GB; the current cluster configuration is:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;1-2 Workers: 32-64&amp;nbsp;GB Memory, 8-16&amp;nbsp;Cores&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;1 Driver: 32&amp;nbsp;GB Memory,&amp;nbsp;8&amp;nbsp;Cores&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;Runtime: 13.1.x-gpu-ml-scala2.12&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;However, the memory allocated to the GPU is still only ~16&amp;nbsp;GB.&lt;/P&gt;&lt;P&gt;Because of this, training fails with the following error:&lt;BR /&gt;&lt;EM&gt;OutOfMemoryError: CUDA out of memory. Tried to allocate 530.00 MiB (GPU 0; 14.76 GiB total capacity; 12.87 GiB already allocated; 411.75 MiB free; 13.26 GiB reserved in total by PyTorch) If reserved memory is &amp;gt;&amp;gt; allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I also tried reducing the batch size to 1, but the same error persists.&lt;BR /&gt;How can I ensure that more memory is available to CUDA and the training process when running through a notebook?&lt;/P&gt;</description>
      <pubDate>Fri, 22 Sep 2023 07:21:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-allocate-more-memory-to-gpu-when-training-through/m-p/45614#M2341</guid>
      <dc:creator>varun-adi</dc:creator>
      <dc:date>2023-09-22T07:21:13Z</dc:date>
    </item>
  </channel>
</rss>

