How to allocate more memory to the GPU when training through a Databricks notebook

varun-adi
New Contributor

I am trying to train a HuBERT model, specifically facebook/hubert-base-ls960, on a custom speech dataset.

Training parameters are below:

trainer_config = {
  "OUTPUT_DIR": "results",
  "TRAIN_EPOCHS": 6,
  "TRAIN_BATCH_SIZE": 2,
  "EVAL_BATCH_SIZE": 2,
  "GRADIENT_ACCUMULATION_STEPS": 4,
  "WARMUP_STEPS": 500,
  "DECAY": 0.01,
  "INITIAL_LOGGING_STEPS": 10,  # Smaller value for initial logging
  "LOGGING_STEPS": 100,  # Larger value for subsequent logging
  "MODEL_DIR": "/dbfs/FileStore/wav-files/personalityDataset/Augmented-HubertModel7Epochs",
  "SAVE_STEPS": 100
}

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=trainer_config["OUTPUT_DIR"],
    gradient_accumulation_steps=trainer_config["GRADIENT_ACCUMULATION_STEPS"],
    num_train_epochs=trainer_config["TRAIN_EPOCHS"],
    per_device_train_batch_size=trainer_config["TRAIN_BATCH_SIZE"],
    per_device_eval_batch_size=trainer_config["EVAL_BATCH_SIZE"],
    warmup_steps=trainer_config["WARMUP_STEPS"],
    save_steps=trainer_config["SAVE_STEPS"],
    weight_decay=trainer_config["DECAY"],
    evaluation_strategy="epoch",  # Report metrics at the end of each epoch
    logging_steps=trainer_config["INITIAL_LOGGING_STEPS"],  # Initial logging frequency
    fp16=True  # Enable mixed-precision training
)


Running the nvidia-smi command yields the output below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We tried expanding the cluster memory to 32 GB; the current cluster configuration is:

1-2 Workers: 32-64 GB Memory, 8-16 Cores
1 Driver: 32 GB Memory, 8 Cores
Runtime: 13.1.x-gpu-ml-scala2.12

However, the memory available to the GPU is still only ~16 GB.

Due to this, training fails with the error below:
OutOfMemoryError: CUDA out of memory. Tried to allocate 530.00 MiB (GPU 0; 14.76 GiB total capacity; 12.87 GiB already allocated; 411.75 MiB free; 13.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
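
(If I'm reading the hint at the end of the error correctly, max_split_size_mb is set through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation; something like the snippet below, where 128 is just an illustrative value, not a tuned one.)

import os

# Must run before the first CUDA allocation (e.g. at the top of the notebook,
# before the model is moved to the GPU). 128 MB is an illustrative value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"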

I also tried reducing the batch size to 1, but the same error persists.
How can I ensure that more memory is available to CUDA and to the training process when running from the notebook?

1 REPLY

Kaniz
Community Manager

Hi @varun-adi, based on the provided information, there are several ways to make more memory available to CUDA and to the training process when running from a notebook:

1. **Tune the batch size**: You've already tried reducing the batch size to 1, but the error persists. Tune the batch size together with gradient accumulation so the GPU is fully utilized without running into CUDA out-of-memory errors (a minimal sketch follows this list).

2. **Tune parallelism with stage-level scheduling**: By default, Spark schedules one task per GPU on each machine. You can increase parallelism by telling Spark how many tasks to run per GPU (see the stage-level scheduling sketch after this list).

3. **Repartition data to use all available hardware**: Make full use of the hardware in your cluster by repartitioning your data. You can repartition a DataFrame using repartitioned_df = df.repartition(desired_partition_count).

4. **Cache the model**: If you frequently load the model from different or restarted clusters, you may wish to cache the Hugging Face model on DBFS to save model load time or ingress costs (see the caching sketch after this list).
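
For point 1, here is a minimal sketch of trading per-device batch size for gradient accumulation, so the effective batch size stays at 8 while each step holds fewer activations in GPU memory; the numbers are illustrative, not tuned:

from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
# = 1 * 8 = 8, the same as the original 2 * 4, but each forward/backward pass
# keeps fewer activations on the GPU.
training_args = TrainingArguments(
    output_dir="results",
    per_device_train_batch_size=1,   # smallest per-step memory footprint
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,   # keeps the effective batch size at 8
    fp16=True,                       # mixed precision roughly halves activation memory
    num_train_epochs=6,
)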
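
For point 2, a sketch of stage-level scheduling with PySpark's ResourceProfile API (Spark 3.1+); the 0.5 GPU-per-task value is an assumption meaning two tasks share one GPU, and df stands in for whatever DataFrame feeds the GPU-heavy stage:

from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

# Request 0.5 GPU per task so two tasks can be scheduled on each GPU.
task_requests = TaskResourceRequests().resource("gpu", 0.5)
resource_profile = ResourceProfileBuilder().require(task_requests).build

# Attach the profile to the RDD backing the DataFrame for the GPU-heavy stage.
rdd_with_profile = df.rdd.withResources(resource_profile)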
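
For point 4, one way to cache the model on DBFS is to point the Transformers cache at a DBFS path before importing transformers; the path below is just an example location:

import os

# Example DBFS location; any writable DBFS path works. Set this before importing
# transformers so the cache directory is picked up.
os.environ["TRANSFORMERS_CACHE"] = "/dbfs/FileStore/hf_cache"

from transformers import HubertModel

# Later loads on restarted clusters reuse the cached download instead of
# re-fetching the weights from the Hugging Face Hub.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")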

Remember to monitor GPU performance by viewing the live cluster metrics and choosing a metric such as gpu0-util for GPU processor utilization or gpu0_mem_util for GPU memory utilization.
