Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

How to allocate more memory to GPU when training through databricks notebook

varun-adi
New Contributor

I am trying to train a HuBERT model, specifically facebook/hubert-base-ls960, on a custom speech dataset.

The training parameters are below:

trainer_config = {
  "OUTPUT_DIR": "results",
  "TRAIN_EPOCHS": 6,
  "TRAIN_BATCH_SIZE": 2,
  "EVAL_BATCH_SIZE": 2,
  "GRADIENT_ACCUMULATION_STEPS": 4,
  "WARMUP_STEPS": 500,
  "DECAY": 0.01,
  "INITIAL_LOGGING_STEPS": 10,  # Smaller value for initial logging
  "LOGGING_STEPS": 100,  # Larger value for subsequent logging
  "MODEL_DIR": "/dbfs/FileStore/wav-files/personalityDataset/Augmented-HubertModel7Epochs",
  "SAVE_STEPS": 100
}

training_args = TrainingArguments(
    output_dir=trainer_config["OUTPUT_DIR"],
    gradient_accumulation_steps=trainer_config["GRADIENT_ACCUMULATION_STEPS"],
    num_train_epochs=trainer_config["TRAIN_EPOCHS"],
    per_device_train_batch_size=trainer_config["TRAIN_BATCH_SIZE"],
    per_device_eval_batch_size=trainer_config["EVAL_BATCH_SIZE"],
    warmup_steps=trainer_config["WARMUP_STEPS"],
    save_steps=trainer_config["SAVE_STEPS"],
    weight_decay=trainer_config["DECAY"],
    evaluation_strategy="epoch",  # Report metrics at the end of each epoch
    logging_steps=trainer_config["INITIAL_LOGGING_STEPS"],  # Initial logging frequency
    fp16=True  # Enable mixed-precision training
)
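
For context, a rough sketch of how training is then launched with these arguments; the model and dataset variables here are placeholders for objects built earlier in the notebook, not the exact code:

from transformers import Trainer

trainer = Trainer(
    model=model,                    # facebook/hubert-base-ls960 with a task head (placeholder)
    args=training_args,
    train_dataset=train_dataset,    # custom speech dataset (placeholder name)
    eval_dataset=eval_dataset,
)
trainer.train()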


Running the nvidia-smi command yields the output below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01    Driver Version: 470.103.01    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We tried expanding the cluster memory to 32 GB; the current cluster configuration is:

Workers: 1-2 (32-64 GB memory, 8-16 cores)
Driver: 1 (32 GB memory, 8 cores)
Runtime: 13.1.x-gpu-ml-scala2.12

However, the memory available to the GPU is still only ~16 GB.
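
For reference, the same number can be confirmed from inside the notebook with standard PyTorch calls (a minimal sketch, nothing Databricks-specific):

import torch

# Report what the GPU exposes to PyTorch.
props = torch.cuda.get_device_properties(0)
print(f"Device: {props.name}")
print(f"Total GPU memory: {props.total_memory / 1024**3:.2f} GiB")
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
print(f"Reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")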

Because of this, training fails with the error below:
OutOfMemoryError: CUDA out of memory. Tried to allocate 530.00 MiB (GPU 0; 14.76 GiB total capacity; 12.87 GiB already allocated; 411.75 MiB free; 13.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I also tried reducing the batch size to 1, but the same error persists.
How can I ensure that more memory is available to CUDA and to the training process when running through a notebook?
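
The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to reduce fragmentation; a minimal sketch of what that would look like, with 128 MB as an example value only:

import os

# Must be set before the first CUDA allocation, i.e. at the top of the notebook,
# before the model is moved to the GPU. 128 is an example value, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"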

0 REPLIES
