How to allocate more memory to the GPU when training through a Databricks notebook

varun-adi
New Contributor

I am trying to train a HuBERT model, specifically facebook/hubert-base-ls960, on a custom speech dataset.

Training parameters are below:

trainer_config = {
  "OUTPUT_DIR": "results",
  "TRAIN_EPOCHS": 6,
  "TRAIN_BATCH_SIZE": 2,
  "EVAL_BATCH_SIZE": 2,
  "GRADIENT_ACCUMULATION_STEPS": 4,
  "WARMUP_STEPS": 500,
  "DECAY": 0.01,
  "INITIAL_LOGGING_STEPS": 10,  # Smaller value for initial logging
  "LOGGING_STEPS": 100,  # Larger value for subsequent logging
  "MODEL_DIR": "/dbfs/FileStore/wav-files/personalityDataset/Augmented-HubertModel7Epochs",
  "SAVE_STEPS": 100
}

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=trainer_config["OUTPUT_DIR"],
    gradient_accumulation_steps=trainer_config["GRADIENT_ACCUMULATION_STEPS"],
    num_train_epochs=trainer_config["TRAIN_EPOCHS"],
    per_device_train_batch_size=trainer_config["TRAIN_BATCH_SIZE"],
    per_device_eval_batch_size=trainer_config["EVAL_BATCH_SIZE"],
    warmup_steps=trainer_config["WARMUP_STEPS"],
    save_steps=trainer_config["SAVE_STEPS"],
    weight_decay=trainer_config["DECAY"],
    evaluation_strategy="epoch",  # Report metrics at the end of each epoch
    logging_steps=trainer_config["INITIAL_LOGGING_STEPS"],  # Initial logging frequency
    fp16=True  # Enable mixed-precision training
)


Running the nvidia-smi command yields the output below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We tried expanding the cluster memory to 32 GB; the current cluster configuration is:

1-2 Workers: 32-64 GB Memory, 8-16 Cores
1 Driver: 32 GB Memory, 8 Cores
Runtime: 13.1.x-gpu-ml-scala2.12

However, the memory available to the GPU is still only ~16 GB.

Due to this, training fails with the error below:
OutOfMemoryError: CUDA out of memory. Tried to allocate 530.00 MiB (GPU 0; 14.76 GiB total capacity; 12.87 GiB already allocated; 411.75 MiB free; 13.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
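
(If I'm reading the hint at the end of the error correctly, max_split_size_mb is set through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation; something like the snippet below, where 128 is just an illustrative value, not a tuned one.)

import os

# Must run before the first CUDA allocation (e.g. at the top of the notebook,
# before the model is moved to the GPU). 128 MB is an illustrative value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"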

I also tried reducing the batch size to 1, but the same error persists.
How can I ensure that more memory is available to CUDA and to the training process when running from the notebook?

1 REPLY

Kaniz
Community Manager

Hi @varun-adi, based on the provided information, there are several ways to make more memory available to CUDA and to the training process when running from a notebook:

1. **Tune the batch size**: You've already tried reducing the batch size to 1, but the error persists. Tune the batch size together with gradient accumulation so the GPU is fully utilized without running into CUDA out-of-memory errors (a minimal sketch follows this list).

2. **Tune parallelism with stage-level scheduling**: By default, Spark schedules one task per GPU on each machine. You can increase parallelism by telling Spark how many tasks to run per GPU (see the stage-level scheduling sketch after this list).

3. **Repartition data to use all available hardware**: Make full use of the hardware in your cluster by repartitioning your data. You can repartition a DataFrame using repartitioned_df = df.repartition(desired_partition_count).

4. **Cache the model**: If you frequently load the model from different or restarted clusters, you may wish to cache the Hugging Face model on DBFS to save model load time or ingress costs (see the caching sketch after this list).
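
For point 1, here is a minimal sketch of trading per-device batch size for gradient accumulation, so the effective batch size stays at 8 while each step holds fewer activations in GPU memory; the numbers are illustrative, not tuned:

from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
# = 1 * 8 = 8, the same as the original 2 * 4, but each forward/backward pass
# keeps fewer activations on the GPU.
training_args = TrainingArguments(
    output_dir="results",
    per_device_train_batch_size=1,   # smallest per-step memory footprint
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,   # keeps the effective batch size at 8
    fp16=True,                       # mixed precision roughly halves activation memory
    num_train_epochs=6,
)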
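
For point 2, a sketch of stage-level scheduling with PySpark's ResourceProfile API (Spark 3.1+); the 0.5 GPU-per-task value is an assumption meaning two tasks share one GPU, and df stands in for whatever DataFrame feeds the GPU-heavy stage:

from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

# Request 0.5 GPU per task so two tasks can be scheduled on each GPU.
task_requests = TaskResourceRequests().resource("gpu", 0.5)
resource_profile = ResourceProfileBuilder().require(task_requests).build

# Attach the profile to the RDD backing the DataFrame for the GPU-heavy stage.
rdd_with_profile = df.rdd.withResources(resource_profile)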
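
For point 4, one way to cache the model on DBFS is to point the Transformers cache at a DBFS path before importing transformers; the path below is just an example location:

import os

# Example DBFS location; any writable DBFS path works. Set this before importing
# transformers so the cache directory is picked up.
os.environ["TRANSFORMERS_CACHE"] = "/dbfs/FileStore/hf_cache"

from transformers import HubertModel

# Later loads on restarted clusters reuse the cached download instead of
# re-fetching the weights from the Hugging Face Hub.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")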

Remember to monitor GPU performance by viewing the live cluster metrics and choosing a metric such as gpu0-util for GPU processor utilization or gpu0_mem_util for GPU memory utilization.
