GenAI Insight Hub

How can I utilize multiple GPUs from multiple nodes in Databricks

New Contributor

I am currently experimenting with the Whisper model for batchwise inference on Databricks and have successfully run multiple instances of the model on the multiple GPUs available in the driver node. However, I am unable to access the GPUs on the worker nodes, and I am wondering how I can leverage them. I have come across documentation on utilizing all worker nodes with PySpark-based libraries, but I am specifically interested in how to achieve this with a transformer model like Whisper. Any insights or suggestions would be greatly appreciated.


Community Manager

Hi @Mathew, Whisper normally runs on a single GPU, but there is a workaround that splits one model across two GPUs on the same machine: one for the encoder and another for the decoder. Note that this mainly spreads memory use across the two GPUs of a single node; it does not by itself reach GPUs on other worker nodes. Here's how you can set it up:

  1. Update the Whisper Package: First, ensure that you have the latest commit of the Whisper package. You can update it using the following command:

    pip install --upgrade --no-deps --force-reinstall git+[5](
  2. Load the Model and Distribute It Across GPUs: In your Python code, load the Whisper model (e.g., “large”) on the CPU, then place the encoder and decoder on separate GPUs as follows:

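    import whisper
    # Load the model (initially on CPU)
    model = whisper.load_model("large", device="cpu")
    # Move the encoder to the first GPU (cuda:0)
    model.encoder.to("cuda:0")
    # Move the decoder to the second GPU (cuda:1)
    model.decoder.to("cuda:1")
    # Register hooks to manage data flow between GPUs
    model.decoder.register_forward_pre_hook(
        lambda _, inputs: tuple([inputs[0].to("cuda:1"), inputs[1].to("cuda:1")] + list(inputs[2:]))
    )
    model.decoder.register_forward_hook(lambda _, inputs, outputs: outputs.to("cuda:0"))
    # Perform inference (e.g., transcribe an audio file)
    # NOTE: "audio.wav" is a placeholder path; substitute your own file
    result = model.transcribe("audio.wav")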

    The code above uses register_forward_pre_hook to move the decoder’s inputs to the second GPU (“cuda:1”) and register_forward_hook to move its outputs back to the first GPU (“cuda:0”). The latter is not strictly necessary but serves as a workaround because the decoding logic assumes the outputs are on the same device as the encoder.

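  3. VRAM Usage: After executing the snippet above, check the VRAM usage on your 2-GPU machine. Memory should now be split between the two GPUs, with the encoder's weights on the first and the decoder's on the second, rather than concentrated on one device.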
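
If you would like to do the check in step 3 from a notebook cell rather than by eye, here is a minimal sketch. It is not part of the original workaround: the helper names parse_gpu_memory and query_gpu_memory are my own, and it assumes nvidia-smi is on the PATH of the node you run it on and reports plain numeric memory values (in MiB).

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse rows of 'index, used, total' (MiB) into a list of dicts.

    Expects the output format of:
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
    """
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"gpu": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def query_gpu_memory():
    """Run nvidia-smi on the current node and return per-GPU memory usage.

    Hypothetical helper: requires an NVIDIA driver and nvidia-smi on PATH.
    """
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_memory(output)
```

Calling query_gpu_memory() after loading the model should return one entry per GPU. With the encoder/decoder split in place, you would expect both cuda:0 and cuda:1 to show non-trivial used_mib values. Keep in mind that it only reports the GPUs of the node it runs on (the driver, in a plain notebook cell).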
