Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

How can I utilize multiple GPUs from multiple nodes in Databricks

Mathew
New Contributor

I am currently experimenting with the Whisper model for batch inference on Databricks and have successfully run multiple instances of the model across the GPUs available on the driver node. However, I would also like to leverage the GPUs on each of the worker nodes, which I am currently unable to access. I have come across documentation on using all worker nodes with PySpark-based libraries, but I am specifically interested in how to achieve this with a transformer model like Whisper. Any insights or suggestions would be greatly appreciated.

1 REPLY

Kaniz_Fatma
Community Manager

Hi @Mathew, leveraging multiple GPUs for batch inference with the Whisper model on Databricks can significantly enhance performance. While the Whisper model typically runs on a single GPU, there is a workaround to utilize two GPUs on one machine: one for the encoder and another for the decoder. Here's how you can achieve this:

  1. Update the Whisper Package: First, ensure that you have the latest commit of the Whisper package. You can update it using the following command:

    pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
    
  2. Load the Model and Distribute GPUs: In your Python code, load the Whisper model (e.g., “large”) and distribute the GPUs as follows:

    import whisper
    
    # Load the model (initially on CPU)
    model = whisper.load_model("large", device="cpu")
    
    # Move the encoder to the first GPU (cuda:0)
    model.encoder.to("cuda:0")
    
    # Move the decoder to the second GPU (cuda:1)
    model.decoder.to("cuda:1")
    
    # Register hooks to manage data flow between GPUs
    model.decoder.register_forward_pre_hook(
        lambda _, inputs: tuple([inputs[0].to("cuda:1"), inputs[1].to("cuda:1")] + list(inputs[2:]))
    )
    model.decoder.register_forward_hook(
        lambda _, inputs, outputs: outputs.to("cuda:0")
    )
    
    # Perform inference (e.g., transcribe an audio file)
    model.transcribe("jfk.flac")
    

    The code above uses register_forward_pre_hook to move the decoder's inputs to the second GPU ("cuda:1") and register_forward_hook to move the results back to the first GPU ("cuda:0"). The latter is not strictly necessary, but it works around the fact that the decoding logic assumes the outputs are on the same device as the encoder. A minimal, Whisper-independent illustration of this hook pattern is sketched after this list.

  3. VRAM Usage: After running the snippet above, check the VRAM usage on your 2-GPU machine; you should see the memory split across both devices, with the encoder's weights on cuda:0 and the decoder's on cuda:1 (a quick way to check is sketched below).
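
If the hook mechanics are unclear, here is a minimal illustration of the same pattern on a toy PyTorch module rather than Whisper (it assumes a machine with at least two CUDA devices): a forward pre-hook moves a module's inputs onto its device before it runs, and a forward hook moves the output back afterwards.

    import torch
    import torch.nn as nn
    
    # Two layers deliberately placed on different GPUs
    first = nn.Linear(8, 8).to("cuda:0")
    second = nn.Linear(8, 8).to("cuda:1")
    
    # Before `second` runs, move its inputs onto cuda:1
    second.register_forward_pre_hook(
        lambda module, inputs: tuple(t.to("cuda:1") for t in inputs)
    )
    
    # After `second` runs, move its output back to cuda:0 for downstream code
    second.register_forward_hook(
        lambda module, inputs, output: output.to("cuda:0")
    )
    
    x = torch.randn(4, 8, device="cuda:0")
    y = second(first(x))
    print(y.device)  # cuda:0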
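
To confirm the split, you can print the per-device memory that PyTorch has allocated. The helper below is an illustrative sketch using the standard torch.cuda memory APIs; it is not something Whisper or Databricks provides.

    import torch
    
    # Illustrative helper (not part of Whisper): report how much memory
    # PyTorch has allocated and reserved on each visible GPU.
    def print_vram_usage():
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            reserved = torch.cuda.memory_reserved(i) / 1024**3
            print(f"cuda:{i}: {allocated:.2f} GiB allocated, {reserved:.2f} GiB reserved")
    
    print_vram_usage()

Alternatively, running nvidia-smi on the driver node shows overall GPU memory use and utilization while the transcription is in progress.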

 
