Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
Lanz
Databricks Employee

When running distributed training or batch inference on multi-node GPU clusters with Spark, the GPUs on the Driver node often remain underutilized, wasting GPU resources. The figures below illustrate this issue:

Fig.1: Only one GPU in the Driver node is being utilized

Fig.2: All GPUs in the Worker node are utilized.

Solution: Heterogeneous Compute Types for Driver and Worker Nodes

To address this problem, you can select different compute types for the Driver and Worker nodes. For example, you might choose a CPU instance for the Driver node and GPU instances for the Worker nodes.

Currently, the cluster UI does not support heterogeneous compute types. However, you can create such a cluster using the following API command:

 

databricks clusters create --json '{
  "cluster_name": "xxxx",
  "spark_version": "14.3.x-gpu-ml-scala2.12",
  "node_type_id": "g4dn.12xlarge",
  "driver_node_type_id": "i3.xlarge",
  "autoscale": { "min_workers": 1, "max_workers": 2 },
  "aws_attributes": { "first_on_demand": 3 }
}'
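If you would rather do this from a script than the CLI, the same payload can be submitted to the Clusters API (`POST /api/2.1/clusters/create`). The sketch below is an illustration, not part of the original post: the `DATABRICKS_HOST` / `DATABRICKS_TOKEN` environment variables, the `requests` call, and the `create_cluster` helper are assumptions, and `"xxxx"` remains a placeholder cluster name.

```python
# Sketch: build the heterogeneous cluster spec (CPU driver, GPU workers)
# and submit it via the Databricks Clusters REST API.
import json
import os

cluster_spec = {
    "cluster_name": "xxxx",                       # placeholder, as in the CLI example
    "spark_version": "14.3.x-gpu-ml-scala2.12",
    "node_type_id": "g4dn.12xlarge",              # GPU instances for the Workers
    "driver_node_type_id": "i3.xlarge",           # CPU-only instance for the Driver
    "autoscale": {"min_workers": 1, "max_workers": 2},
    "aws_attributes": {"first_on_demand": 3},
}

# The key point: the Driver node type differs from the Worker node type.
assert cluster_spec["driver_node_type_id"] != cluster_spec["node_type_id"]

def create_cluster(spec):
    """Submit the spec to a workspace; assumes `requests` is installed and
    DATABRICKS_HOST / DATABRICKS_TOKEN are set (hypothetical setup)."""
    import requests  # deferred so the spec above can be built without it
    resp = requests.post(
        f"{os.environ['DATABRICKS_HOST']}/api/2.1/clusters/create",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        json=spec,
    )
    resp.raise_for_status()
    return resp.json()["cluster_id"]

print(json.dumps(cluster_spec, indent=2))
```

Because the spec is plain JSON, the same dictionary works with either the CLI's `--json` flag or the REST endpoint.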


#automl #modeltraining #mosaicai