When running distributed training or batch inference on a multi-node GPU cluster with Spark, the GPUs on the driver node often sit underutilized, wasting GPU resources. The figures below illustrate the issue:
Fig. 1: Only one GPU on the driver node is utilized.
Fig. 2: All GPUs on the worker nodes are utilized.
Solution: Heterogeneous Compute Types for Driver and Worker Nodes
To address this problem, you can select different compute types for the driver and worker nodes. For example, choose a CPU instance for the driver node and GPU instances for the worker nodes, so every provisioned GPU does actual work.
Currently, the cluster creation UI does not support mixing compute types, but you can create such a cluster with the Databricks CLI:
databricks clusters create --json '{
"cluster_name": "xxxx",
"spark_version": "14.3.x-gpu-ml-scala2.12",
"node_type_id": "g4dn.12xlarge",
"driver_node_type_id": "i3.xlarge",
"autoscale" : { "min_workers": 1, "max_workers": 2 },
"aws_attributes" : {"first_on_demand": 3}
}'
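The same request can also be sent directly to the Clusters REST API. Below is a minimal Python sketch (standard library only) that builds the same cluster spec and posts it to the /api/2.1/clusters/create endpoint. The cluster name, workspace host, and environment-variable names are illustrative assumptions; substitute your own values.

```python
import json
import os
import urllib.request

# Assumed environment variables for workspace URL and access token.
host = os.environ.get("DATABRICKS_HOST", "https://example.cloud.databricks.com")
token = os.environ.get("DATABRICKS_TOKEN", "")

# Same spec as the CLI example: CPU driver, GPU workers.
payload = {
    "cluster_name": "heterogeneous-gpu-cluster",      # illustrative name
    "spark_version": "14.3.x-gpu-ml-scala2.12",
    "node_type_id": "g4dn.12xlarge",                  # workers: GPU instance
    "driver_node_type_id": "i3.xlarge",               # driver: CPU instance
    "autoscale": {"min_workers": 1, "max_workers": 2},
    "aws_attributes": {"first_on_demand": 3},
}

req = urllib.request.Request(
    url=f"{host}/api/2.1/clusters/create",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Only send the request when a real token is configured.
if token:
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```

On success the API responds with the new cluster's ID; the key point for this article is that "driver_node_type_id" and "node_type_id" are independent fields, which is what makes the heterogeneous driver/worker layout possible.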
#automl #modeltraining #mosaicai