Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
Lanz
Databricks Employee

When running distributed training or batch inference on multi-node GPU clusters with Spark, the GPUs on the Driver node often remain underutilized, wasting GPU resources. The figures below illustrate this issue:

Fig.1: Only one GPU in the Driver node is being utilized

Fig.2: All GPUs in the Worker node are utilized.

Solution: Heterogeneous Compute Types for Driver and Worker Nodes

To address this problem, you can select different compute types for the Driver and Worker nodes. For example, you might choose a CPU instance for the Driver node and GPU instances for the Worker nodes.

Currently, the cluster UI does not support heterogeneous compute types. However, you can create such a cluster using the following API command:

 

databricks clusters create --json '{
  "cluster_name": "xxxx",
  "spark_version": "14.3.x-gpu-ml-scala2.12",
  "node_type_id": "g4dn.12xlarge",
  "driver_node_type_id": "i3.xlarge",
  "autoscale": { "min_workers": 1, "max_workers": 2 },
  "aws_attributes": { "first_on_demand": 3 }
}'
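If you would rather do this from a script than the CLI, the same payload can be submitted to the Clusters API (`POST /api/2.1/clusters/create`). The sketch below is an illustration, not part of the original post: the `DATABRICKS_HOST` / `DATABRICKS_TOKEN` environment variables, the `requests` call, and the `create_cluster` helper are assumptions, and `"xxxx"` remains a placeholder cluster name.

```python
# Sketch: build the heterogeneous cluster spec (CPU driver, GPU workers)
# and submit it via the Databricks Clusters REST API.
import json
import os

cluster_spec = {
    "cluster_name": "xxxx",                       # placeholder, as in the CLI example
    "spark_version": "14.3.x-gpu-ml-scala2.12",
    "node_type_id": "g4dn.12xlarge",              # GPU instances for the Workers
    "driver_node_type_id": "i3.xlarge",           # CPU-only instance for the Driver
    "autoscale": {"min_workers": 1, "max_workers": 2},
    "aws_attributes": {"first_on_demand": 3},
}

# The key point: the Driver node type differs from the Worker node type.
assert cluster_spec["driver_node_type_id"] != cluster_spec["node_type_id"]

def create_cluster(spec):
    """Submit the spec to a workspace; assumes `requests` is installed and
    DATABRICKS_HOST / DATABRICKS_TOKEN are set (hypothetical setup)."""
    import requests  # deferred so the spec above can be built without it
    resp = requests.post(
        f"{os.environ['DATABRICKS_HOST']}/api/2.1/clusters/create",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        json=spec,
    )
    resp.raise_for_status()
    return resp.json()["cluster_id"]

print(json.dumps(cluster_spec, indent=2))
```

Because the spec is plain JSON, the same dictionary works with either the CLI's `--json` flag or the REST endpoint.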


#automl #modeltraining #mosaicai