Setup:
- custom Docker container built from the `databricksruntime/gpu-conda:cuda11` base image
- 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
- multi-node, p3.8xlarge GPU compute
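For reference, the image is built roughly like this (a minimal sketch; the `FROM` line matches my setup, everything after it is only illustrative of the kind of additions):

```dockerfile
# Minimal sketch of the image; only the FROM line is my actual setup.
FROM databricksruntime/gpu-conda:cuda11

# (illustrative) project-specific GPU/conda dependencies are layered on here.
```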
When I try to view Ganglia metrics I am met with a "502 Bad Gateway" error.
Even after the cluster has been running for ~1 hour, there are no Ganglia logs at all.
As a sanity check I booted another cluster without a custom Docker container (using 11.3 LTS ML, which includes Apache Spark 3.3.0, GPU, Scala 2.12), and the Ganglia metrics work fine there.
Are there any limitations with Ganglia metrics and custom docker containers?
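One guess is that the custom image simply doesn't ship the Ganglia daemons. Here is a quick check that could be run from the cluster's web terminal or an init script (a sketch: `gmond` and `gmetad` are the standard Ganglia daemon names, and their absence being the cause is only my assumption):

```shell
# Check whether the standard Ganglia daemons exist in the image.
# If neither binary resolves, a 502 from the metrics UI would make sense.
which gmond gmetad || echo "Ganglia binaries not found in image"

# Also check whether anything Ganglia-related is actually running.
ps aux | grep -i '[g]anglia' || echo "no Ganglia processes running"
```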
Also, when using a custom Docker container I am forced onto the standard runtime (10.4 LTS), since the Machine Learning runtimes do not support custom containers (see https://docs.databricks.com/clusters/custom-containers.html#requirements).
I suspect this could also be a source of the issue. Does the ML runtime provide libraries that Ganglia needs in order to work on GPU compute?