Ganglia not working with custom container services
01-10-2023 06:04 PM
Setup:
- custom docker container starting from the "databricksruntime/gpu-conda:cuda11" base image layer
- 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
- multi-node, p3.8xlarge GPU compute
When I try to view Ganglia metrics I am met with a "502 Bad Gateway" error. Even after ~1 hour of the cluster running, there are no Ganglia logs at all.
As a sanity check I launched another cluster without a custom Docker container (using 11.3 LTS ML (includes Apache Spark 3.3.0, GPU, Scala 2.12)) and the Ganglia metrics work fine.
Are there any limitations with Ganglia metrics and custom docker containers?
Also, when using the custom Docker container I am forced to use the standard runtime (10.4 LTS), since the Machine Learning runtimes do not support custom containers (see https://docs.databricks.com/clusters/custom-containers.html#requirements).
I suspect this could also be contributing to the issue. Does the ML runtime provide any libraries that Ganglia needs to work on GPU compute?
01-11-2023 04:38 AM
Hi @James W , Ganglia is not available in custom Docker containers by default; this is a known limitation.
However, you can try the experimental support for Ganglia in custom Databricks Container Services (DCS) images:
https://github.com/databricks/containers/tree/master/experimental/ubuntu/ganglia
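One way to use that experimental support is to build the Ganglia-enabled image from the linked directory and then layer your own dependencies on top of it. The sketch below is a hypothetical Dockerfile illustrating that pattern — `my-registry/dcs-ganglia:latest` is a placeholder for an image you build and push yourself from the experimental directory, and the package list is only an example; check the repo's README for the actual build instructions and supported base images.

```dockerfile
# Hypothetical sketch: extend an image built from the experimental
# Ganglia Dockerfile in databricks/containers.
# "my-registry/dcs-ganglia:latest" is a placeholder — build and push
# the experimental image yourself first, e.g.:
#   docker build -t my-registry/dcs-ganglia:latest experimental/ubuntu/ganglia
#   docker push my-registry/dcs-ganglia:latest
FROM my-registry/dcs-ganglia:latest

# Example only: add whatever GPU/conda dependencies your workload needs
# (these package names are illustrative, not required by Ganglia).
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
    && rm -rf /var/lib/apt/lists/*
```

You would then point the cluster's Docker image URL at this derived image instead of the `databricksruntime/gpu-conda:cuda11` base, so the Ganglia daemons from the experimental layer are present at cluster startup.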

