Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Unable to access metrics from Driver node on localhost:4040

vishal_balaji
Visitor

Greetings,

I am trying to set up monitoring in Grafana for all my Databricks clusters.

I have added two things as part of this:

Under Compute > Configuration > Advanced > Spark > Spark Config, I have added:
spark.ui.prometheus.enabled true

Under init_scripts, I have this script:

#!/bin/bash

cat > /databricks/spark/conf/jmxCollector.yaml <<EOF
lowercaseOutputName: false
lowercaseOutputLabelNames: false
whitelistObjectNames: ["*:*"]
EOF

cat >> /databricks/spark/conf/metrics.properties <<EOF
# Enable Prometheus for all instances by class name
driver.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
executor.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
driver.sink.prometheusServlet.path=/metrics/prometheus
executor.sink.prometheusServlet.path=/metrics/executor/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus

*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource

# *.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
# *.sink.console.period=120
# driver.sink.console.unit=seconds
EOF
 
However, I am not able to access these metrics on localhost:4040 when I connect to the cluster. Running curl against localhost:4040 gives:
curl: (7) Failed to connect to localhost port 4040 after 1 ms: Couldn't connect to server
 
Connecting directly to the driver IP gives an empty response:

* Connected to 10.4.86.136 (10.4.86.136) port 37479
> GET /metrics/prometheus HTTP/1.1
> Host: 10.4.86.136:37479
> User-Agent: curl/8.5.0
> Accept: */*
* Empty reply from server
* Closing connection
curl: (52) Empty reply from server

 
  1. Am I configuring something wrong here? Why is the endpoint not reachable via localhost:4040, as mentioned in the docs - https://spark.apache.org/docs/latest/monitoring.html#metrics
  2. Why am I getting an empty response from DRIVER_IP/metrics/prometheus? I found that suggestion here - https://stackoverflow.com/questions/70989641/spark-executor-metrics-dont-reach-prometheus-sink
  3. If I have to access this only through the DRIVER_IP, how do I get access to it within the context of the init_script?
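On Databricks the driver UI does not necessarily listen on port 4040; from a notebook attached to the same cluster, the actual address can be read from the SparkContext via the standard PySpark property `sc.uiWebUrl`. A minimal sketch (the helper `metrics_url` is hypothetical, added here only for illustration):

```python
def metrics_url(ui_web_url: str, path: str = "/metrics/prometheus") -> str:
    """Join the Spark UI base URL with the PrometheusServlet path.

    `metrics_url` is a hypothetical helper; `ui_web_url` would come from
    `sc.uiWebUrl` inside a Databricks notebook on the cluster.
    """
    return ui_web_url.rstrip("/") + path

# In a notebook on the cluster (assuming the sink is configured), one could then:
#   import urllib.request
#   print(urllib.request.urlopen(metrics_url(sc.uiWebUrl)).read().decode()[:500])
print(metrics_url("http://10.4.86.136:37479"))
```

The IP and port above are taken from the curl attempt in the post; in practice they differ per cluster, which is why reading `sc.uiWebUrl` at runtime is preferable to hard-coding 4040.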

szymon_dybczak
Esteemed Contributor III

Hi @vishal_balaji ,

You're following guides written for OSS Apache Spark. localhost won't work in this case because in Databricks all compute is cloud-based, so the Spark UI runs on the remote driver, not on your machine.

Please follow the guide below on how to configure this properly on Databricks:

Databricks Observability using Grafana and Prometheus

Hi @szymon_dybczak ,

Thanks for the quick response. We initially tried making Pushgateway work, but it seems to be designed for tracking metrics from ephemeral batch jobs.

We are trying to track metrics for streaming jobs, which Pushgateway cannot handle: it stores all metrics in memory and quickly runs out of memory on the host machine.
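For long-running streaming jobs, a pull-based setup avoids Pushgateway's in-memory accumulation: Prometheus scrapes the driver's PrometheusServlet endpoint directly. A sketch of a scrape config, assuming the driver IP and port from the curl attempt above and that Prometheus has network access to the driver:

```yaml
# Hypothetical prometheus.yml fragment; the target must be the driver's
# actual IP and Spark UI port, which differ per cluster and must be
# discovered at cluster start (e.g. registered by an init script).
scrape_configs:
  - job_name: "databricks-spark-driver"
    metrics_path: /metrics/prometheus
    scrape_interval: 30s
    static_configs:
      - targets: ["10.4.86.136:37479"]
```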
