Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Azure Databricks Metrics to Prometheus?

Shahe
New Contributor

What is the best method to expose Azure Databricks metrics to Prometheus specifically? Is it also possible to get the underlying Spark metrics? All I can see clearly defined in the documentation is the serving endpoint metrics:

https://learn.microsoft.com/en-us/azure/databricks/machine-learning/model-serving/metrics-export-ser...

Please advise, thanks.


anardinelli
New Contributor III

Hi @Shahe, how are you?

The standard approach is to first configure Spark to enable the Prometheus servlet by setting the following Spark configs on the cluster:

spark.ui.prometheus.enabled true
spark.metrics.namespace <app_name>

We recommend replacing <app_name> with job names for job clusters.
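If you create clusters through the Clusters API rather than the UI, the same two settings can be sketched in the `spark_conf` block of the request payload (illustrative; `my_job` is a placeholder namespace):

```json
{
  "spark_conf": {
    "spark.ui.prometheus.enabled": "true",
    "spark.metrics.namespace": "my_job"
  }
}
```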

Then, create an init script and attach it to the cluster:

# Write the init script to DBFS; attach it to the cluster as a
# cluster-scoped init script so it runs on each node at startup.
dbutils.fs.put("dbfs:/dlyle/install-prometheus.sh",
"""
#!/bin/bash
# Expose Spark metrics through the built-in PrometheusServlet sink
cat <<EOF > /databricks/spark/conf/metrics.properties
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
EOF
# Enable streaming and app-status metric sources on the driver
cat >/databricks/driver/conf/00-custom-spark.conf <<EOF
[driver] {
  spark.sql.streaming.metricsEnabled = "true"
  spark.metrics.appStatusSource.enabled = "true"
}
EOF
""", True)

Confirm the driver port by running: 

spark.conf.get("spark.ui.port")

Make a note of the driver port. You'll need this when setting up the scrape target.
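The scrape target is served through the Databricks driver proxy, so the path combines the cluster ID and the driver port noted above. A minimal sketch of how that path is assembled (the cluster ID and port below are placeholder values; read the real port with `spark.conf.get("spark.ui.port")`):

```python
# Build the driver-proxy metrics path used as the Prometheus scrape target.
# org_id "0" matches the path shown in the scrape config below; adjust for
# your workspace if needed.
def driver_proxy_metrics_path(cluster_id: str, port: str, org_id: str = "0") -> str:
    return f"/driver-proxy-api/o/{org_id}/{cluster_id}/{port}/metrics/prometheus/"

# Placeholder cluster ID and port for illustration only
print(driver_proxy_metrics_path("0123-456789-abcde", "40001"))
# /driver-proxy-api/o/0/0123-456789-abcde/40001/metrics/prometheus/
```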

  1. Generate a personal access token as shown here

  2. Add the following to your prometheus_values.yaml

extraScrapeConfigs: |
   - job_name: '<cluster_name>'
     metrics_path: /driver-proxy-api/o/0/<cluster_id>/<spark_ui_port>/metrics/prometheus/
     static_configs:
       - targets:
         - <workspace_url>
     authorization:
       # Sets the authentication type of the request.
       type: Bearer
       credentials: <personal access token>

Update your Helm installation: helm upgrade <installation_name, e.g. dlyle-prometheus> -f prometheus_values.yaml -n <namespace, e.g. prometheus-monitoring> --create-namespace prometheus-community/prometheus

The Prometheus server has a sidecar container that should automatically reload the config. You can also force a restart: kubectl rollout restart deployment <installation_name> -n <namespace>

At this point, you should be able to see your cluster as a scrape target in Prometheus server.
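One way to verify this programmatically is to query the Prometheus `/api/v1/targets` endpoint and check the health of your job. A sketch of that check, using a hypothetical sample payload (in practice you would fetch the JSON from your Prometheus server):

```python
import json

# Hypothetical sample of a /api/v1/targets response, for illustration only.
# Fetch the real payload from http://<prometheus-server>/api/v1/targets.
sample = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "my-databricks-cluster"}, "health": "up"}
    ]}
})

def target_health(payload: str, job: str) -> str:
    """Return the health of the named scrape job, or 'missing' if absent."""
    for target in json.loads(payload)["data"]["activeTargets"]:
        if target["labels"].get("job") == job:
            return target["health"]
    return "missing"

print(target_health(sample, "my-databricks-cluster"))  # up
```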

Follow this link for more reference:

https://docs.databricks.com/en/machine-learning/model-serving/metrics-export-serving-endpoint.html

 

DanielB
New Contributor II

Hello,

I don't have Databricks running as a pod in an AKS cluster; it's running on Azure as SaaS. What should I do to export the metrics to Prometheus?
