As software systems scale, so does the amount of data they process and the work they do. This applies not only to cloud-native services but also to the data pipelines that support them. When you have just a handful of data jobs and pipelines to worry about, it's not too difficult to reason about their metrics; as the number of pipelines grows over time, however, it becomes important to keep a pulse on them. That's where observability comes in. Setting up a good observability foundation early on can be essential for the cost, performance, and correctness of your pipelines.
Originally, the term observability came from control theory, where it describes how well a system's internal state can be inferred from the data it generates. Today, with the growth of cloud computing, it has expanded to cover different types of telemetry data - Logs, Metrics, and Traces (a.k.a. the Three Pillars of Observability).
In this blog post, we'll focus on collecting and processing various metrics from your Databricks Spark jobs using free, open-source tools such as Prometheus and Grafana. In the next part, we'll turn to APM (Application Performance Monitoring) and explore runtime performance observability using Pyroscope.
All source code in this post can be found in the Spark/Databricks Observability Demo on GitHub. It includes the necessary Terraform code to set up monitoring on the Databricks side, Docker Compose to spin up Grafana/Prometheus/Pyroscope, and other tools to help you get started.
Observability is typically counted among the -ilities - non-functional requirements that describe how a system operates as a whole rather than its specific behaviors. So what does that mean for you, a Software, DevSecOps, or Data engineer reading this? Well, a few things. Have you ever considered questions like:
And, most importantly - how can I track these metrics over time and raise alerts if needed?
Metrics are numerical representations of the health (and other aspects) of a system. Typically, we collect CPU and memory usage/utilization, network traffic, and so on. In the world of Apache Spark™, this extends to the JVM and the performance of Spark applications themselves: we can capture information about JVM heap usage, Spark jobs and tasks, the scheduler, shuffle, and more.
Spark applications fit quite well into this vision - since Spark is a distributed system by nature, we should be able to easily aggregate metrics across our workloads and derive insights from them.
There are many tools out there for collecting, aggregating, and visualizing metrics. Here we'll focus on two popular, free options - Prometheus and Grafana. Prometheus collects metrics from various sources (both hardware and software), while Grafana is mainly used to visualize them and build views and dashboards that show the bigger picture on top.
Spark actually comes with out-of-the-box support for Prometheus. However, it only covers a subset of the available metrics, and it's not really suitable for short-lived jobs, as we'll see.
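For context, this is roughly what the built-in route looks like. Since Spark 3.0, setting spark.ui.prometheus.enabled=true exposes executor metrics in Prometheus format on the driver UI. The snippet below is only an illustration - the driver host is a placeholder and 4040 is the default UI port - and on ephemeral Databricks job clusters this endpoint is usually not reachable by an external Prometheus server, which is exactly the limitation we work around next.
# Illustrative only: Spark 3.0+ can serve executor metrics in Prometheus format
# from the driver UI when spark.ui.prometheus.enabled=true is set.
# <driver-host> is a placeholder; on Databricks job clusters this UI is
# typically not reachable by an external Prometheus server.
curl http://<driver-host>:4040/metrics/executors/prometheus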
Prometheus is a pull-based metrics collector - you configure it to periodically scrape metrics from an accessible endpoint. This works well for long-running services, but it's hard to work around when you have to monitor many ephemeral workloads, such as one-time or periodic Jobs in Databricks.
Here is where Prometheus Pushgateway comes into the picture. Pushgateway exists so that short-lived jobs can push their metrics to it, and we can configure Prometheus to target Pushgateway as a source. The Pushgateway itself will transparently cache and proxy the metrics pushed to it. We can leverage that behavior by deploying the Pushgateway alongside our Prometheus instance and configuring our Databricks Job Clusters to push metrics to Pushgateway instead.
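To get a feel for the push model, you can push an arbitrary metric to a Pushgateway by hand; in the sketch below, the host is a placeholder for wherever you run yours, and demo_job is an arbitrary job name.
# Push a throwaway metric to Pushgateway by hand (host and job name are placeholders).
# Pushgateway caches it and exposes it to Prometheus on its own /metrics endpoint.
echo "demo_job_last_success_unixtime $(date +%s)" | \
  curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/demo_job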
To get Spark's metrics into Pushgateway, we're using another library called spark-metrics. Internally, it integrates with Spark's own metrics system and pushes those metrics to Pushgateway. You can find an up-to-date version that works with Spark 3.5+ in the rayalex/spark-metrics fork, along with pre-built binaries.
The easiest way to get going is to have an init script that installs the necessary libraries and configures the Pushgateway integration in a reproducible way. We leverage init scripts here to make sure the environment is ready before the Spark JVM process boots, so the configuration is picked up on startup.
The accompanying Terraform example will do these parts for you.
In order to hook into Spark's JMX metrics, we need a compatible client that gets bootstrapped at runtime. As mentioned earlier, we'll be using the spark-metrics library for this. There are a few ways to do this, but for the sake of simplicity we'll place the jar in our Workspace Files.
Finally, we can just copy the jar to its target location as part of our init-script:
cp /Workspace/Users/.../spark_metrics.jar /databricks/jars
For the library to be able to actually do anything, we need to create its configuration. This includes things like the Pushgateway endpoint, which metrics to send, job names, etc.
We’ll only show relevant parts here. However, you can look at the entire configuration for the full list of options. To make the script reusable, we’ll also use cluster environment variables to pull in some additional configuration options.
# configure spark metrics to use pushgateway as target
pushgatewayHost=$PROMETHEUS_HOST
jobName=$PROMETHEUS_JOB_NAME
cat >> /databricks/spark/conf/metrics.properties <<EOL
# Enable Prometheus for all instances by class name
*.sink.prometheus.class=org.apache.spark.banzaicloud.metrics.sink.PrometheusSink
# Prometheus pushgateway address
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=$pushgatewayHost
*.sink.prometheus.period=5
*.sink.prometheus.labels=job_name=$jobName
# Enable HostName in Instance instead of Appid (Default value is false i.e. instance=\${appid})
*.sink.prometheus.enable-hostname-in-instance=true
# Enable JVM metrics source for all instances by class name
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOL
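If you want the init script to fail fast when something is misconfigured, you could also append a couple of sanity checks. This is just a sketch, reusing the pushgatewayHost variable defined above.
# Optional sanity checks (sketch): confirm the sink configuration was written
# and that Pushgateway is reachable from the cluster before Spark starts.
cat /databricks/spark/conf/metrics.properties
curl -fsS "http://$pushgatewayHost/metrics" > /dev/null && echo "Pushgateway reachable"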
The last step is to add the necessary configuration to the cluster. We've made the above script easier to reuse by injecting the configuration through your cluster's environment variables. Currently, we only supply the job name as a metric tag, but you can expand this to fit your needs - e.g., adding internal IDs, a job version, a run date, etc. Databricks itself will also supply a few values here, such as the Databricks job run ID.
PROMETHEUS_HOST=10.10.10.0:9091
PROMETHEUS_JOB_NAME=dbx-demo-job
We also need to attach the init script to our interactive (or job) clusters. The Terraform demo we’re using will configure all of this for us.
And that's it! Running our job now will provision a cluster, load the new jar onto the classpath, and, if everything goes well, pick up the configuration before spinning up the Spark context. Next, let's set up the rest of the tools and see this in action!
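Once Pushgateway is deployed (we'll get to that in the next section), a quick way to confirm that the job is actually pushing metrics is to query Pushgateway directly. The host and job name below match the environment variables from the example above.
# Confirm the job's metrics are arriving at Pushgateway (host and job name
# taken from the environment variables above).
curl -s http://10.10.10.0:9091/metrics | grep dbx-demo-job | head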
The final piece of the puzzle is to set up Prometheus itself (alongside Pushgateway) and Grafana to visualize the results. Our example comes with a ready-to-go docker-compose configuration that will deploy everything for you in your environment of choice - the only requirement is connectivity between your Databricks cluster and the VM running Pushgateway. If you're just setting up your workspace, I recommend using VNet/VPC injection so you have enough flexibility to deploy your networking as needed.
Let's go through the components step by step. In a production system you would probably deploy these components in a more resilient and permanent fashion, but for our example Docker Compose is enough to demonstrate that everything works.
Pushgateway doesn't require any specific configuration. You can run it as-is, and it will listen for metrics on port 9091.
pushgateway:
  image: prom/pushgateway
  command: --web.enable-admin-api
  ports:
    - '9091:9091'
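As a side note, the --web.enable-admin-api flag above turns on Pushgateway's admin endpoints. Because Pushgateway keeps pushed metrics around until they are overwritten or deleted, this gives you an easy way to wipe stale series left over from old job runs - for example:
# Wipe all metrics currently cached by Pushgateway (requires the admin API flag).
curl -X PUT http://localhost:9091/api/v1/admin/wipe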
Prometheus requires some configuration to work. Most importantly, we have to tell it to periodically scrape the metrics sitting in Pushgateway. In our compose example, we provide the configuration as a mounted volume.
prometheus:
  image: prom/prometheus
  ports:
    - '9090:9090'
  volumes:
    - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
  command: --config.file=/etc/prometheus/prometheus.yml
And the configuration file itself:
global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    honor_timestamps: true
    metrics_path: '/metrics'
    static_configs:
      - targets: ['pushgateway:9091']
Here we’re telling Prometheus to scrape our (docker-compose) Pushgateway endpoint every 5 seconds.
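To double-check that the scrape is working, you can ask Prometheus itself through its HTTP API - assuming the compose stack is running locally on the default ports:
# List scrape targets and their health as seen by Prometheus.
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'
# Confirm the Pushgateway target is up via a PromQL query.
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="pushgateway"}'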
Once our metrics are being ingested, the last step is to explore and visualize them. Grafana also requires very little configuration, and if you're already running it elsewhere, it's easy to extend it to consume metrics from Prometheus as well.
The Docker Compose configuration is fairly simple:
grafana:
  image: grafana/grafana
  ports:
    - '3000:3000'
  volumes:
    - ./config/grafana/datasources:/etc/grafana/provisioning/datasources/
We can add the Prometheus Datasource through the UI, or programmatically as we’re doing here:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    editable: true
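If you want to verify the provisioning without clicking through the UI, Grafana's HTTP API can list the configured data sources - assuming the stock image's default admin/admin credentials:
# List provisioned data sources (default credentials are admin/admin unless changed).
curl -s -u admin:admin http://localhost:3000/api/datasources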
And we're done. The next time our job runs (and, in the case of interactive clusters, while they're up), we should see the metrics show up in Grafana. Let's look at a few examples.
A basic dashboard showing information about a simple PiCalculator Databricks Job, which runs every 5 minutes. You can see the number of executors, their average memory usage, GC statistics, and overall Spark tasks.
We've seen how we can easily and effectively gain insights into our Databricks applications' metrics using free and open-source tools. In the next part, we'll explore Application Performance Monitoring with Pyroscope, dive deep into Spark internals, and see how we can track application performance in detail and monitor how it changes over time. Stay tuned!