megha_upadhyay
Databricks Employee


In the realm of big data analytics, Databricks has become a leading platform for managing and processing large datasets efficiently. While Databricks Runtime is highly optimized out of the box and features like serverless compute further simplify cluster management, understanding cluster metrics remains essential for advanced troubleshooting and fine-tuning in specific scenarios. 

This blog focuses specifically on decoding hardware metrics (e.g., CPU, memory, and network utilization) and Spark metrics (e.g., task execution, shuffle operations) in Databricks clusters. While GPU metrics are available for specialized machine learning workloads on GPU-enabled clusters, they are outside the scope of this discussion. 

By the end of this blog, you will gain a deeper understanding of these key metrics and practical strategies for interpreting and optimizing them, ensuring your Databricks clusters are well utilized and run workloads as efficiently as possible.

Let’s dive into the blog sections for a closer look:

Overview of Databricks Cluster Metrics 

Databricks cluster metrics are crucial for monitoring performance, identifying bottlenecks, and optimizing resource utilization. These metrics provide insights into how your cluster is performing, helping you make informed decisions about scaling, configuration, and task optimization. The metrics are updated in near real-time, typically with less than one minute delay. Available for both all-purpose and job compute, these metrics are stored in Databricks-managed storage rather than customer storage. The current metrics system offers a more comprehensive view of cluster resource usage compared to the older Ganglia UI. This enhanced visibility allows data engineering or platform teams to better understand performance patterns and maximize cluster efficiency.

Accessing Metrics

To view these metrics, navigate to the "Compute" section in the sidebar, select the desired compute resource, and click the "Metrics" tab. Hardware metrics are displayed by default, but you can switch to Spark metrics or GPU metrics (if the instance is GPU-enabled) using the dropdown menu. You can also filter by date range to view historical metrics for up to the past 30 days.


Understanding Databricks Cluster Metrics

Hardware Metrics

Hardware metrics, which can be monitored at either the compute or individual driver/worker node level, are essential for evaluating the physical performance of your cluster nodes. Understanding these metrics allows you to identify resource bottlenecks and optimize cluster configurations.


CPU Utilization

CPU utilization is one of the most critical metrics, as it directly impacts the processing power of your cluster. Understanding the different modes of CPU usage is essential:

  • User Mode: Represents the CPU time spent running user-level code, which includes application-level processes such as executing scripts, running queries, or performing computations in your Databricks Notebooks. This is typically where most application logic resides.
  • System Mode: Represents the CPU time spent running system-level code, which includes kernel-level operations like managing memory, handling I/O requests, and interacting with hardware.
  • Idle Mode: Represents the CPU time spent in an idle state when there are no active tasks to process.
  • IOWait Mode: This represents the CPU time spent waiting for I/O operations (e.g., reading from or writing to disks) to complete.
  • Other modes: Databricks also tracks additional CPU usage modes, including 
    • irq (Time spent on interrupt requests)
    • softirq (Time spent handling software interrupt requests)
    • nice (Time spent on lower-priority processes)
    • steal (Time other VMs took from your CPUs)


Interpretation and Mitigation:

  • High CPU utilization isn't always bad. It can also indicate that a cluster is being utilized optimally. However, consistently high utilization beyond 85-90% can indicate a bottleneck, where tasks are competing for processing power.
    • Scale up or scale out: If CPU utilization remains consistently high, you may need to add worker nodes or upgrade to more powerful instance types to distribute the workload.
  • High IOWait suggests that tasks are bottlenecked by slow disk I/O operations, such as reading or writing shuffle files or reading large datasets from storage.
    • Optimize data storage formats by using Delta Lake for faster reads/writes.
    • Use OPTIMIZE table_name to improve data layout for Delta tables or use Predictive Optimization, which removes the need to manually manage maintenance operations for Unity Catalog managed tables on Databricks.
    • Reduce the amount of data scanned from storage by using data skipping techniques, such as z-ordering or liquid clustering.
    • Use broadcast joins for smaller datasets to avoid shuffles altogether.
    • One key factor influencing disk I/O during shuffle operations is the Spark property spark.sql.shuffle.partitions. This setting controls how many partitions are created during wide transformations like joins and aggregations.

If the number of shuffle partitions is set too low, each partition becomes large, increasing the likelihood that data will not fit in memory and will spill to disk. This disk spilling causes higher IOWait, as tasks spend more time reading from and writing to disk rather than processing in memory. Conversely, setting the number of partitions too high leads to excessive overhead from managing many small tasks, which also hurts performance, but differently: it increases task scheduling and coordination costs.

Tuning spark.sql.shuffle.partitions helps balance these trade-offs:

  • Too few partitions: Larger partitions, more disk spills, higher IOWait.
  • Too many partitions: More scheduling overhead.

Alternatively, use AQE (enabled by default since Apache Spark™ 3.2.0) to auto-adjust partitions, or set the Spark property spark.sql.shuffle.partitions to auto, which enables auto-optimized shuffle by automatically determining this number based on the query plan and the query input data size.
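
As a minimal PySpark sketch of these options, assuming a Databricks notebook where a SparkSession is already available as spark (the values shown are illustrative, not recommendations):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined as `spark` in Databricks notebooks

# Option 1: rely on AQE (default since Spark 3.2.0) to coalesce shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Option 2 (Databricks): let auto-optimized shuffle pick the partition count from the query plan.
spark.conf.set("spark.sql.shuffle.partitions", "auto")

# Option 3: set an explicit value sized to your shuffle volume (roughly 128-200 MB per partition).
# spark.conf.set("spark.sql.shuffle.partitions", "400")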

  • High idle time may indicate underutilized cluster resources, suggesting that the cluster is overprovisioned and could potentially be downsized.
    • Enable autoscaling to dynamically adjust the number of workers based on workload demand.
      • Note: Low CPU utilization does not always mean Spark is idle. To investigate further, check the Spark UI.
    • For interactive clusters, one can set the auto-terminate value to reduce idle time.

Memory Utilization

Proper memory management ensures smooth workload execution, prevents out-of-memory (OOM) errors, and maintains cluster stability.

  • Used Memory: Represents the memory currently allocated to processes, including Spark executors and drivers.
  • Free Memory: Indicates the amount of memory available for allocation. High free memory may suggest underutilized resources.
  • Buffer Memory: Represents the memory used by buffers to temporarily store data during I/O operations.
  • Cached Memory: Represents the memory allocated for caching data to improve read performance by avoiding repeated disk I/O.


Interpretation and Mitigation:

  • Consistently high memory usage can indicate that your workload is memory-intensive or that there are potential inefficiencies in how memory is being utilized (e.g., large shuffles or caching too much data). This may lead to OOM errors or degraded performance.
  • If free memory is consistently low, it could mean that the cluster is running close to capacity, which also increases the risk of OOM errors during peak workloads.
    • Increase Memory: Upgrade the instance type of your cluster to a higher RAM capacity that can handle memory-intensive tasks.
  • High buffer memory usage may indicate frequent I/O operations, such as writing shuffle files or reading large datasets from storage.
    • Mitigation follows the same approach as discussed for high IOWait under CPU Utilization above.
  • Cached memory is beneficial for improving performance, but excessive caching can lead to memory contention, especially if large datasets are cached unnecessarily.
    • Use selective caching: cache only frequently accessed datasets with df.cache() or df.persist(), and release them with df.unpersist() once they are no longer needed.
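
A minimal sketch of selective caching (the table name below is a hypothetical placeholder used purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined as `spark` in Databricks notebooks

# Hypothetical table used for illustration.
dim_customers = spark.read.table("main.sales.dim_customers")

# Cache only the dataset that is reused across several downstream queries.
dim_customers.cache()
dim_customers.count()  # materialize the cache before the reuse begins

# ... queries that reuse dim_customers ...

# Release the memory once the reuse is over to avoid memory contention.
dim_customers.unpersist()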

Note: Refer here to understand caching better and when to use it.

Network Metrics

Network metrics are essential for monitoring the data transfer activity within Databricks clusters. They provide insights into the volume of data received and transmitted by cluster nodes, helping to identify bottlenecks caused by inefficient data movement or network congestion.

  • Received Data: Represents the total volume of data received through the network by cluster nodes.
  • Transmitted Data: Represents the total volume of data sent through the network by cluster nodes.


Interpretation and Mitigation: 

  • High network usage can result from inefficient data transfer patterns, such as frequent large shuffle operations or cross-region data transfers. This can lead to slower task execution due to network congestion.
  • When compute resources sit in one region and storage in another, transferring data between them incurs extra latency and egress costs.
    • Keep compute and storage resources in the same region to avoid cross-region transfer fees.

Additional Hardware Metrics

Other valuable hardware metrics include:

  • Server load distribution: Represents CPU utilization for each node in the cluster, including both driver and worker nodes.


Interpretation and Mitigation: 

    • If the driver node is highly utilized compared to the workers (often visualized as a single red square among blue ones in the cluster metrics), it may indicate non-distributed work or, in streaming jobs, a driver overwhelmed by managing too many streams or short trigger intervals.
      • Avoid operations like .collect() that pull large amounts of data onto the driver.
      • Consider splitting streams across clusters, increasing driver resources, or optimizing the number and configuration of streams.
    • If some workers are much more utilized than others, it can indicate data skew or imbalanced partitioning, causing inefficient parallelism.
      • Optimize data partitioning and leverage Adaptive Query Execution (AQE) for automatic skew handling in non-streaming jobs (see the sketch after this list).

Note: AQE is enabled by default in Databricks Runtime (DBR) starting from version 7.3 for batch queries. For streaming workloads using the ForeachBatch sink, AQE is enabled by default from DBR 13.1 on non-Photon clusters and from DBR 13.2 on Photon clusters.

  • Free filesystem space: Measures available storage across all mount points on cluster nodes, which is crucial for both spill management and disk caching.


Interpretation and Mitigation: 

    • Low Free Space can cause Spark operations to fail when spilling intermediate data to disk, or limit the effectiveness of disk caching (used to accelerate data reads).
      • Monitor disk space regularly, and optimize jobs to minimize spills (e.g., tune shuffle partitions and memory settings). 
      • Use disk cache primarily for accelerating reads, but be mindful not to overload your cluster with unnecessary or repetitive reads of large datasets.
  • Memory swap utilization: Represents memory swap usage patterns (measured in bytes).
  • Number of active nodes: Count of currently active nodes in the compute cluster.
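
Returning to the server load points above, here is a minimal PySpark sketch of the two mitigations: letting AQE handle skewed joins and keeping large results off the driver. The table names are hypothetical placeholders, and display() is the Databricks notebook helper:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already defined as `spark` in Databricks notebooks

# Let AQE detect and split skewed shuffle partitions automatically (batch queries).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Hypothetical tables for illustration.
orders = spark.read.table("main.sales.orders")
customers = spark.read.table("main.sales.dim_customers")
joined = orders.join(customers, "customer_id")

# Keep work distributed: aggregate on the executors and bring back only a small summary,
# instead of pulling the full joined dataset onto the driver with .collect().
summary = joined.groupBy("region").agg(F.count("*").alias("order_count"))
display(summary)  # Databricks notebook helper; use summary.show() outside Databricks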

Spark Metrics

Spark metrics provide critical insights into the execution and performance of Spark jobs running on your Databricks cluster. These metrics help identify inefficiencies, optimize resource usage, and ensure smooth job execution. Unlike hardware metrics, Spark metrics are aggregated at the compute level rather than the individual node level, offering a broader view of workload performance.

Task Metrics

Task metrics track the execution status and performance of individual tasks within a Spark job. These include:

  • Active Tasks: Represents the total number of tasks executing at a given time.
  • Total Failed Tasks: Tracks the total number of tasks that have failed in executors over a specific time interval.
  • Total Completed Tasks: Represents the total number of tasks that have been successfully completed by executors over a specific time interval.
  • Total Number of Tasks: Aggregates all tasks (active, failed, and completed) executed during a specific time interval, providing an overall view of task volume and workload distribution.


Interpretation and Mitigation: 

  • A high number of active tasks can indicate that the cluster is handling a large workload. However, if this leads to slow task execution or resource contention, it may suggest inefficient task scheduling or insufficient resources.
  • High task failure rates often result from issues like data skew (uneven data distribution), memory constraints, or incorrect configurations.
    • Address data skew by repartitioning datasets evenly or using salting techniques for highly skewed keys.
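
As an illustrative sketch of the salting technique (the table names, the join key, and the number of salts below are hypothetical placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already defined as `spark` in Databricks notebooks

NUM_SALTS = 16  # illustrative; size this to the degree of skew you observe

# Hypothetical inputs: `events` is heavily skewed on `key`; `lookup` is the smaller side.
events = spark.read.table("main.demo.events")
lookup = spark.read.table("main.demo.lookup")

# Spread each hot key across NUM_SALTS sub-keys by adding a random salt to the skewed side.
events_salted = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the other side once per salt value so every (key, salt) pair still matches.
lookup_salted = lookup.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
)

# Join on the composite key, then drop the helper column.
joined = events_salted.join(lookup_salted, ["key", "salt"]).drop("salt")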

Shuffle Read/Write

Shuffle operations involve redistributing data across nodes during Spark jobs. Key shuffle metrics include:

  • Shuffle Read: Represents the total size of data read during shuffle operations.
  • Shuffle Write: Represents the total size of data written during shuffle operations.


Interpretation and Mitigation: 

  • High Shuffle Read/Write sizes can significantly impact performance due to increased disk I/O and network traffic.
    • Reduce shuffled data through filtering and projection. Apply filters (e.g., WHERE clauses) as early as possible in your query to reduce the volume of data entering shuffle stages and use SELECT to prune unused columns before shuffling. While not always feasible, this approach is particularly effective for wide datasets with many columns.
    • Optimize partitioning strategies by ensuring even data distribution across partitions.
    • Use broadcast joins for smaller datasets to avoid shuffles altogether.
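
A short PySpark sketch combining these three mitigations; the table and column names are hypothetical placeholders:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()  # already defined as `spark` in Databricks notebooks

# Hypothetical tables for illustration.
fact = spark.read.table("main.sales.fact_transactions")
dim = spark.read.table("main.sales.dim_stores")

result = (
    fact
    .where(F.col("transaction_date") >= "2025-01-01")   # filter before the shuffle stage
    .select("store_id", "amount")                        # prune unused columns early
    .join(broadcast(dim.select("store_id", "region")),   # broadcast the small side to skip shuffling `fact`
          "store_id")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)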

Task Duration

This metric tracks the total elapsed time the JVM spent executing tasks on executors, measured in seconds.


Interpretation and Mitigation: 

  • If the task duration chart grows rapidly or exponentially, it may signal processing bottlenecks or resource allocation issues, as long task durations often indicate inefficiencies that need investigation.
    • Optimize Code Efficiency: Review and optimize code for better performance.
    • Adjust Resource Allocation: Ensure tasks have sufficient resources (CPU, memory) allocated.

Some Best Practices

Instance Types

Choose an instance family based on the workload type:

  • Use compute-optimized instances for tasks requiring high processing power. (e.g., ELT with full scans and no data reuse, Structured Streaming jobs, and running OPTIMIZE and Z-order Delta commands)
  • Use memory-optimized instances for tasks requiring high memory. (e.g., where a lot of shuffle and spills are involved or when Spark caching is needed)
  • Use storage-optimized instances for ad hoc and interactive data analysis or to leverage Delta caching
  • Use GPU-optimized instances for ML and DL workloads with an exceptionally high memory requirement
  • Use general-purpose instances only in the absence of specific requirements or to run the Delta VACUUM command

Photon

Photon is the next-generation engine on the Databricks Lakehouse Platform that provides extremely fast query performance at a low cost. Photon is compatible with Apache Spark™ APIs, works out of the box with existing Spark code, and provides significant performance benefits over the standard Databricks Runtime.

It is recommended to enable Photon (after evaluation and performance testing) for workloads with the following characteristics:

  • ETL pipelines consisting of Delta MERGE operations
  • Writing large volumes of data to cloud storage (Delta/Parquet)
  • Scans of large datasets, joins, aggregations and decimal computations
  • Auto Loader to incrementally and efficiently process new data arriving in storage
  • Interactive/ad hoc queries using SQL

Note: Databricks recommends using an instance type with at least 4 GB / core when using Photon.

Autoscaling

Enable autoscaling to dynamically adjust the number of workers based on workload demand. This prevents underutilization during idle periods and ensures adequate resources during peak loads.

Note: Autoscaling is generally not recommended for latency-sensitive workloads or traditional Structured Streaming jobs. Scaling up or down can introduce delays as new nodes are provisioned or removed, which may impact the consistent low-latency performance required by these workloads. For such scenarios, using a fixed-size cluster is preferable to ensure predictable response times.
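
As a rough illustration, the sketch below creates an autoscaling, auto-terminating cluster through the Databricks Clusters REST API. The host, token, node type, and runtime version are placeholders to adapt to your workspace, and the same settings can be applied from the compute UI instead:

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace-url>
token = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "autoscaling-etl-cluster",
    "spark_version": "15.4.x-scala2.12",                # example runtime; pick a current LTS
    "node_type_id": "i3.xlarge",                         # example AWS node type
    "autoscale": {"min_workers": 2, "max_workers": 8},   # scale between 2 and 8 workers on demand
    "autotermination_minutes": 30,                        # shut down idle all-purpose compute
    "runtime_engine": "PHOTON",                           # optional: enable Photon after evaluation
}

resp = requests.post(
    f"{host}/api/2.1/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])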

Number of Workers

Choosing the right number of workers requires some trials and iterations to determine the compute and memory needs of a Spark job. [For more details, refer here]

  • Compute-Intensive Workloads: For workloads that require significant processing power, increasing the number of workers allows you to distribute the load more effectively and achieve better parallelism.
  • Memory-Intensive Workloads: For jobs that require large amounts of memory (such as caching large datasets or running memory-heavy transformations), it’s often better to use fewer but more powerful worker nodes with higher memory capacity.
  • Operation Type Matters: The type of Spark operations in your workload can also influence your worker configuration:
    • Narrow Transformations (No Shuffle): Workloads that do not require shuffling (such as simple map or filter operations) can benefit from using a larger number of smaller instances. This setup maximizes parallelism without incurring significant network overhead, as data does not need to be moved between nodes.
    • Wide Transformations (Shuffle-Heavy): Workloads that involve wide operations (such as groupBy, join, or reduceByKey) trigger shuffles, which require data to be exchanged between nodes. In these cases, using fewer but larger instances is advantageous, as more data can be processed locally on each node, reducing the amount of network transfer and improving shuffle performance.

Note: It is generally recommended to use a smaller number of larger virtual machines.

Using Databricks System Tables for Monitoring

Databricks provides several system tables that can be used to monitor cluster metrics and identify bottlenecks:

  1. system.compute.clusters Table
    • Tracks cluster configurations over time, including worker count, instance types, and autoscaling settings.
    • Use this table to audit cluster changes and ensure configurations align with workload requirements.
  2. system.compute.node_timeline Table
    • Captures minute-by-minute resource utilization metrics such as CPU, memory, and network usage per node.
    • Analyze this table to identify underutilized resources or nodes experiencing high contention.
  3. system.compute.node_types Table
    • Provides details about available node types, including vCPU count and memory capacity.
    • Use this table to select optimal instance types based on workload needs.

Sample Queries

Identify Jobs that consumed the highest DBUs in the past month

SELECT
  usage_metadata.job_id AS job_id,
  SUM(usage_quantity) AS total_dbus_consumed
FROM system.billing.usage
WHERE usage_metadata.job_id IS NOT NULL
  AND usage_date >= NOW() - INTERVAL 30 DAY
GROUP BY usage_metadata.job_id
ORDER BY total_dbus_consumed DESC;

Monitor Daily DBU Usage by SKU

SELECT
  sku_name,
  SUM(usage_quantity) AS total_dbus_consumed,
  DATE_TRUNC('day', usage_date) AS day
FROM system.billing.usage
WHERE usage_date >= NOW() - INTERVAL 30 DAY
GROUP BY sku_name, DATE_TRUNC('day', usage_date)
ORDER BY total_dbus_consumed DESC;

Monitor Cluster Resource Utilization

SELECT
  cluster_id,
  AVG(cpu_user_percent) AS avg_cpu_utilization_percent,
  AVG(mem_used_percent) AS avg_memory_utilization_percent,
  DATE_TRUNC('hour', end_time) AS hour
FROM system.compute.node_timeline
WHERE end_time >= NOW() - INTERVAL 1 DAY
GROUP BY cluster_id, DATE_TRUNC('hour', end_time)
ORDER BY avg_cpu_utilization_percent DESC;

Identify Underutilized Clusters

SELECT
  cluster_id,
  AVG(cpu_user_percent) AS avg_cpu_utilization_percent,
  COUNT(*) AS data_points
FROM system.compute.node_timeline
WHERE end_time >= NOW() - INTERVAL 7 DAY
GROUP BY cluster_id
-- cpu_user_percent is reported as a percentage (0-100), so this keeps clusters averaging under ~20% CPU
HAVING AVG(cpu_user_percent) < 20
ORDER BY avg_cpu_utilization_percent ASC;

Conclusion

Understanding and interpreting Databricks cluster metrics is essential for optimizing performance and resource utilization. By applying the strategies outlined in this guide, you can ensure that your clusters run efficiently, reduce costs, and improve overall productivity. Remember, proactive monitoring and timely mitigation are key to maintaining high-performing Databricks environments.

References

  1. View compute metrics | Databricks Documentation
  2. Adaptive query execution - Azure Databricks | Microsoft Learn
  3. Spark Caching — Databricks
  4. Gaps between Spark jobs - Azure Databricks | Microsoft Learn
  5. Optimize performance with caching on Databricks
  6. Performance Tuning - Spark 3.5.3 Documentation
  7. Comprehensive Guide to Optimize Databricks, Spark and Delta Lake Workloads
  8. Compute configuration recommendations | Databricks Documentation
  9. Compute system tables reference | Databricks Documentation
  10. Optimize data file layout | Databricks Documentation
  11. Predictive optimization for Unity Catalog managed tables | Databricks Documentation