Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Understanding High I/O Wait Despite High CPU Utilization in system.compute Metrics

saicharandeepb
New Contributor III

Hi everyone,

I'm working on building a hardware metrics dashboard using the system.compute schema in Databricks, specifically leveraging the clusters, node_types, and node_timeline tables.
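For context, a sketch of the kind of query the dashboard is built on (assuming the system tables are enabled in the workspace; column names other than cpu_user_percent and cpu_wait_percent may need adjusting for your environment):

```python
# Runs in a Databricks notebook where `spark` is predefined.
# Hourly hardware utilization per cluster from the node-level timeline.
hourly = spark.sql("""
    SELECT
        cluster_id,
        date_trunc('hour', start_time)   AS hour,
        avg(cpu_user_percent)            AS avg_cpu_user_pct,
        avg(cpu_wait_percent)            AS avg_cpu_wait_pct,
        avg(mem_used_percent)            AS avg_mem_used_pct
    FROM system.compute.node_timeline
    GROUP BY cluster_id, date_trunc('hour', start_time)
""")
display(hourly)
```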

While analyzing the data, I came across something that seems contradictory to common industry guidance:

It's generally accepted that if I/O wait exceeds 10%, it indicates CPU performance degradation due to the processor waiting on disk or network I/O.

However, in several cases from my data, I noticed that even when cpu_wait_percent is greater than 10%, the cpu_user_percent is still above 90% — which suggests the CPU is actively doing useful work.
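These are the kinds of rows in question; a sketch of the filter that surfaces them:

```python
# Samples where both metrics are elevated in the same reported window.
suspect = spark.sql("""
    SELECT cluster_id, instance_id, start_time, end_time,
           cpu_user_percent, cpu_wait_percent
    FROM system.compute.node_timeline
    WHERE cpu_wait_percent > 10
      AND cpu_user_percent > 90
    ORDER BY start_time DESC
""")
display(suspect)
```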

This seems counterintuitive. Shouldn't high I/O wait reduce the CPU's ability to perform user-level tasks?

Has anyone else observed this behavior in Databricks or Spark environments? Could this be due to how metrics are sampled or reported in the system tables? Or is it possible that multiple cores are being utilized in parallel, masking the impact of I/O wait?

Any insights or explanations would be greatly appreciated!

Thanks in advance!

1 ACCEPTED SOLUTION


mark_ott
Databricks Employee

Your observation highlights a subtlety in interpreting CPU metrics, especially in distributed environments like Databricks, where cluster and node-level behaviors can diverge from typical single-server intuition.

Direct Answer

No, seeing both a high cpu_user_percent (e.g., >90%) and a high cpu_wait_percent (e.g., >10%) on a node or cluster is not necessarily a contradiction. While high I/O wait commonly signals that CPUs are stalled waiting on disk or network, in multi-core or distributed systems (such as Databricks clusters) these metrics are averages or aggregates, so some cores can be doing user-level work while others are waiting on I/O during the same interval. Both values can therefore be elevated at once, especially when workloads are spiky or heterogeneous across the available resources.
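As a toy illustration of the averaging effect (plain Python with hypothetical numbers): a node where most cores are busy with user work while one core sits in I/O wait reports both node-level metrics as elevated, even though each core's own shares still sum to 100%.

```python
# Hypothetical per-core snapshot for an 8-core node over one sampling window.
# Each core's user/iowait/idle shares sum to 100%.
per_core = [{"user": 100.0, "iowait": 0.0}] * 7   # cores 0-6: fully busy with user work
per_core += [{"user": 0.0, "iowait": 100.0}]      # core 7: stalled on disk/network I/O

n = len(per_core)
avg_user = sum(c["user"] for c in per_core) / n      # 87.5
avg_iowait = sum(c["iowait"] for c in per_core) / n  # 12.5

print(f"node-level cpu_user_percent ~ {avg_user:.1f}, cpu_wait_percent ~ {avg_iowait:.1f}")
# Within one consistent window the user/system/wait/idle shares still sum to ~100,
# so a single row reporting >90 user AND >10 wait usually reflects metrics taken from
# different windows or aggregated separately (see the reporting point below).
```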

Explanation of Metrics

  • cpu_user_percent: The percentage of CPU time spent running user-space processes (your workloads).

  • cpu_wait_percent (I/O wait): The percentage of time the CPU sits idle while an outstanding disk or network I/O request is pending.
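On Linux nodes these figures ultimately come from the kernel's CPU time counters (e.g. /proc/stat), where iowait is literally idle time during which an I/O request is outstanding. A rough sketch of how such percentages are derived (how Databricks maps these counters into node_timeline is an assumption here, but the counter semantics are standard):

```python
import time

def read_cpu_times():
    """Return (user+nice, system, idle, iowait, total) jiffies from the aggregate 'cpu' line."""
    with open("/proc/stat") as f:
        # First line: 'cpu  user nice system idle iowait irq softirq steal ...'
        fields = [int(x) for x in f.readline().split()[1:]]
    user, nice, system, idle, iowait = fields[:5]
    total = sum(fields[:8])  # user..steal
    return user + nice, system, idle, iowait, total

# Sample twice and compute percentages from the deltas, mpstat-style.
u1, s1, i1, w1, t1 = read_cpu_times()
time.sleep(1)
u2, s2, i2, w2, t2 = read_cpu_times()

dt = t2 - t1
print(f"user%   ~ {100 * (u2 - u1) / dt:.1f}")
print(f"iowait% ~ {100 * (w2 - w1) / dt:.1f}")  # accrues only while the CPU is otherwise idle
```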

Why Both Can Be High

  • Parallelism Across Cores: On a multi-core node, some cores might be fully occupied doing user work, while others are stalled waiting for I/O. The reported user and wait percentages are often averaged across cores. So, high values for both can indicate mixed workloads.

  • Workload Interleaving: Even on a single core, Spark's task scheduling (and speculative execution) interleaves compute-heavy tasks with tasks bottlenecked by I/O, so over any sampling window the core accrues both user time and I/O-wait time.

  • System/Metric Reporting: Aggregation over time windows can mask rapid context switches between waiting and processing, making both metrics appear elevated when, in fact, the system is rapidly oscillating between these states.
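To make the reporting point concrete with made-up numbers: if a dashboard (or a coarse time window) aggregates each metric separately, for example taking per-minute maxima, both can come out elevated even though no single sample showed both at once.

```python
# Hypothetical one-minute trace of (cpu_user_percent, cpu_wait_percent) samples for
# a node that oscillates between a compute-heavy phase and an I/O-bound phase.
samples = [
    (98.0, 1.0), (97.0, 2.0), (96.0, 3.0),  # compute-heavy phase
    (20.0, 75.0), (15.0, 80.0),             # I/O-bound phase (shuffle read, spill, etc.)
    (99.0, 0.5),
]

max_user = max(u for u, _ in samples)                 # 99.0
max_wait = max(w for _, w in samples)                 # 80.0
avg_user = sum(u for u, _ in samples) / len(samples)
avg_wait = sum(w for _, w in samples) / len(samples)

print(f"max: user={max_user:.1f}%, wait={max_wait:.1f}%  -> both look 'high'")
print(f"avg: user={avg_user:.1f}%, wait={avg_wait:.1f}%")
# No individual sample had user > 90 and wait > 10 at the same time, yet the
# aggregated view can easily suggest otherwise.
```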

Industry Guidance Context

  • Classic guidance (high I/O wait = CPU bottlenecked by slow disk/network, reducing compute usefulness) holds better on single-core or less parallelized servers.

  • On distributed, parallel systems (Spark clusters like Databricks), multiple tasks, often of different types (I/O-bound vs. CPU-bound), run simultaneously. Such scenarios can lead to high user and wait (and sometimes even system) percentages concurrently.

Recommendations

  • Granular Analysis: Break down the metrics by core, by time window, and by process/task type (if possible).

  • Investigate Specific Nodes/Tasks: Look for nodes consistently showing high wait and user simultaneously—isolated instances may be less concerning than cluster-wide patterns.

  • Cross-check With Other Metrics: Review disk, network throughput, GC pauses, and task-level latencies to better diagnose performance bottlenecks.
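A sketch of the kind of drill-down query these recommendations point to, assuming the system.compute.node_timeline columns referenced in the question plus the documented network counters (verify the exact column names against your workspace before relying on them):

```python
# Runs in a Databricks notebook where `spark` is predefined.
# Flag node/minute windows where user and wait are both elevated, with network
# volume alongside for cross-checking against other bottleneck signals.
drilldown = spark.sql("""
    SELECT
        cluster_id,
        instance_id,
        date_trunc('minute', start_time)                  AS minute,
        max(cpu_user_percent)                             AS max_user_pct,
        max(cpu_wait_percent)                             AS max_wait_pct,
        sum(network_received_bytes + network_sent_bytes)  AS network_bytes
    FROM system.compute.node_timeline
    WHERE start_time >= current_timestamp() - INTERVAL 1 DAY
    GROUP BY cluster_id, instance_id, date_trunc('minute', start_time)
    HAVING max(cpu_user_percent) > 90 AND max(cpu_wait_percent) > 10
    ORDER BY max_wait_pct DESC
""")
display(drilldown)
```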

In summary, your data is plausible given Databricks’ architecture and Spark’s workload profile. Focus on continuous profiling and drill down to root causes if these patterns align with degradations in job performance or SLA breaches.


