Hi everyone,
I'm building a hardware metrics dashboard on top of the system.compute schema in Databricks, specifically the clusters, node_types, and node_timeline tables.
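For context, the kind of rollup feeding the dashboard looks roughly like this (a simplified sketch; the hourly bucketing and 7-day window are just arbitrary choices I made, and the columns are the ones documented for system.compute.node_timeline):

```python
# Hourly per-cluster CPU/memory rollup from node_timeline (simplified sketch).
# The 7-day lookback and hourly granularity are illustrative, not prescriptive.
hourly_cpu = spark.sql("""
    SELECT
        cluster_id,
        date_trunc('HOUR', start_time)  AS hour,
        avg(cpu_user_percent)           AS avg_cpu_user_pct,
        avg(cpu_system_percent)         AS avg_cpu_system_pct,
        avg(cpu_wait_percent)           AS avg_cpu_wait_pct,
        avg(mem_used_percent)           AS avg_mem_used_pct
    FROM system.compute.node_timeline
    WHERE start_time >= current_timestamp() - INTERVAL 7 DAYS
    GROUP BY cluster_id, date_trunc('HOUR', start_time)
""")
hourly_cpu.show(truncate=False)
```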
While analyzing the data, I ran into something that seems to contradict common industry guidance:
The usual rule of thumb is that sustained I/O wait above 10% indicates performance degradation, because the CPU is sitting idle while waiting on disk or network I/O.
However, in several samples from my data, cpu_wait_percent is above 10% while cpu_user_percent is simultaneously above 90%, which suggests the CPU is still doing plenty of useful work.
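For reference, the check that surfaces these rows is essentially the following (the 10/90 thresholds are just the numbers from my observation above):

```python
# Find node_timeline samples where reported iowait is high AND user time
# is also very high in the same row. Thresholds mirror the post above.
suspect = spark.sql("""
    SELECT
        cluster_id,
        instance_id,
        start_time,
        end_time,
        cpu_user_percent,
        cpu_system_percent,
        cpu_wait_percent,
        cpu_user_percent + cpu_wait_percent AS user_plus_wait_pct
    FROM system.compute.node_timeline
    WHERE cpu_wait_percent > 10
      AND cpu_user_percent > 90
    ORDER BY start_time DESC
""")
suspect.show(truncate=False)
```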
This seems counterintuitive. Shouldn't high I/O wait reduce the CPU's ability to perform user-level work? As far as I understand standard Linux CPU accounting, user, system, iowait, and idle time partition each core's time, so user% + wait% shouldn't even be able to exceed 100% within a single consistent sample.
Has anyone else observed this behavior in Databricks or Spark environments? Could it be an artifact of how the system tables sample or aggregate these metrics? Or could averaging across many cores mask the impact of I/O wait on any single core?
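Here's the back-of-envelope arithmetic that makes me doubt the "many cores" explanation on its own (pure arithmetic, no Databricks dependency; the 16-core split is hypothetical):

```python
# If user/system/iowait/idle partition each core's time, the node-level
# averages must also sum to <= 100%. Worst case for this puzzle: every
# core is either fully busy in user mode or fully stalled in iowait.
cores = 16
iowait_cores = 2
user_cores = cores - iowait_cores

avg_user = 100.0 * user_cores / cores    # 87.5%
avg_wait = 100.0 * iowait_cores / cores  # 12.5%

print(f"user={avg_user:.1f}%  wait={avg_wait:.1f}%  sum={avg_user + avg_wait:.1f}%")
# user=87.5%  wait=12.5%  sum=100.0%
```

Even in that extreme split, user% + wait% caps out at 100, so wait > 10 should force user < 90. Seeing both in the same row makes me suspect the two metrics are sampled or averaged over different windows (or normalized differently) rather than taken from one consistent snapshot.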
Any insights or explanations would be greatly appreciated!
Thanks in advance!