When you create clusters with classic compute on Azure Databricks, virtual machines are provisioned in Azure for both the driver and the workers. With System Tables enabled, teams can record, audit, and analyze every VM's specs, role, and uptime for advanced cost and operational management.
Managing virtual machine (VM) resources in Databricks is vital for cost optimization, performance troubleshooting, and overall platform governance. This technical blog covers how Databricks users can perform VM inventory and lifecycle analysis.
Learn how to unlock granular insights about every VM (node) participating in your clusters—whether driver or worker—using Databricks System Tables and a flexible SQL approach.
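Before diving into the full query, it helps to peek at the raw telemetry it builds on. Here is a minimal sketch (assuming System Tables are enabled and you have SELECT access on the system catalog) that previews the per-minute node records:
-- Preview recent raw rows: each row is one minute of uptime for one driver or worker VM.
SELECT
workspace_id,
cluster_id,
instance_id,
driver, -- TRUE for the driver node, FALSE for a worker
node_type,
start_time,
end_time
FROM system.compute.node_timeline
WHERE start_time >= DATEADD(day, -1, CURRENT_DATE())
LIMIT 20;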
Below is the full SQL query with detailed, practical commentary to guide customization and usage. Just run it in Databricks SQL—every comment explains why each metric matters.
-- ========================================================================================
-- This SQL query provides a per-instance, per-cluster, per-node-type, per-date breakdown
-- of Databricks VM usage, including whether the instance was a driver or worker, its specs,
-- autoscale settings, and its active minutes and average lifetime on each date.
--
-- MAIN PURPOSE:
-- To allow in-depth usage, inventory, and operational analysis for each node
-- (driver or worker VM) that participated in any Databricks cluster in the past year.
-- Provides the ability to see how long each instance was alive, what type it was, and
-- how the cluster was configured to autoscale.
--
-- Practical applications:
-- - Auditing and troubleshooting autoscaling and cluster node allocation.
-- - VM inventory and utilization analysis for cloud cost management or capacity planning.
-- - Reporting on the lifecycle/lifetime of every VM per date.
-- ========================================================================================
SELECT
nt.workspace_id, -- Databricks workspace containing the cluster.
nt.cluster_id, -- Unique identifier for the cluster.
c.cluster_name, -- Human-readable name of the cluster.
nt.node_type, -- VM instance type (e.g., Standard_D16s_v3).
nt.instance_id, -- Unique ID for this specific VM (driver/worker node).
-- Flag each instance as DRIVER or WORKER depending on its role in the cluster:
CASE WHEN nt.driver THEN 'DRIVER' ELSE 'WORKER' END AS instance_role,
-- Truncate the activity timestamp to a calendar date for per-day reporting:
DATE(nt.start_time) AS usage_day,
-- Number of unique VMs (driver or worker) matching this group (in practice, just 1 due to the grouping).
COUNT(DISTINCT nt.instance_id) AS unique_instances,
-- Total logical CPU cores = instance count × cores per instance (here, basically just per instance).
nt_core.core_count * COUNT(DISTINCT nt.instance_id) AS total_cores,
-- Total memory = instance count × memory per instance (in GB).
(nt_core.memory_mb / 1024) * COUNT(DISTINCT nt.instance_id) AS total_memory_gb,
-- Minimum autoscale worker limit recorded for the cluster in the period (variation can indicate config changes over time).
MIN(c.min_autoscale_workers) AS min_autoscale_workers,
-- Maximum autoscale worker limit recorded for the cluster in the period (shows whether the cluster could have scaled further).
MAX(c.max_autoscale_workers) AS max_autoscale_workers,
-- The total number of minutes this specific VM (instance) reported running in the period (all timeline rows).
COUNT(*) AS total_active_minutes,
-- Optional: average active minutes per instance. With this grouping it equals
-- total_active_minutes, since each group contains exactly one instance; uncomment to include it.
-- COUNT(*) / COUNT(DISTINCT nt.instance_id) AS avg_active_minutes_per_instance,
-- Average instance lifetime expressed in hours (good for summary dashboards).
(COUNT(*) / COUNT(DISTINCT nt.instance_id)) / 60 AS avg_active_hours_per_instance,
c.cluster_source, -- Raw source of the cluster (UI, API, JOB, PIPELINE, ...).
CASE
WHEN c.cluster_source IN ('UI', 'API') THEN 'All-Purpose Compute'
WHEN c.cluster_source = 'JOB' THEN 'Job Compute'
WHEN c.cluster_source IN ('PIPELINE', 'PIPELINE_MAINTENANCE') THEN 'Pipeline Compute'
ELSE 'Other Compute' -- Catch-all so unmatched sources do not surface as NULL.
END AS cluster_purpose -- Identifies the purpose of the cluster (all-purpose, job, or pipeline).
FROM
system.compute.node_timeline nt
-- Each row in node_timeline represents a minute of uptime for either a driver or worker node (instance) in a cluster.
LEFT JOIN system.compute.clusters c
ON nt.cluster_id = c.cluster_id
-- Join in cluster metadata (name, autoscale settings).
LEFT JOIN system.compute.node_types nt_core
ON nt.node_type = nt_core.node_type
-- Join in VM hardware specs (cores, memory).
WHERE
nt.start_time >= DATEADD(year, -1, CURRENT_DATE())
-- Filter for nodes started in the last year. Adjust as required for your reporting window.
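-- e.g., DATEADD(month, -3, CURRENT_DATE()) for a quarterly view; you can also add
-- AND nt.workspace_id = '<workspace_id>' (placeholder) to scope a single workspace.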
GROUP BY
nt.workspace_id,
nt.cluster_id,
c.cluster_name,
c.cluster_source,
nt.node_type,
nt.driver, -- Group by DRIVER/WORKER role
nt.instance_id, -- Group by each unique VM (so metrics reported per node, not aggregated)
nt_core.core_count,
nt_core.memory_mb,
DATE(nt.start_time)
-- Each group is: workspace, cluster, node type, date, and per-instance (per driver/worker node).
ORDER BY
c.cluster_name, usage_day, nt.node_type;
-- Organize output by cluster, period, and node type for easy review.
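If you need a coarser roll-up instead of per-instance detail, the same table supports a simple daily summary. A hypothetical follow-up sketch, counting distinct VMs and node-minutes per cluster per day over the last month:
-- Daily fleet summary per cluster: distinct VMs used and total node-minutes.
SELECT
nt.workspace_id,
nt.cluster_id,
DATE(nt.start_time) AS usage_day,
COUNT(DISTINCT nt.instance_id) AS vms_used, -- distinct driver/worker VMs that ran that day
COUNT(*) AS total_node_minutes -- one node_timeline row = one VM-minute
FROM system.compute.node_timeline nt
WHERE nt.start_time >= DATEADD(month, -1, CURRENT_DATE())
GROUP BY nt.workspace_id, nt.cluster_id, DATE(nt.start_time)
ORDER BY usage_day DESC, total_node_minutes DESC;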
Both the main query and the roll-up can be copied and pasted into Databricks SQL for immediate use, and every section of the main query is commented to guide further customization.
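For recurring dashboards, one option is to persist a derived metric as a view. A sketch, assuming you have CREATE privileges on a catalog and schema of your own (main.reporting and the view name below are placeholders):
-- Hypothetical names: replace main.reporting with a catalog.schema you can write to.
CREATE OR REPLACE VIEW main.reporting.vm_core_hours_daily AS
SELECT
nt.workspace_id,
nt.cluster_id,
DATE(nt.start_time) AS usage_day,
SUM(nt_core.core_count) / 60 AS core_hours -- each row is one VM-minute, so summed cores / 60 = core-hours
FROM system.compute.node_timeline nt
LEFT JOIN system.compute.node_types nt_core
ON nt.node_type = nt_core.node_type
GROUP BY nt.workspace_id, nt.cluster_id, DATE(nt.start_time);
Because every node_timeline row represents one VM-minute, summing core_count across rows and dividing by 60 converts the raw telemetry into core-hours per cluster per day.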
In summary, this approach enables deep operational visibility into Databricks VM lifecycle, allocation, and utilization, empowering teams to optimize cloud spend and cluster performance.