thais_henrique
Databricks Employee

Azure Databricks VM Inventory Dashboard

When you create clusters with classic compute on Azure Databricks, virtual machines are provisioned in Azure for both the driver and the workers. With System Tables enabled, teams can record, audit, and analyze every VM’s specs, role, and uptime for advanced cost and operational management.

Managing virtual machine (VM) resources in Databricks is vital for cost optimization, performance troubleshooting, and overall platform governance. This technical blog covers how Databricks users can perform VM inventory and lifecycle analysis.

Learn how to unlock granular insights about every VM (node) participating in your clusters—whether driver or worker—using Databricks System Tables and a flexible SQL approach.

 

Prerequisites: What You Need

  • System Tables Enabled: Your workspace must have Databricks System Tables turned on.
  • Databricks SQL Warehouse: You need a SQL warehouse to execute the provided inventory query and power the dashboard (a quick access check follows below).
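
Before importing anything, you can confirm that System Tables are reachable from your SQL warehouse. A minimal sanity check, assuming access to the same system.compute schema the inventory query below uses:

-- Quick access check: this should return a few rows if System Tables are
-- enabled and you have been granted access to the system.compute schema.
SELECT workspace_id, cluster_id, instance_id, node_type, start_time
FROM system.compute.node_timeline
LIMIT 5;

If this fails with a permission or table-not-found error, ask your account admin to enable System Tables and grant access before proceeding.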

Importing the Dashboard: Fast and Easy

  • JSON Dashboard File Provided: Download the Dashboard JSON file and import it into any Databricks workspace.
  • Automatic Configuration: The dashboard is pre-configured for the System Table schemas and the SQL logic shown below; just point it at a running SQL warehouse.
  • Customizable: Once imported, you can adjust filters, visuals, or widgets to fit your organization’s clusters and reporting requirements.

 

How to Import

  1. Go to Workspace > Dashboards in your Databricks UI.
  2. Click Import Dashboard (or the “Upload” button).
  3. Select the provided JSON file and choose your active SQL warehouse.
  4. Open the dashboard and enjoy instant analytics; no coding is required.


 

What Does the Dashboard Visualize?

  • Instance counts by cluster, workspace, node type, and date.
  • Resource allocation (cores, memory) over time (see the sketch below).
  • Active VM lifecycles for audit and optimization.
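
To illustrate the roll-up behind the second visual, here is a standalone sketch (an illustration of the pattern, not necessarily the dashboard's exact query) that computes daily VM counts and allocated capacity per node type from the same System Tables:

-- Daily VM count and allocated capacity per node type over the last 30 days.
-- Each node_timeline row is one minute of uptime, so DISTINCT instance_id
-- avoids multiplying the count by the number of minutes.
SELECT
  DATE(nt.start_time)                                      AS usage_day,
  nt.node_type,
  COUNT(DISTINCT nt.instance_id)                           AS vm_count,
  COUNT(DISTINCT nt.instance_id) * ntc.core_count          AS allocated_cores,
  COUNT(DISTINCT nt.instance_id) * (ntc.memory_mb / 1024)  AS allocated_memory_gb
FROM system.compute.node_timeline nt
LEFT JOIN system.compute.node_types ntc
  ON nt.node_type = ntc.node_type
WHERE nt.start_time >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY DATE(nt.start_time), nt.node_type, ntc.core_count, ntc.memory_mb
ORDER BY usage_day, nt.node_type;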

Richly Commented SQL Query for VM Lifecycle

Below is the full SQL query with detailed, practical commentary to guide customization and usage. Just run it in Databricks SQL—every comment explains why each metric matters.

 

-- ========================================================================================
-- This SQL query provides a per-instance, per-cluster, per-node-type, per-date breakdown
-- of Databricks VM usage, including whether the instance was a driver or worker, its specs,
-- the cluster's autoscale settings, and how long each instance was actually active on each date.
-- 
-- MAIN PURPOSE: 
-- To allow in-depth usage, inventory, and operational analysis for each node
-- (driver or worker VM) that participated in any Databricks cluster in the past year.
-- Provides the ability to see how long each instance was alive, what type it was, and
-- how the cluster was configured to autoscale.
-- 
-- Practical applications: 
-- - Auditing and troubleshooting autoscaling and cluster node allocation.
-- - VM inventory and utilization analysis for cloud cost management or capacity planning.
-- - Reporting on the lifecycle/lifetime of every VM per date.
-- ========================================================================================

SELECT
  nt.workspace_id,                                             -- Databricks workspace containing the cluster.
  nt.cluster_id,                                               -- Unique identifier for the cluster.
  c.cluster_name,                                              -- Human-readable name of the cluster.
  nt.node_type,                                                -- VM instance type (e.g., Standard_D16s_v3).
  nt.instance_id,                                              -- Unique ID for this specific VM (driver/worker node).

  -- Flag each instance as DRIVER or WORKER depending on its role in the cluster:
  CASE WHEN nt.driver THEN 'DRIVER' ELSE 'WORKER' END AS instance_role, 
  
  -- Truncate each timeline row's start_time to a calendar date for daily grouping:
  DATE(nt.start_time) AS usage_day,

  -- Number of unique VMs (driver or worker) in this group. Because instance_id
  -- is part of the GROUP BY, this is always 1; it is kept for roll-up queries.
  COUNT(DISTINCT nt.instance_id) AS unique_instances,

  -- Total logical CPU cores = instance count × cores per instance (with the
  -- per-instance grouping the multiplier is 1, i.e., the instance's own core count).
  nt_core.core_count * COUNT(DISTINCT nt.instance_id) AS total_cores,

  -- Total memory in GB = instance count × memory per instance (memory_mb is reported in MB).
  (nt_core.memory_mb / 1024) * COUNT(DISTINCT nt.instance_id) AS total_memory_gb,

  -- Autoscale worker limits from the cluster's most recent configuration
  -- (MIN/MAX are needed only because these columns are not in the GROUP BY).
  MIN(c.min_autoscale_workers) AS min_autoscale_workers,
  MAX(c.max_autoscale_workers) AS max_autoscale_workers,

  -- The total number of minutes this specific VM (instance) reported running in the period (all timeline rows).
  COUNT(*) AS total_active_minutes,

  -- (Optional) average minutes each instance was alive; with this per-instance
  -- grouping it equals total_active_minutes, so it is left commented out.
  -- COUNT(*) / COUNT(DISTINCT nt.instance_id) AS avg_active_minutes_per_instance,

  -- Average instance lifetime expressed in hours (good for summary dashboards).
  (COUNT(*) / COUNT(DISTINCT nt.instance_id)) / 60 AS avg_active_hours_per_instance,
  c.cluster_source,                                            -- Raw source of the cluster (e.g., UI, API, JOB, PIPELINE).
  CASE
    WHEN c.cluster_source IN ('UI', 'API') THEN 'All-Purpose Compute'
    WHEN c.cluster_source = 'JOB' THEN 'Job Compute'
    WHEN c.cluster_source IN ('PIPELINE', 'PIPELINE_MAINTENANCE') THEN 'Pipeline Compute'
    ELSE c.cluster_source                                      -- Fall back to the raw value for any other source.
  END AS cluster_purpose                                       -- Friendly label for the cluster's purpose (all-purpose, job, or pipeline).
FROM
  system.compute.node_timeline nt
  -- Each row in node_timeline represents a minute of uptime for either a driver or worker node (instance) in a cluster.

  LEFT JOIN (
    -- system.compute.clusters keeps one row per cluster configuration change,
    -- so take only the most recent record per cluster_id; joining all versions
    -- would duplicate the per-minute timeline rows and inflate the minute counts.
    SELECT *
    FROM system.compute.clusters
    QUALIFY ROW_NUMBER() OVER (PARTITION BY cluster_id ORDER BY change_time DESC) = 1
  ) c
    ON nt.cluster_id = c.cluster_id
    -- Join in cluster metadata (name, autoscale settings, source).

  LEFT JOIN system.compute.node_types nt_core 
    ON nt.node_type = nt_core.node_type
    -- Join in VM hardware specs (cores, memory).

WHERE
  nt.start_time >= DATEADD(year, -1, CURRENT_DATE())
    -- Filter for nodes started in the last year. Adjust as required for your reporting window.

GROUP BY
  nt.workspace_id,
  nt.cluster_id,
  c.cluster_name,
  c.cluster_source,
  nt.node_type,
  nt.driver,              -- Group by DRIVER/WORKER role
  nt.instance_id,         -- Group by each unique VM (so metrics reported per node, not aggregated)
  nt_core.core_count,
  nt_core.memory_mb,
  DATE(nt.start_time)
    -- Each group is: workspace, cluster, node type, date, and per-instance (per driver/worker node).
ORDER BY
  c.cluster_name, usage_day, nt.node_type;
    -- Organize output by cluster, period, and node type for easy review.

This query can be copied and pasted into Databricks SQL for immediate use, and every section is well-documented for further customization.
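
If you want a quick leaderboard on top of the same data, a small variation ranks clusters by total VM-hours. Because each node_timeline row represents one minute of uptime for one node, a plain COUNT is enough (a hedged sketch, not part of the dashboard itself):

-- Top 10 clusters by total VM-hours over the last 30 days.
-- COUNT(*) counts minute-rows across all nodes; dividing by 60.0 yields VM-hours.
SELECT
  nt.cluster_id,
  COUNT(*) / 60.0 AS total_vm_hours
FROM system.compute.node_timeline nt
WHERE nt.start_time >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY nt.cluster_id
ORDER BY total_vm_hours DESC
LIMIT 10;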

In summary, the dashboard enables deep operational visibility into the Databricks VM lifecycle, allocation, and utilization, empowering teams to optimize cloud spend and cluster performance.