Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
pathakrutuja
Databricks Employee

Overview

Why Platform Administration & Observability Matter

As data platforms scale, cost and complexity also scale along with them.
Platform teams today are expected to:

  • Control cloud spend
  • Enable teams to move fast
  • Prove value to leadership

Without visibility, cost becomes a black box.
Over-control slows innovation.
Too much freedom leads to runaway spend.
The answer isn’t tighter restrictions — it’s observability-driven guardrails.
“Monitoring keeps the lights on. Observability explains why the lights behave the way they do.”

Monitoring vs Observability

Monitoring                Observability
Detects issues            Explains behaviour
Reactive                  Proactive
Metric-focused            Context-rich
"What broke?"             "Why did it happen — and what should we do?"

Cost control at scale requires observability, not just alerts.

Understanding the Databricks Cost Model

Databricks uses usage-based pricing:

  • Cost is driven by DBUs
  • Pricing varies by workload type
  • Compute scales independently from storage

Inefficient workloads can silently multiply costs.
“Observability is essential to link DBUs to real usage patterns and teams.”
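To make the pricing mechanics concrete, here is a minimal sketch of how a DBU-based charge composes, assuming illustrative per-DBU rates and a negotiated discount (real rates vary by SKU, cloud, and contract):

```python
# Sketch of how DBU-based billing composes.
# The rate, discount, and FX values here are illustrative, not Databricks list prices.
def billed_cost(dbus: float, rate_per_dbu: float,
                discount_pct: float = 0.0, fx_rate: float = 1.0) -> float:
    """Cost = DBUs x per-DBU rate, less any negotiated discount,
    converted from USD at fx_rate."""
    usd = dbus * rate_per_dbu * (1 - discount_pct / 100)
    return usd * fx_rate

# Example: 1,200 DBUs on a hypothetical $0.55/DBU SKU with a 10% discount
cost = billed_cost(1200, 0.55, discount_pct=10)
print(round(cost, 2))  # 594.0
```

The same discount and currency-conversion logic appears later as configurable parameters of the materialization notebook.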

Why System Tables Are the Foundation

System tables in Databricks serve as a centralized source of truth, offering unified visibility across key operational and governance areas. 

They provide essential insights into:

  • Usage & billing
  • Compute behavior
  • Jobs & queries
  • Security & access

Because they expose fine-grained signals, they power:

  • Cost attribution
  • Performance analysis
  • Governance
  • Long-term trend analysis

This dashboard is built entirely on top of system tables, ensuring accuracy, scale, and explainability.
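As a rough illustration of the attribution these tables enable, the sketch below rolls up sample usage records by workspace and SKU. The field names are simplified stand-ins for this example, not the actual system table schema:

```python
from collections import defaultdict

# Illustrative rollup of the kind of cost attribution billing system tables enable.
# Records mimic usage rows with workspace, SKU, and DBU quantity (sample data only).
records = [
    {"workspace": "ws-prod", "sku": "JOBS_COMPUTE", "dbus": 320.0},
    {"workspace": "ws-prod", "sku": "ALL_PURPOSE",  "dbus": 180.0},
    {"workspace": "ws-dev",  "sku": "ALL_PURPOSE",  "dbus": 95.0},
]

totals = defaultdict(float)
for r in records:
    totals[(r["workspace"], r["sku"])] += r["dbus"]

for (ws, sku), dbus in sorted(totals.items()):
    print(f"{ws:8s} {sku:13s} {dbus:7.1f} DBUs")
```

In the dashboard itself, this grouping happens in SQL over the billing tables; the idea is the same: fine-grained records aggregated up to the team or workspace level.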

AI/BI Dashboards for Platform Administration and Observability

What This Dashboard Solves

Databricks already provides a product-level cost observability dashboard, which is largely optimized for account and metastore administrators.
But there is an opportunity to extend these capabilities with granular, team- and workload-level insights.

Our objective is to solve this by providing granular cost attribution with drill-down capabilities, highlighting optimization opportunities, delivering executive-level summaries, and offering actionable insights for platform teams.

Available Dashboard Pages

This dashboard is structured as a multi-page, SKU-based view to enable intuitive navigation and faster insights. Instead of relying heavily on multiple filters for slicing and dicing, users can directly access dedicated pages for each SKU category, allowing focused analysis, improved clarity, and quicker decision-making.

  1. Executive Summary: High-level overview of total Databricks cost, savings opportunities, usage trends, and workspace-level distribution across various SKUs.
  2. All-Purpose Cluster Cost Analysis: High-level analysis of interactive cluster spending, usage patterns, and opportunities to migrate workloads to job clusters.
  3. Job Cluster Cost Analysis: High-level analysis of job cluster costs, run behavior, resource utilization, and operational efficiency.
  4. Serverless Cost Analysis: High-level analysis of serverless workloads, with spending broken down across serverless products, usage patterns, and user-level attribution.
  5. SQL Warehouse Analysis: High-level analysis of DBSQL utilization and cost, with insights for SQL Warehouses across types, configurations, and uptime behavior.
  6. Unfollowed Best Practices: Identification of configuration and governance gaps impacting cost, performance, and compliance.
  7. Executive Summary Details: Comprehensive analysis of granular cost distribution across regions, workspaces, SKUs, cluster types, and time-based trends.
  8. All-Purpose Cost Analysis Details: Comprehensive analysis of interactive cluster-level configuration, cost attribution, and optimization candidates.
  9. Job Cluster Cost Analysis Details: Comprehensive analysis of job clusters with run-level cost, performance metrics, and job efficiency evaluation.
  10. Serverless Analysis Details: Comprehensive analysis of serverless resources, jobs, notebooks, and user-level usage and cost breakdown.
  11. SQL Warehouse Details: Comprehensive analysis of SQL Warehouse configuration, uptime tracking, SKU attribution, and governance compliance.
  12. Unfollowed Best Practices Details: Comprehensive listing of non-compliant clusters, warehouses, and configuration inefficiencies.

Getting Started 

Pre-Requisites

  • Active Databricks workspace
  • The following system schemas must be enabled:
    • system.access
    • system.billing
    • system.compute
    • system.lakeflow
  • The user / service principal used to publish the dashboard should have the required access to the system tables.
  • The user / service principal used to publish the dashboard should have CAN USE permission on the underlying SQL Warehouse.

Step 1: Download Dashboard Assets

Download all required assets from here and run/schedule the script to further materialize tables and deploy the pre-built AI/BI dashboard.

You’ll get:

  • Notebook with materialized-table queries: materialize_dashboard_queries_run_parallely.py
    • Materializes all the key dashboard tables, covering cost, usage, reliability, and platform hygiene
    • Applies customized discounts and currency conversion (if applicable)
    • Acts as a single source of truth
  • AI/BI Dashboard (.lvdash): Databricks Cost Tracking.lvdash.json
    • Pre-built datasets and visualizations
  • Full documentation is available in the README.md file included in the repository.

Step 2: Import Assets into Your Workspace

Place both assets wherever best fits your workspace structure.

Step 3: Run the Queries Notebook and materialize the tables.

Run this materialization notebook and schedule it as a Lakeflow job (a daily schedule is recommended).
Configure the following parameters as per your requirements:

  • destination_catalog
    • The catalog where the materialized tables will be stored
    • Defaults to main
    • Ensure the catalog exists
  • destination_schema
    • The schema within the catalog to store the materialized tables
    • Defaults to default
    • Ensure the schema exists
  • currency_conversion
    • Conversion rate to apply if reporting in a currency other than USD
    • Defaults to USD (no conversion)
  • discount
    • Percentage discount to apply to usage costs, if applicable
    • Defaults to 0%
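The parameter defaults above can be sketched as plain Python. In the actual notebook these values would typically come from widgets or job parameters; this is only an illustration of the resolution logic:

```python
# Hedged sketch: resolving the four notebook parameters with their documented
# defaults. In the real notebook these come from widgets / job parameters.
DEFAULTS = {
    "destination_catalog": "main",
    "destination_schema": "default",
    "currency_conversion": 1.0,   # 1.0 = report in USD, no conversion
    "discount": 0.0,              # percentage discount on usage costs
}

def resolve_params(overrides=None):
    """Merge user-supplied overrides onto the documented defaults."""
    params = dict(DEFAULTS)
    params.update(overrides or {})
    return params

params = resolve_params({"destination_catalog": "finops", "discount": 12.5})
print(params["destination_catalog"], params["discount"])  # finops 12.5
```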

Step 4: Point the Dashboard to Your Custom Catalog and Schema

Once the Materialize Table Queries notebook completes, update the dashboard datasets using either method below:

  • Update the datasets in the dashboard through the UI:
    • Open the AI/BI dashboard
    • Navigate to the Data tab
    • Update all queries to point to the configured destination catalog and schema (from the Materialized Table Queries notebook)

OR

  • Update via the .lvdash JSON:
    • Replace main.default with the configured destination catalog and schema (from the Queries notebook)

By default, the destination_catalog and destination_schema are set to “main” and “default”, unless modified.
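The JSON route can be scripted. The sketch below assumes the dataset queries appear as plain strings inside the exported file; the real .lvdash layout may differ, so treat this as an illustration of the find-and-replace, not a supported API:

```python
import json

# Hedged sketch: repoint dashboard dataset queries from main.default to your
# destination catalog.schema by rewriting the exported .lvdash JSON text.
# Assumes queries are embedded as plain strings; the real layout may differ.
def repoint(lvdash_text: str, catalog: str, schema: str) -> str:
    return lvdash_text.replace("main.default", f"{catalog}.{schema}")

sample = json.dumps({"datasets": [{"query": "SELECT * FROM main.default.cost_summary"}]})
updated = json.loads(repoint(sample, "finops", "observability"))
print(updated["datasets"][0]["query"])
# SELECT * FROM finops.observability.cost_summary
```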

Step 5: Refresh the Dashboard

Refresh the dashboard and confirm the visuals are populated.
You can also schedule automatic refreshes (daily recommended).

Now you’re live!

Diving deeper into Dashboard Pages

Executive Summary

This page provides an executive-level overview of platform cost, utilization, efficiency, and potential savings across environments and workspaces. Designed for metastore and workspace admins, it enables quick assessment of spend, performance, and optimization opportunities through the key KPIs below.

Summary Metrics

Key Metrics:


Fig 1: Executive summary: Key Metrics

  • Total Cost & Usage (All Workloads)
    • Consolidated view of cost and DBU usage across all workload types, aggregated by time, workspace, and region, with pricing, discounts, and currency conversion applied.
  • Potential Savings Opportunity
    • Estimates actionable cost savings for All-Purpose and Job clusters based on recent usage patterns, compute utilization, and job execution efficiency.
  • Cost Breakdown by Compute & Execution Outcome: 
    • Breaks down spend by compute type and job outcome, highlighting the cost impact of failed runs and reliability gaps.


Fig 2: Executive summary: Cost Breakdown

  • Cost & Potential Savings by Compute Type:
    • Combines current spend with optimization opportunities, including workload migration, right-sizing, and efficiency improvements.


Fig 3: Executive summary: Cost & Potential Savings

  • Daily/Monthly Trends
    • Tracks how total cost and potential savings evolve over time to identify patterns and optimization impact.


Fig 4: Executive summary: Daily/Monthly Trends

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "Executive Summary Details" page, as shown below.


Fig 5: Executive summary details

  • Cost Summary by Workspace: shows total DBU usage and cost by workspace over time, enabling quick comparison of spend across environments.
  • Perfect for:
    • Metastore admins
    • Platform leads
    • FinOps stakeholders

All-Purpose Cluster 

This section provides a comprehensive view of All-Purpose cluster usage and costs, combining high-level spend and usage metrics with detailed cluster and job-level insights. It helps identify cost drivers, track usage trends, and uncover savings opportunities by highlighting workloads that are better suited for job clusters.

Summary Metrics

Key Metrics:


Fig 6: All Purpose Cluster summary: Key Metrics

  • Total Cost ($): 
    • Exact spend incurred by All-Purpose clusters.
  • Total Usage (DBUs): 
    • Overall compute consumption on All-Purpose clusters.
  • Total Job Runs: 
    • Total volume of job executions, showing workload intensity.
  • Potential Savings Opportunity: 
    • Highlights the estimated savings that could be realized by shifting eligible workloads from All-Purpose to Job clusters.


Fig 7: All Purpose Cluster summary: Other Metrics

  • Total Cost by All-Purpose Cluster Types and SKUs: 
    • Shows how costs are distributed across different cluster configurations and SKUs, helping identify major contributors to spend.
  • Daily/Monthly Trends
    • Illustrates how All-Purpose cluster costs and associated savings opportunities change over time, enabling pattern recognition and optimization planning.

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "All Purpose Cost Analysis Details" page, as shown below.


Fig 8: All Purpose Cluster details

  • All-Purpose Cluster Costs Insights
    • Shows the total cost and DBU usage per All-Purpose cluster over the selected period, giving a high-level view of resource consumption and spending by cluster, workspace, and SKU. It also shows the compute configurations to the end user to reflect and take further actions.
  • Jobs Run On All-Purpose Cluster With Savings Opportunity
    • Provides job-run level usage, costs, and potential savings for the all-purpose compute used, thereby highlighting the need to switch to job compute and reduce costs. 
  • Why it matters

    • Highlights interactive workloads quietly driving cost
    • Identifies jobs better suited for job clusters
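The savings estimate behind this page can be sketched simply: DBUs that jobs burned on All-Purpose compute would have been billed at the (lower) job-compute rate instead. The rates below are illustrative placeholders, not Databricks list prices:

```python
# Hedged sketch of the All-Purpose -> Job cluster migration savings idea.
# Rates are illustrative placeholders, not Databricks list prices.
ALL_PURPOSE_RATE = 0.50   # hypothetical $/DBU on All-Purpose compute
JOB_RATE = 0.25           # hypothetical $/DBU on Job compute

def migration_savings(job_dbus_on_all_purpose: float) -> float:
    """Estimated saving if these job DBUs had run on job compute instead."""
    return job_dbus_on_all_purpose * (ALL_PURPOSE_RATE - JOB_RATE)

print(migration_savings(1000))  # 250.0
```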

Job Cluster 

This section delivers end-to-end visibility into job compute usage and costs, from overall spend and trends to detailed job- and run-level analysis. It helps identify inefficient, failure-prone, or high-cost jobs, quantify savings opportunities through better configuration and tuning, and guide right-sizing decisions to optimize job performance and cost efficiency.

Summary Metrics:


Fig 9: Job Cluster summary: Key Metrics

Key Metrics:

    • Total Cost ($): 
      • Total spend for job runs using job clusters
    • Total Usage (DBUs): 
      • Overall job compute consumption
    • Total Job Runs: 
      • Volume of job executions
    • Potential Savings Opportunity: 
      • Estimated savings by tuning the job compute configurations


Fig 10: Job Cluster summary: Other metrics

  • Cost Breakdown: 
    • By SKU specific to job compute
    • By Job run Types (Databricks workflows or externally triggered using ADF etc.)
    • By Job Run Result (Succeeded or Failed)
  • Daily/Monthly Trends

    • Changes in cost and savings opportunities over time for optimization planning
    • Across job run status like succeeded, failed, and cancelled runs, helping pinpoint inefficiencies and optimize workload performance.

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "Job Cluster Cost Analysis Details" page.


Fig 11: Job Cluster details

  • Total cost and DBU usage per cluster, workspace, SKU
  • Job wise costing: 
    • Provides a detailed view of each job, the SKUs used, and associated costs, helping identify high-cost jobs that may need attention.
  • Run-wise Costing, Usage, and Potential Savings: 
    • Calculates success/failure counts and rates per job, estimating potential savings based on CPU utilization to highlight inefficient runs and optimization opportunities.
  • Job Execution Cost Analysis: 
    • Combines success ratios, total job runs, and potential savings to identify high-cost or failure-prone jobs and guide resource optimization.
  • Why it matters

    • Pinpoints inefficient or failure-prone jobs
    • Guides right-sizing and compute tuning
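The run-level metrics can be sketched as follows. The 70% CPU-utilization target used here is an assumption for illustration; the dashboard's actual savings formula may weigh utilization differently:

```python
# Hedged sketch of run-level efficiency metrics: success rate per job, cost of
# failed runs, and a rough savings estimate from low CPU utilization.
# The 70% utilization target is an assumption, not the dashboard's exact rule.
def job_efficiency(runs):
    """runs: list of dicts with 'succeeded' (bool), 'cost' ($), 'cpu_util' (0-1)."""
    total = len(runs)
    ok = sum(r["succeeded"] for r in runs)
    failed_cost = sum(r["cost"] for r in runs if not r["succeeded"])
    # Treat spend up to the 70% utilization target as needed; the rest is slack.
    slack = sum(r["cost"] * max(0.0, 0.70 - r["cpu_util"]) / 0.70
                for r in runs if r["succeeded"])
    return {"success_rate": ok / total, "failed_cost": failed_cost,
            "potential_savings": round(slack, 2)}

runs = [
    {"succeeded": True,  "cost": 40.0, "cpu_util": 0.35},
    {"succeeded": True,  "cost": 60.0, "cpu_util": 0.70},
    {"succeeded": False, "cost": 25.0, "cpu_util": 0.20},
]
print(job_efficiency(runs))
```

A job with a low success rate and high failed-run cost surfaces at the top of the "failure-prone" list; a job with low utilization surfaces as a right-sizing candidate.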

Note: This dashboard is designed specifically for Azure Databricks and includes insights such as Cost of Job Cluster by Run Type (Workflow run vs. ADF run) and Job ID or Pipeline Name (ADF pipeline), which are tailored to Azure. Minor tweaks may be required to adapt it for other cloud platforms.

Serverless 

This section provides clear visibility into serverless usage and costs by combining high-level spend trends with detailed workload, user, and asset-level insights. It helps uncover hidden cost drivers behind serverless convenience, highlights who and what is consuming the most serverless DBUs, and enables accountability and informed optimization decisions.

Summary Metrics 


Fig 12: Serverless summary: Key Metrics

Key Metrics:

    • Total Cost ($): 
      • Total spend for workloads using serverless
    • Total Usage (DBUs): 
      • Overall serverless consumption


Fig 13: Serverless summary: Other metrics

  • Cost Breakdown:

    • By SKU: 
      • Cost distribution for different serverless specific SKUs
    • By Workload Types: 
      • Costs attributing to various workloads using Serverless (Interactive notebooks, Jobs, Apps, etc.)
  • Daily/Monthly Trends

    • Tracks changes in cost across various serverless workload types over time for optimization planning

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "Serverless Analysis Details" page.


Fig 14: Serverless details

  • Top Overall Serverless Spend: 
    • Highlights the workloads with the highest serverless cost, providing a clear view of where the majority of spend is concentrated.
  • Top Users with maximum serverless usage: 
    • Highlights the users driving the most serverless compute and spend, helping identify accountability and opportunities for optimization.
  • Top Jobs with maximum serverless usage: 
    • Provides a detailed view of the most resource-intensive serverless jobs, including the user owning the job, helping pinpoint high-cost or inefficient jobs.
  • Top Notebooks with maximum serverless usage: 
    • Provides a list of interactive notebooks using serverless along with the user details, to guide resource management and optimization.
  • Why it matters
    • Serverless is convenient but can hide heavy usage
    • This brings accountability and transparency

SQL Warehouse

This section provides a unified view of SQL Warehouse cost, usage, and uptime across workspaces and warehouse types. By combining spend trends, configuration details, and uptime insights, it helps uncover always-on or inefficient warehouses, understand key cost drivers, and identify optimization opportunities through better sizing, scheduling, and usage patterns.

Summary Metrics


Fig 15: SQL warehouse summary: Key metrics

Key Metrics:

  • Total SQL Warehouse Cost ($): Overall spend across all SQL Warehouses (Classic, Pro, Serverless).
  • Total SQL Usage: Aggregate DBU consumption by SQL Warehouses, reflecting overall query activity.


Fig 16: SQL warehouse summary: other metric

  • Cost Breakdown:
    • By SQL Warehouse Type: Distribution of cost across Classic, Pro, and Serverless SQL Warehouses.
    • By SKU: Cost split by SQL Warehouse SKUs, highlighting pricing drivers across warehouse configurations.
    • By Workspace: Contribution of each workspace to the total SQL Warehouse spend.
  • Uptime & Utilization Insights:
    • Top 10 Workspaces by Warehouse Uptime: Identifies workspaces where SQL Warehouses remain active the longest, signaling sustained usage or potential inefficiencies.
  • Daily/Monthly Trends

    • SQL Warehouse Cost Trend: Tracks daily and monthly cost movement across SQL Warehouses to identify growth patterns, seasonality, and optimization windows.

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "SQL Warehouse Details" page.


Fig 17: SQL warehouse details

  • SQL Warehouse Cost Insights: 
    • Breaks down SQL Warehouse costs along with key configuration details to understand cost drivers.
  • Top Warehouses by Uptime: 
    • Shows warehouses with the highest uptime hours to identify always-on cost hotspots.
  • Why it matters
    • Uptime and configuration visibility reveal always-on and inefficient warehouses
    • Trend analysis highlights growth patterns and optimization windows
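The "always-on" detection can be sketched by comparing each warehouse's uptime hours to the hours in the reporting window. The 90% threshold is an assumption for illustration, not a product rule:

```python
# Hedged sketch: flag "always-on" SQL Warehouses by comparing uptime hours
# to the reporting window. The 90% threshold is an illustrative assumption.
def always_on(warehouses, window_hours: float, threshold: float = 0.9):
    """warehouses: list of dicts with 'name' and 'uptime_hours'."""
    return [w["name"] for w in warehouses
            if w["uptime_hours"] / window_hours >= threshold]

fleet = [
    {"name": "wh-etl",   "uptime_hours": 720.0},  # up the entire 30-day window
    {"name": "wh-adhoc", "uptime_hours": 95.0},
]
print(always_on(fleet, window_hours=720))  # ['wh-etl']
```

Warehouses flagged this way are prime candidates for tighter auto-stop settings or scheduling.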

Unfollowed Best Practices

Surfaces silent cost risks, including:

  • Outdated DBR versions: Older runtimes may not benefit from the latest built-in performance enhancements and cost optimizations.
  • Missing or high auto-termination: Leads to clusters running idle and incurring unnecessary costs.
  • Fixed worker counts (no autoscaling): Prevents dynamic scaling, causing overprovisioning or underutilization.
  • On-demand only clusters: Misses cost savings available through spot options.
  • Workspaces with:
    • Idle warehouses: active SQL Warehouses with little or no query activity still generate cost.
    • Missing or high auto-stop settings for SQL Warehouses: warehouses keep running longer than required, thereby increasing spend.
    • Missing current channel: prevents access to the latest features, fixes, and performance enhancements.
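Checks of this kind reduce to simple rules over cluster configuration. The sketch below is illustrative: the field names and thresholds are assumptions for this example, not the dashboard's actual schema:

```python
# Hedged sketch of simple hygiene checks like the ones this page surfaces.
# Field names and thresholds are illustrative, not the dashboard's schema.
def hygiene_findings(cluster: dict) -> list:
    findings = []
    term = cluster.get("autotermination_minutes", 0)
    if term in (0, None) or term > 120:
        findings.append("missing or high auto-termination")
    if not cluster.get("autoscale", False):
        findings.append("fixed worker count (no autoscaling)")
    if cluster.get("spot_instances", 0) == 0:
        findings.append("on-demand only (no spot usage)")
    return findings

risky = {"autotermination_minutes": 0, "autoscale": False, "spot_instances": 0}
print(hygiene_findings(risky))
# ['missing or high auto-termination', 'fixed worker count (no autoscaling)',
#  'on-demand only (no spot usage)']
```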


Fig 18: Unfollowed best practices summary


Fig 19: Unfollowed best practices details

The Summary tab helps you quickly understand what’s happening at the workspace level, while the Details tab lets you dive deeper into the exact workspaces, computes, and jobs driving those insights.

These insights help teams:

  • Reduce technical debt
  • Prevent avoidable spend
  • Improve platform hygiene

Genie Integration

Databricks Genie is an AI-powered conversational interface that transforms how you interact with data. Instead of writing complex SQL queries or navigating through multiple dashboard filters, Genie allows you to simply ask questions in natural language and receive instant, accurate insights directly from your data lakehouse. Enabling Genie alongside your Databricks AI/BI dashboard creates a powerful dual-mode analytics experience, as shown below:

 


Fig 20: Query insights using Genie with natural language.

To enable Genie, follow these steps:

  1. Open the dashboard in Draft mode.
  2. Navigate to the kebab menu (three dots).
  3. Go to Settings.
  4. Click General.
  5. Enable Genie.
  6. Link your Genie space.

Built-In Operational Strength

  • Automated Operations: Routine maintenance tasks such as VACUUM and OPTIMIZE are automated in the Materialized Table Queries notebook to ensure performance efficiency and storage optimization for the Materialized tables.
  • Custom Discounts and Currency Conversion: Costs are adjusted for applicable discounts and can be converted into the required currency to provide accurate, standardized financial reporting across regions.
  • Scheduled Queries and Dashboards: All queries and dashboards can be scheduled to run at a set frequency, ensuring dashboard data is consistently refreshed and always up to date with the latest system table data.
  • It is recommended to run the queries notebook and dashboard using a Service Principal to ensure appropriate access to system tables.
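As a small illustration of the automated maintenance, the statements the notebook issues per materialized table can be generated like this (table names are illustrative; in the notebook they would be executed via Spark SQL):

```python
# Hedged sketch: generating the routine maintenance statements the notebook
# automates for each materialized table. Table names are illustrative; the
# notebook itself would execute these via Spark SQL.
def maintenance_sql(tables):
    stmts = []
    for t in tables:
        stmts.append(f"OPTIMIZE {t}")  # compact small files for read performance
        stmts.append(f"VACUUM {t}")    # remove stale files for storage efficiency
    return stmts

for s in maintenance_sql(["main.default.cost_summary"]):
    print(s)
```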

Final Takeaway

This dashboard isn’t just about cost reporting.
It enables:

  • Informed decisions
  • Accountability without friction
  • Optimization without slowing teams
  • Confidence in platform scale

When observability connects cost to behavior, platform teams move from reactive actions to proactive decision-making.

Appendix

  1. System Tables
  2. AI/ BI Dashboard
  3. AI/ BI Genie Integration