Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
pathakrutuja
Databricks Employee

Overview

Why Platform Administration & Observability Matter

As data platforms scale, cost and complexity also scale along with them.
Platform teams today are expected to:

  • Control cloud spend
  • Enable teams to move fast
  • Prove value to leadership

Without visibility, cost becomes a black box.
Over-control slows innovation.
Too much freedom leads to runaway spend.
The answer isn’t tighter restrictions — it’s observability-driven guardrails.
“Monitoring keeps the lights on. Observability explains why the lights behave the way they do.”

Monitoring vs Observability

Monitoring                Observability
Detects issues            Explains behaviour
Reactive                  Proactive
Metric-focused            Context-rich
"What broke?"             "Why did it happen — and what should we do?"

Cost control at scale requires observability, not just alerts.

Understanding the Databricks Cost Model

Databricks uses usage-based pricing:

  • Cost is driven by DBUs
  • Pricing varies by workload type
  • Compute scales independently from storage

Inefficient workloads can silently multiply costs.
“Observability is essential to link DBUs to real usage patterns and teams.”
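To make the pricing mechanics concrete, here is a minimal sketch of how a DBU-based charge composes, assuming illustrative per-DBU rates and a negotiated discount (real rates vary by SKU, cloud, and contract):

```python
# Sketch of how DBU-based billing composes.
# The rate, discount, and FX values here are illustrative, not Databricks list prices.
def billed_cost(dbus: float, rate_per_dbu: float,
                discount_pct: float = 0.0, fx_rate: float = 1.0) -> float:
    """Cost = DBUs x per-DBU rate, less any negotiated discount,
    converted from USD at fx_rate."""
    usd = dbus * rate_per_dbu * (1 - discount_pct / 100)
    return usd * fx_rate

# Example: 1,200 DBUs on a hypothetical $0.55/DBU SKU with a 10% discount
cost = billed_cost(1200, 0.55, discount_pct=10)
print(round(cost, 2))  # 594.0
```

The same discount and currency-conversion logic appears later as configurable parameters of the materialization notebook.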

Why System Tables Are the Foundation

System tables in Databricks serve as a centralized source of truth, offering unified visibility across key operational and governance areas. 

They provide essential insights into:

  • Usage & billing
  • Compute behavior
  • Jobs & queries
  • Security & access

Because they expose fine-grained signals, they power:

  • Cost attribution
  • Performance analysis
  • Governance
  • Long-term trend analysis

This dashboard is built entirely on top of system tables, ensuring accuracy, scale, and explainability.
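As a rough illustration of the attribution these tables enable, the sketch below rolls up sample usage records by workspace and SKU. The field names are simplified stand-ins for this example, not the actual system table schema:

```python
from collections import defaultdict

# Illustrative rollup of the kind of cost attribution billing system tables enable.
# Records mimic usage rows with workspace, SKU, and DBU quantity (sample data only).
records = [
    {"workspace": "ws-prod", "sku": "JOBS_COMPUTE", "dbus": 320.0},
    {"workspace": "ws-prod", "sku": "ALL_PURPOSE",  "dbus": 180.0},
    {"workspace": "ws-dev",  "sku": "ALL_PURPOSE",  "dbus": 95.0},
]

totals = defaultdict(float)
for r in records:
    totals[(r["workspace"], r["sku"])] += r["dbus"]

for (ws, sku), dbus in sorted(totals.items()):
    print(f"{ws:8s} {sku:13s} {dbus:7.1f} DBUs")
```

In the dashboard itself, this grouping happens in SQL over the billing tables; the idea is the same: fine-grained records aggregated up to the team or workspace level.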

AI/BI Dashboards for Platform Administration and Observability

What This Dashboard Solves

Databricks already provides a product-level cost observability dashboard, which is largely optimized for account and metastore administrators.
But there is an opportunity to extend these capabilities with granular, team- and workload-level insights.

Our objective is to solve this by providing granular cost attribution with drill-down capabilities, highlighting optimization opportunities, delivering executive-level summaries, and offering actionable insights for platform teams.

Available Dashboard Pages

This dashboard is structured as a multi-page, SKU-based view to enable intuitive navigation and faster insights. Instead of relying heavily on multiple filters for slicing and dicing, users can directly access dedicated pages for each SKU category, allowing focused analysis, improved clarity, and quicker decision-making.

  1. Executive Summary: High-level overview of total Databricks cost, savings opportunities, usage trends, and workspace-level distribution across various SKUs.
  2. All-Purpose Cluster Cost Analysis: High-level analysis of interactive cluster spending, usage patterns, and opportunities to migrate workloads to job clusters.
  3. Job Cluster Cost Analysis: High-level analysis of job cluster costs, run behavior, resource utilization, and operational efficiency.
  4. Serverless Cost Analysis: High-level analysis of serverless workloads, with spending broken down across serverless products, usage patterns, and user-level attribution.
  5. SQL Warehouse Analysis: High-level analysis of DBSQL utilization and cost, with insights for SQL Warehouses across types, configurations, and uptime behavior.
  6. Unfollowed Best Practices: Identification of configuration and governance gaps impacting cost, performance, and compliance.
  7. Executive Summary Details: Comprehensive analysis of granular cost distribution across regions, workspaces, SKUs, cluster types, and time-based trends.
  8. All-Purpose Cost Analysis Details: Comprehensive analysis of interactive cluster-level configuration, cost attribution, and optimization candidates.
  9. Job Cluster Cost Analysis Details: Comprehensive analysis of job clusters with run-level cost, performance metrics, and job efficiency evaluation.
  10. Serverless Analysis Details: Comprehensive analysis of serverless resources, jobs, notebooks, and user-level usage and cost breakdown.
  11. SQL Warehouse Details: Comprehensive analysis of SQL Warehouse configuration, uptime tracking, SKU attribution, and governance compliance.
  12. Unfollowed Best Practices Details: Comprehensive listing of non-compliant clusters, warehouses, and configuration inefficiencies.

Getting Started 

Pre-Requisites

  • Active Databricks workspace
  • The following system schemas must be enabled:
    • system.access
    • system.billing
    • system.compute
    • system.lakeflow
  • The user / service principal used to publish the dashboard should have the required access to the system tables.
  • The user / service principal used to publish the dashboard should have CAN USE permission on the underlying SQL Warehouse.

Step 1: Download Dashboard Assets

Download all required assets from here and run/schedule the script to further materialize tables and deploy the pre-built AI/BI dashboard.

You’ll get:

  • Notebook with materialized-table queries: materialize_dashboard_queries_run_parallely.py
    • Materializes all the key dashboard tables, covering cost, usage, reliability, and platform hygiene
    • Applies customized discounts and currency conversion (if applicable)
    • Acts as a single source of truth
  • AI/BI Dashboard (.lvdash): Databricks Cost Tracking.lvdash.json
    • Pre-built datasets and visualizations
  • Full documentation is available in the README.md file included in the repository.

Step 2: Import Assets into Your Workspace

Place both assets wherever best fits your workspace structure.

Step 3: Run the Queries Notebook and materialize the tables.

Run this materialization notebook and schedule it as a Lakeflow job (a daily schedule is recommended).
Configure the following parameters as per your requirements:

  • destination_catalog
    • The catalog where the materialized tables will be stored
    • Defaults to main
    • Ensure the catalog exists
  • destination_schema
    • The schema within the catalog to store the materialized tables
    • Defaults to default
    • Ensure the schema exists
  • currency_conversion
    • Conversion rate to apply if reporting in a currency other than USD
    • Defaults to USD (no conversion)
  • discount
    • Percentage discount to apply to usage costs, if applicable
    • Defaults to 0%
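The parameter defaults above can be sketched as plain Python. In the actual notebook these values would typically come from widgets or job parameters; this is only an illustration of the resolution logic:

```python
# Hedged sketch: resolving the four notebook parameters with their documented
# defaults. In the real notebook these come from widgets / job parameters.
DEFAULTS = {
    "destination_catalog": "main",
    "destination_schema": "default",
    "currency_conversion": 1.0,   # 1.0 = report in USD, no conversion
    "discount": 0.0,              # percentage discount on usage costs
}

def resolve_params(overrides=None):
    """Merge user-supplied overrides onto the documented defaults."""
    params = dict(DEFAULTS)
    params.update(overrides or {})
    return params

params = resolve_params({"destination_catalog": "finops", "discount": 12.5})
print(params["destination_catalog"], params["discount"])  # finops 12.5
```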

Step 4: Point the Dashboard to Your Custom Catalog and Schema

Once the Materialize Table Queries notebook completes, update the dashboard datasets using either method below:

  • Update the datasets in the dashboard through the UI:
    • Open the AI/BI dashboard
    • Navigate to the Data tab
    • Update all queries to point to the configured destination catalog and schema (from the Materialized Table Queries notebook)

OR

  • Update via the .lvdash JSON:
    • Replace main.default with the configured destination catalog and schema (from the Queries notebook)

By default, the destination_catalog and destination_schema are set to “main” and “default”, unless modified.
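The JSON route can be scripted. The sketch below assumes the dataset queries appear as plain strings inside the exported file; the real .lvdash layout may differ, so treat this as an illustration of the find-and-replace, not a supported API:

```python
import json

# Hedged sketch: repoint dashboard dataset queries from main.default to your
# destination catalog.schema by rewriting the exported .lvdash JSON text.
# Assumes queries are embedded as plain strings; the real layout may differ.
def repoint(lvdash_text: str, catalog: str, schema: str) -> str:
    return lvdash_text.replace("main.default", f"{catalog}.{schema}")

sample = json.dumps({"datasets": [{"query": "SELECT * FROM main.default.cost_summary"}]})
updated = json.loads(repoint(sample, "finops", "observability"))
print(updated["datasets"][0]["query"])
# SELECT * FROM finops.observability.cost_summary
```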

Step 5: Refresh the Dashboard

Refresh the dashboard and confirm the visuals are populated.
You can also schedule automatic refreshes (daily recommended).

Now you’re live!

Diving deeper into Dashboard Pages

Executive Summary

This page provides an executive-level overview of platform cost, utilization, efficiency, and potential savings across environments and workspaces. Designed for metastore and workspace admins, it enables quick assessment of spend, performance, and optimization opportunities through the key KPIs below.

Summary Metrics

Key Metrics:


Fig 1: Executive summary: Key Metrics

  • Total Cost & Usage (All Workloads)
    • Consolidated view of cost and DBU usage across all workload types, aggregated by time, workspace, and region, with pricing, discounts, and currency conversion applied.
  • Potential Savings Opportunity
    • Estimates actionable cost savings for All-Purpose and Job clusters based on recent usage patterns, compute utilization, and job execution efficiency.
  • Cost Breakdown by Compute & Execution Outcome: 
    • Breaks down spend by compute type and job outcome, highlighting the cost impact of failed runs and reliability gaps.


Fig 2: Executive summary: Cost Breakdown

  • Cost & Potential Savings by Compute Type:
    • Combines current spend with optimization opportunities, including workload migration, right-sizing, and efficiency improvements.


Fig 3: Executive summary: Cost & Potential Savings

  • Daily/Monthly Trends
    • Tracks how total cost and potential savings evolve over time to identify patterns and optimization impact.


Fig 4: Executive summary: Daily/Monthly Trends

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "Executive Summary Details" page, as shown below.


Fig 5: Executive summary details

  • Cost Summary by Workspace: shows total DBU usage and cost by workspace over time, enabling quick comparison of spend across environments.
  • Perfect for:
    • Metastore admins
    • Platform leads
    • FinOps stakeholders

All-Purpose Cluster 

This section provides a comprehensive view of All-Purpose cluster usage and costs, combining high-level spend and usage metrics with detailed cluster and job-level insights. It helps identify cost drivers, track usage trends, and uncover savings opportunities by highlighting workloads that are better suited for job clusters.

Summary Metrics

Key Metrics:


Fig 6: All Purpose Cluster summary: Key Metrics

  • Total Cost ($): 
    • Exact spend incurred by All-Purpose clusters.
  • Total Usage (DBUs): 
    • Overall compute consumption on All-Purpose clusters.
  • Total Job Runs: 
    • Total volume of job executions, showing workload intensity.
  • Potential Savings Opportunity: 
    • Highlights the estimated savings that could be realized by shifting eligible workloads from All-Purpose to Job clusters.


Fig 7: All Purpose Cluster summary: Other Metrics

  • Total Cost by All-Purpose Cluster Types and SKUs: 
    • Shows how costs are distributed across different cluster configurations and SKUs, helping identify major contributors to spend.
  • Daily/Monthly Trends
    • Illustrates how All-Purpose cluster costs and associated savings opportunities change over time, enabling pattern recognition and optimization planning.

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "All Purpose Cost Analysis Details" page, as shown below.


Fig 8: All Purpose Cluster details

  • All-Purpose Cluster Costs Insights
    • Shows the total cost and DBU usage per All-Purpose cluster over the selected period, giving a high-level view of resource consumption and spending by cluster, workspace, and SKU. It also shows the compute configurations to the end user to reflect and take further actions.
  • Jobs Run On All-Purpose Cluster With Savings Opportunity
    • Provides job-run level usage, costs, and potential savings for the all-purpose compute used, thereby highlighting the need to switch to job compute and reduce costs. 
  • Why it matters

    • Highlights interactive workloads quietly driving cost
    • Identifies jobs better suited for job clusters
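The savings estimate behind this page can be sketched simply: DBUs that jobs burned on All-Purpose compute would have been billed at the (lower) job-compute rate instead. The rates below are illustrative placeholders, not Databricks list prices:

```python
# Hedged sketch of the All-Purpose -> Job cluster migration savings idea.
# Rates are illustrative placeholders, not Databricks list prices.
ALL_PURPOSE_RATE = 0.50   # hypothetical $/DBU on All-Purpose compute
JOB_RATE = 0.25           # hypothetical $/DBU on Job compute

def migration_savings(job_dbus_on_all_purpose: float) -> float:
    """Estimated saving if these job DBUs had run on job compute instead."""
    return job_dbus_on_all_purpose * (ALL_PURPOSE_RATE - JOB_RATE)

print(migration_savings(1000))  # 250.0
```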

Job Cluster 

This section delivers end-to-end visibility into job compute usage and costs, from overall spend and trends to detailed job- and run-level analysis. It helps identify inefficient, failure-prone, or high-cost jobs, quantify savings opportunities through better configuration and tuning, and guide right-sizing decisions to optimize job performance and cost efficiency.

Summary Metrics:


Fig 9: Job Cluster summary: Key Metrics

Key Metrics:

    • Total Cost ($): 
      • Total spend for job runs using job clusters
    • Total Usage (DBUs): 
      • Overall job compute consumption
    • Total Job Runs: 
      • Volume of job executions
    • Potential Savings Opportunity: 
      • Estimated savings by tuning the job compute configurations


Fig 10: Job Cluster summary: Other metrics

  • Cost Breakdown: 
    • By SKU specific to job compute
    • By Job run Types (Databricks workflows or externally triggered using ADF etc.)
    • By Job Run Result (Succeeded or Failed)
  • Daily/Monthly Trends

    • Changes in cost and savings opportunities over time for optimization planning
    • Across job run status like succeeded, failed, and cancelled runs, helping pinpoint inefficiencies and optimize workload performance.

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "Job Cluster Cost Analysis Details" page.


Fig 11: Job Cluster details

  • Total cost and DBU usage per cluster, workspace, SKU
  • Job wise costing: 
    • Provides a detailed view of each job, the SKUs used, and associated costs, helping identify high-cost jobs that may need attention.
  • Run-wise Costing, Usage, and Potential Savings: 
    • Calculates success/failure counts and rates per job, estimating potential savings based on CPU utilization to highlight inefficient runs and optimization opportunities.
  • Job Execution Cost Analysis: 
    • Combines success ratios, total job runs, and potential savings to identify high-cost or failure-prone jobs and guide resource optimization.
  • Why it matters

    • Pinpoints inefficient or failure-prone jobs
    • Guides right-sizing and compute tuning
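The run-level metrics can be sketched as follows. The 70% CPU-utilization target used here is an assumption for illustration; the dashboard's actual savings formula may weigh utilization differently:

```python
# Hedged sketch of run-level efficiency metrics: success rate per job, cost of
# failed runs, and a rough savings estimate from low CPU utilization.
# The 70% utilization target is an assumption, not the dashboard's exact rule.
def job_efficiency(runs):
    """runs: list of dicts with 'succeeded' (bool), 'cost' ($), 'cpu_util' (0-1)."""
    total = len(runs)
    ok = sum(r["succeeded"] for r in runs)
    failed_cost = sum(r["cost"] for r in runs if not r["succeeded"])
    # Treat spend up to the 70% utilization target as needed; the rest is slack.
    slack = sum(r["cost"] * max(0.0, 0.70 - r["cpu_util"]) / 0.70
                for r in runs if r["succeeded"])
    return {"success_rate": ok / total, "failed_cost": failed_cost,
            "potential_savings": round(slack, 2)}

runs = [
    {"succeeded": True,  "cost": 40.0, "cpu_util": 0.35},
    {"succeeded": True,  "cost": 60.0, "cpu_util": 0.70},
    {"succeeded": False, "cost": 25.0, "cpu_util": 0.20},
]
print(job_efficiency(runs))
```

A job with a low success rate and high failed-run cost surfaces at the top of the "failure-prone" list; a job with low utilization surfaces as a right-sizing candidate.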

Note: This dashboard is designed specifically for Azure Databricks and includes insights such as Cost of Job Cluster by Run Type (Workflow run vs. ADF run) and Job ID or Pipeline Name (ADF pipeline), which are tailored to Azure. Minor tweaks may be required to adapt it for other cloud platforms.

Serverless 

This section provides clear visibility into serverless usage and costs by combining high-level spend trends with detailed workload, user, and asset-level insights. It helps uncover hidden cost drivers behind serverless convenience, highlights who and what is consuming the most serverless DBUs, and enables accountability and informed optimization decisions.

Summary Metrics 


Fig 12: Serverless summary: Key Metrics

Key Metrics:

    • Total Cost ($): 
      • Total spend for workloads using serverless
    • Total Usage (DBUs): 
      • Overall serverless consumption


Fig 13: Serverless summary: Other metrics

  • Cost Breakdown:

    • By SKU: 
      • Cost distribution for different serverless specific SKUs
    • By Workload Types: 
      • Costs attributing to various workloads using Serverless (Interactive notebooks, Jobs, Apps, etc.)
  • Daily/Monthly Trends

    • Tracks changes in cost across various serverless workload types over time for optimization planning

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "Serverless Analysis Details" page.


Fig 14: Serverless details

  • Top Overall Serverless Spend: 
    • Highlights the workloads with the highest serverless cost, providing a clear view of where the majority of spend is concentrated.
  • Top Users with maximum serverless usage: 
    • Highlights the users driving the most serverless compute and spend, helping identify accountability and opportunities for optimization.
  • Top Jobs with maximum serverless usage: 
    • Provides a detailed view of the most resource-intensive serverless jobs, including the user owning the job, helping pinpoint high-cost or inefficient jobs.
  • Top Notebooks with maximum serverless usage: 
    • Provides a list of interactive notebooks using serverless along with the user details, to guide resource management and optimization.
  • Why it matters
    • Serverless is convenient but can hide heavy usage
    • This brings accountability and transparency

SQL Warehouse

This section provides a unified view of SQL Warehouse cost, usage, and uptime across workspaces and warehouse types. By combining spend trends, configuration details, and uptime insights, it helps uncover always-on or inefficient warehouses, understand key cost drivers, and identify optimization opportunities through better sizing, scheduling, and usage patterns.

Summary Metrics


Fig 15: SQL warehouse summary: Key metrics

Key Metrics:

  • Total SQL Warehouse Cost ($): Overall spend across all SQL Warehouses (Classic, Pro, Serverless).
  • Total SQL Usage: Aggregate DBU consumption by SQL Warehouses, reflecting overall query activity.


Fig 16: SQL warehouse summary: other metric

  • Cost Breakdown:
    • By SQL Warehouse Type: Distribution of cost across Classic, Pro, and Serverless SQL Warehouses.
    • By SKU: Cost split by SQL Warehouse SKUs, highlighting pricing drivers across warehouse configurations.
    • By Workspace: Contribution of each workspace to the total SQL Warehouse spend.
  • Uptime & Utilization Insights:
    • Top 10 Workspaces by Warehouse Uptime: Identifies workspaces where SQL Warehouses remain active the longest, signaling sustained usage or potential inefficiencies.
  • Daily/Monthly Trends

    • SQL Warehouse Cost Trend: Tracks daily and monthly cost movement across SQL Warehouses to identify growth patterns, seasonality, and optimization windows.

Detailed Metrics: You can explore the detailed metrics either via "Drill to" or by navigating to the "SQL Warehouse Details" page.


Fig 17: SQL warehouse details

  • SQL Warehouse Cost Insights: 
    • Breaks down SQL Warehouse costs along with key configuration details to understand cost drivers.
  • Top Warehouses by Uptime: 
    • Shows warehouses with the highest uptime hours to identify always-on cost hotspots.
  • Why it matters
    • Uptime and configuration visibility reveal always-on and inefficient warehouses
    • Trend analysis highlights growth patterns and optimization windows
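The "always-on" detection can be sketched by comparing each warehouse's uptime hours to the hours in the reporting window. The 90% threshold is an assumption for illustration, not a product rule:

```python
# Hedged sketch: flag "always-on" SQL Warehouses by comparing uptime hours
# to the reporting window. The 90% threshold is an illustrative assumption.
def always_on(warehouses, window_hours: float, threshold: float = 0.9):
    """warehouses: list of dicts with 'name' and 'uptime_hours'."""
    return [w["name"] for w in warehouses
            if w["uptime_hours"] / window_hours >= threshold]

fleet = [
    {"name": "wh-etl",   "uptime_hours": 720.0},  # up the entire 30-day window
    {"name": "wh-adhoc", "uptime_hours": 95.0},
]
print(always_on(fleet, window_hours=720))  # ['wh-etl']
```

Warehouses flagged this way are prime candidates for tighter auto-stop settings or scheduling.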

Unfollowed Best Practices

Surfaces silent cost risks, including:

  • Outdated DBR versions: Older runtimes may not benefit from the latest built-in performance enhancements and cost optimizations.
  • Missing or high auto-termination: Leads to clusters running idle and incurring unnecessary costs.
  • Fixed worker counts (no autoscaling): Prevents dynamic scaling, causing overprovisioning or underutilization.
  • On-demand only clusters: Misses cost savings available through spot options.
  • Workspaces with:
    • Idle warehouses: active SQL Warehouses with little or no query activity still generate cost.
    • Missing or high auto-stop settings for SQL Warehouses: warehouses keep running longer than required, thereby increasing spend.
    • Missing current channel: prevents access to the latest features, fixes, and performance enhancements.
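Checks of this kind reduce to simple rules over cluster configuration. The sketch below is illustrative: the field names and thresholds are assumptions for this example, not the dashboard's actual schema:

```python
# Hedged sketch of simple hygiene checks like the ones this page surfaces.
# Field names and thresholds are illustrative, not the dashboard's schema.
def hygiene_findings(cluster: dict) -> list:
    findings = []
    term = cluster.get("autotermination_minutes", 0)
    if term in (0, None) or term > 120:
        findings.append("missing or high auto-termination")
    if not cluster.get("autoscale", False):
        findings.append("fixed worker count (no autoscaling)")
    if cluster.get("spot_instances", 0) == 0:
        findings.append("on-demand only (no spot usage)")
    return findings

risky = {"autotermination_minutes": 0, "autoscale": False, "spot_instances": 0}
print(hygiene_findings(risky))
# ['missing or high auto-termination', 'fixed worker count (no autoscaling)',
#  'on-demand only (no spot usage)']
```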


Fig 18: Unfollowed best practices summary


Fig 19: Unfollowed best practices details

The Summary tab helps you quickly understand what’s happening at the workspace level, while the Details tab lets you dive deeper into the exact workspaces, computes, and jobs driving those insights.

These insights help teams:

  • Reduce technical debt
  • Prevent avoidable spend
  • Improve platform hygiene

Genie Integration

Databricks Genie is an AI-powered conversational interface that transforms how you interact with data. Instead of writing complex SQL queries or navigating through multiple dashboard filters, Genie allows you to simply ask questions in natural language and receive instant, accurate insights directly from your data lakehouse. Enabling Genie alongside your Databricks AI/BI dashboard creates a powerful dual-mode analytics experience, as shown below:

 


Fig 20: Query insights using Genie with natural language.

To enable Genie, follow these steps:

  1. Open the dashboard in Draft mode.
  2. Navigate to the kebab menu (three dots).
  3. Go to Settings.
  4. Click General.
  5. Enable Genie.
  6. Link your Genie space.

Built-In Operational Strength

  • Automated Operations: Routine maintenance tasks such as VACUUM and OPTIMIZE are automated in the Materialized Table Queries notebook to ensure performance efficiency and storage optimization for the Materialized tables.
  • Custom Discounts and Currency Conversion: Costs are adjusted for applicable discounts and can be converted into the required currency to provide accurate, standardized financial reporting across regions.
  • Scheduled Queries and Dashboards: All queries and dashboards can be scheduled to run at a set frequency, ensuring dashboard data is consistently refreshed and always up to date with the latest system table data.
  • It is recommended to run the queries notebook and dashboard using a Service Principal to ensure appropriate access to system tables.
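As a small illustration of the automated maintenance, the statements the notebook issues per materialized table can be generated like this (table names are illustrative; in the notebook they would be executed via Spark SQL):

```python
# Hedged sketch: generating the routine maintenance statements the notebook
# automates for each materialized table. Table names are illustrative; the
# notebook itself would execute these via Spark SQL.
def maintenance_sql(tables):
    stmts = []
    for t in tables:
        stmts.append(f"OPTIMIZE {t}")  # compact small files for read performance
        stmts.append(f"VACUUM {t}")    # remove stale files for storage efficiency
    return stmts

for s in maintenance_sql(["main.default.cost_summary"]):
    print(s)
```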

Final Takeaway

This dashboard isn’t just about cost reporting.
It enables:

  • Informed decisions
  • Accountability without friction
  • Optimization without slowing teams
  • Confidence in platform scale

When observability connects cost to behavior, platform teams move from reactive actions to proactive decision-making.

Appendix

  1. System Tables
  2. AI/ BI Dashboard
  3. AI/ BI Genie Integration