Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cluster idle time and usage details

Phani1
Valued Contributor II

How can we find out the usage details of a Databricks cluster? Specifically, we need to know how many nodes are in use, how long the cluster is idle, the time it takes to start up, and the jobs it is running along with their durations. Is there a query we can run on the system tables to check if the cluster is over-provisioned or if it is using its configuration effectively? We also need similar information for serverless compute and SQL warehouses. Please provide any queries that can help us gather these details.

5 REPLIES

Walter_C
Databricks Employee

You can try the following query on your system tables:

-- Per-record billing usage for the last 30 days, attributed to a cluster
WITH cluster_usage AS (
  SELECT
    u.usage_metadata.cluster_id,
    u.usage_date,
    u.usage_start_time,
    u.usage_end_time,
    DATEDIFF(SECOND, u.usage_start_time, u.usage_end_time) AS duration_seconds,
    u.usage_quantity AS dbus_used,
    -- Classify each usage record by what triggered it
    CASE
      WHEN u.usage_metadata.job_id IS NOT NULL THEN 'job'
      WHEN u.usage_metadata.notebook_id IS NOT NULL THEN 'notebook'
      ELSE 'interactive'
    END AS usage_type,
    u.usage_metadata.job_id,
    u.usage_metadata.notebook_id
  FROM system.billing.usage u
  WHERE u.usage_date >= DATEADD(DAY, -30, CURRENT_DATE)
    AND u.usage_metadata.cluster_id IS NOT NULL
),
-- system.compute.clusters keeps one row per configuration change,
-- so take only the most recent row per cluster to avoid double-counting
latest_cluster_config AS (
  SELECT *
  FROM system.compute.clusters
  QUALIFY ROW_NUMBER() OVER (PARTITION BY cluster_id ORDER BY change_time DESC) = 1
)
SELECT
  c.cluster_id,
  c.cluster_name,
  c.dbr_version,
  c.worker_count,
  c.auto_termination_minutes,
  cu.usage_date,
  SUM(cu.duration_seconds) AS total_duration_seconds,
  -- Rough idle estimate: billed time not attributed to a job or notebook
  SUM(CASE WHEN cu.usage_type = 'interactive' THEN cu.duration_seconds ELSE 0 END) AS idle_seconds,
  SUM(cu.dbus_used) AS total_dbus_used,
  COUNT(DISTINCT CASE WHEN cu.usage_type = 'job' THEN cu.job_id END) AS unique_jobs_run,
  AVG(CASE WHEN cu.usage_type = 'job' THEN cu.duration_seconds END) AS avg_job_duration_seconds
FROM latest_cluster_config c
JOIN cluster_usage cu
  ON c.cluster_id = cu.cluster_id
GROUP BY
  c.cluster_id,
  c.cluster_name,
  c.dbr_version,
  c.worker_count,
  c.auto_termination_minutes,
  cu.usage_date
ORDER BY c.cluster_name, cu.usage_date

This query provides:

  1. Cluster configuration details (cluster_id, cluster_name, Databricks Runtime version, worker count, autotermination settings)
  2. Daily usage statistics
  3. Idle time estimation (assuming interactive usage without a job or notebook ID is idle time)
  4. Total DBUs used
  5. Number of unique jobs run
  6. Average job duration

To identify whether a cluster is over-provisioned or not using its configuration effectively, look for patterns such as the following (a query sketch for the first pattern follows the list):

  1. High idle time relative to total duration
  2. Low DBU usage compared to the worker count
  3. Few jobs running on clusters with many workers
  4. Long average job durations on clusters with many workers
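
As a rough starting point for pattern 1, the sketch below ranks clusters by the share of billed time over the last 30 days that is not attributed to any job or notebook. This is only an illustration, not an official Databricks metric, and the CTE name usage_by_type is just a placeholder:

-- Rank clusters by the share of billed time with no job or notebook attached
WITH usage_by_type AS (
  SELECT
    u.usage_metadata.cluster_id AS cluster_id,
    DATEDIFF(SECOND, u.usage_start_time, u.usage_end_time) AS duration_seconds,
    CASE
      WHEN u.usage_metadata.job_id IS NOT NULL
        OR u.usage_metadata.notebook_id IS NOT NULL THEN 'attributed'
      ELSE 'unattributed'
    END AS usage_type
  FROM system.billing.usage u
  WHERE u.usage_date >= DATEADD(DAY, -30, CURRENT_DATE)
    AND u.usage_metadata.cluster_id IS NOT NULL
)
SELECT
  cluster_id,
  SUM(duration_seconds) AS total_seconds,
  SUM(CASE WHEN usage_type = 'unattributed' THEN duration_seconds ELSE 0 END) AS unattributed_seconds,
  -- Rough idle proxy: 0.70 means 70% of billed time had no job or notebook attached
  ROUND(SUM(CASE WHEN usage_type = 'unattributed' THEN duration_seconds ELSE 0 END)
        / SUM(duration_seconds), 2) AS idle_ratio
FROM usage_by_type
GROUP BY cluster_id
ORDER BY idle_ratio DESC

Clusters near the top of this list that also have a high worker_count are good candidates for downsizing or tighter auto-termination.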

hboleto
New Contributor II

Sorry for hijacking, but I have a similar need: I would specifically like to know how much an interactive cluster is actually being used compared to its uptime, either as time up vs. time in use, or as potential DBUs vs. how many were used to do actual work.

Is there a way to get this information?

Thank you 

Isi
Contributor

Hey!

You can approach this by analyzing cluster usage through Spark UI, monitoring DBU consumption, and leveraging cloud provider cost tracking.

Spark UI provides insights into job execution times, resource utilization, and idle vs. active periods. Reviewing these metrics can help identify inefficiencies and periods of underutilization. Additionally, Databricks charges DBUs (Databricks Units) based on cluster uptime and resource allocation, meaning costs accrue even when the cluster is idle. Factors like auto-scaling, instance types, and idle time can influence DBU consumption.
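
If you want that active-vs-idle split as a query rather than by clicking through the Spark UI, one option is the per-minute hardware samples in system.compute.node_timeline. The sketch below is only an approximation: the 10% CPU threshold is an assumption you should tune, and because each row is one node for one minute, the counts are instance-minutes rather than cluster wall-clock minutes:

-- Busy vs. idle instance-minutes per cluster over the last 7 days
SELECT
  cluster_id,
  DATE(start_time) AS usage_date,
  COUNT(*) AS sampled_instance_minutes,
  SUM(CASE WHEN cpu_user_percent + cpu_system_percent >= 10 THEN 1 ELSE 0 END) AS busy_instance_minutes,
  SUM(CASE WHEN cpu_user_percent + cpu_system_percent < 10 THEN 1 ELSE 0 END) AS idle_instance_minutes,
  ROUND(AVG(cpu_user_percent + cpu_system_percent), 1) AS avg_cpu_percent
FROM system.compute.node_timeline
WHERE start_time >= DATEADD(DAY, -7, CURRENT_DATE)
  AND NOT driver                      -- workers only; drop this line to include the driver
GROUP BY cluster_id, DATE(start_time)
ORDER BY cluster_id, usage_date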

Another useful approach is checking the monitoring tools provided by your cloud provider. AWS CloudWatch, Azure Cost Management, and GCP Billing Reports can help you track instance uptime and correlate it with Databricks usage. This allows you to better understand how much you’re paying for actual workloads versus idle time.

If you’re using SQL warehouses, Databricks offers built-in monitoring under the SQL > Monitoring tab, where you can track warehouse uptime, the number of clusters started, tasks executed, and queued workloads. This can provide a clearer picture of usage patterns and help with cost optimization.
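
The same kind of information is also queryable. A minimal sketch (assuming the system.compute.warehouse_events table is available in your workspace) that counts daily lifecycle events per warehouse:

-- Daily warehouse event counts for the last 30 days
SELECT
  warehouse_id,
  DATE(event_time) AS event_date,
  event_type,
  COUNT(*) AS events
FROM system.compute.warehouse_events
WHERE event_time >= DATEADD(DAY, -30, CURRENT_DATE)
GROUP BY warehouse_id, DATE(event_time), event_type
ORDER BY warehouse_id, event_date, event_type

You can go further and reconstruct uptime windows from the start and stop events, but even raw counts show how often a warehouse is being spun up and scaled.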

To optimize costs, consider enabling auto-termination to shut down clusters when they’re idle, using System Tables to analyze job-level costs, and possibly running workloads as scheduled jobs instead of interactive sessions when appropriate.
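
For the job-level cost part, a minimal sketch could look like the query below. It prices usage at list price via system.billing.list_prices, so negotiated discounts are not reflected, and the join conditions are a simplification of the documented pricing-window join:

-- Approximate per-job spend over the last 30 days, priced at list price
SELECT
  u.usage_metadata.job_id AS job_id,
  SUM(u.usage_quantity) AS dbus_used,
  SUM(u.usage_quantity * lp.pricing.default) AS est_list_price_cost
FROM system.billing.usage u
JOIN system.billing.list_prices lp
  ON u.sku_name = lp.sku_name
 AND u.usage_unit = lp.usage_unit
 AND u.usage_end_time >= lp.price_start_time
 AND (lp.price_end_time IS NULL OR u.usage_end_time < lp.price_end_time)
WHERE u.usage_metadata.job_id IS NOT NULL
  AND u.usage_date >= DATEADD(DAY, -30, CURRENT_DATE)
GROUP BY u.usage_metadata.job_id
ORDER BY est_list_price_cost DESC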

If you find this answer helpful, feel free to mark it as resolved or give it a 👍!

🙂

hboleto
New Contributor II

Thank you for your feedback, but in this case these won't help. Let me clarify the situation, since most of the time these suggestions would be absolutely spot on.

In my case, at the moment I have an all-purpose interactive cluster for my developers to use, which means its usage is very ad hoc and it's used by a growing number of users (half a dozen now, but we intend to at least double that in the short term and to keep adding).

Now, with the addition of serverless compute that we can use to run our notebooks, I wanted to predict more accurately how much switching to it from the all-purpose cluster would cost. A straight comparison of how long I have the cluster up versus what serverless would cost for the same amount of time is not very accurate and gives a skewed picture, so I was trying to find a way to get a more accurate idea of how much serverless usage we would actually have.

Isi
Contributor

Hey @hboleto 

It’s difficult to accurately estimate the final cost of a Serverless cluster, as it is fully managed by Databricks. In contrast, Classic clusters allow for finer resource tuning since you can define spot instances and other instance types to optimize costs.

I understand that you’re considering the use of SQL Warehouse Serverless vs. Classic for running notebooks, with the assumption that it will help you spend less on your cloud provider. While this is partially true in terms of management, you should keep in mind that Serverless is typically 10-20% more expensive than Classic, based on my experience. The key advantage of Serverless is its immediacy and convenience for specific queries, but if your team is growing, the overall usage might become more intensive than expected, even with a 1-minute auto-termination policy.
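
If you want to check that premium for your own account rather than rely on my experience, the list prices themselves are queryable. A sketch only: the ILIKE filters are assumptions about SKU naming and may need adjusting for your cloud and tier:

-- Currently effective list prices for all-purpose vs. serverless SKUs
SELECT
  sku_name,
  usage_unit,
  pricing.default AS list_price_per_unit
FROM system.billing.list_prices
WHERE price_end_time IS NULL            -- only prices still in effect
  AND (sku_name ILIKE '%ALL_PURPOSE%' OR sku_name ILIKE '%SERVERLESS%')
ORDER BY sku_name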

If you’re using it solely for “small” and occasional queries, Serverless could be a cost-effective option in some cases. However, in a real-world development environment, it’s unlikely that a user would spin up a cluster just to run a single SELECT statement and not use it again. Additionally, for queries that take only a few seconds, Serverless’ seamless scaling can lead to unexpected costs, as you don’t directly control how the cluster operates. This could result in paying for additional instances without realizing it.

When it comes to job clusters, I wouldn’t recommend using Serverless under any circumstances. It’s much better to invest time in properly provisioning resources on a Classic cluster, as it allows for more granular cost control and predictable performance over time.

I also recall reading this post, where someone shared their experience with Serverless clusters in a very insightful way.

Pricing calculator 

Best regards 🙂
