Databricks Compute Metrics Alerts

pranaav93 — Tue, 14 Oct 2025 07:08:37 GMT

Hi All,

I'm looking for implementation ideas for using the system.compute.node_timeline table to catch memory spikes and, if usage is above a given threshold, restart the cluster through an API call.

Have any of you implemented a similar solution? Any references would help.

Re: Databricks Compute Metrics Alerts

NandiniN — Wed, 15 Oct 2025 14:06:39 GMT

Hey @pranaav93

Using the system table system.compute.node_timeline to build alerting and remediation is a very common use case.

Check this KB article: https://kb.databricks.com/en_US/clusters/getting-node-specific-instead-of-cluster-wide-memory-usage-data-from-system-compute-node_timeline. Its use case is to programmatically get a cluster’s memory usage.

To check memory usage (in MB, as computed by the query below), join the table node_timeline with the table node_types. Run the following code in a notebook, through a job, or with Databricks SQL.

select cluster_id, instance_id, start_time, end_time, round(mem_used_percent / 100 * node_types.memory_mb, 0) as mem_used_mb
from system.compute.node_timeline
join system.compute.node_types using(node_type)
order by start_time desc;

Next, add a threshold (e.g., mem_used_percent > 90) and a sustained duration (e.g., at least 3 consecutive minutes above the threshold) to filter out transient spikes. Make sure the flagged clusters are stored in a table, for example a Delta table such as monitoring.alert_cluster_restart.
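The threshold-plus-duration filter could be sketched in Python roughly as follows. This is untested against a real workspace and only illustrates the logic: the function name find_sustained_breaches and the tuple layout of the input rows are made up for this example, and it assumes you have already fetched recent node_timeline rows (one sample per node per minute) into memory.

```python
from collections import defaultdict


def find_sustained_breaches(samples, threshold=90.0, min_consecutive=3):
    """Return (cluster_id, instance_id) pairs with a sustained memory breach.

    samples: iterable of (cluster_id, instance_id, start_time, mem_used_percent)
             tuples. This layout is an assumption made for the sketch.
    A node is flagged when at least `min_consecutive` samples in a row
    (ordered by start_time) are above `threshold`. Gaps between samples
    are not checked, so "consecutive" means consecutive rows.
    """
    by_node = defaultdict(list)
    for cluster_id, instance_id, start_time, pct in samples:
        by_node[(cluster_id, instance_id)].append((start_time, pct))

    breaches = []
    for node, rows in by_node.items():
        rows.sort(key=lambda r: r[0])  # oldest sample first
        streak = 0
        for _, pct in rows:
            # Extend the streak on a breach, reset it on any sample below the threshold.
            streak = streak + 1 if pct > threshold else 0
            if streak >= min_consecutive:
                breaches.append(node)
                break
    return breaches
```

The same filter can also be expressed directly in SQL with window functions; the Python version is just easier to unit test before wiring it into a job.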

Then schedule the SQL query or notebook to run as a Databricks Job every 5-10 minutes. When the query returns results, or the table has entries that have not yet been addressed, you can restart the cluster using the REST API: https://docs.databricks.com/api/workspace/clusters/restart
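For the restart step, a minimal sketch of the REST call using only the Python standard library might look like this. It is not tested against a live workspace. The helper names, the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, and the /api/2.1/ path prefix are assumptions for this example; check the API page linked above for the exact path and required permissions in your workspace.

```python
import json
import os
import urllib.request


def build_restart_request(host, token, cluster_id):
    """Build (but do not send) a POST request to the clusters restart endpoint."""
    # Path prefix is an assumption; confirm it against the API reference.
    url = host.rstrip("/") + "/api/2.1/clusters/restart"
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def restart_cluster(cluster_id):
    """Send the restart request; raises urllib.error.HTTPError on a non-2xx reply."""
    # Hypothetical environment variables holding the workspace URL and a token.
    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]
    request = build_restart_request(host, token, cluster_id)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

In the scheduled job you would call restart_cluster(cluster_id) for each unaddressed entry and then mark that entry as handled, so the same cluster is not restarted again on the next run.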

Disclaimer: this is not tested, but it should help. Let me know if you face issues while implementing it and I can help.

Thanks!
