Data Engineering

Databricks Compute Metrics Alerts

pranaav93
New Contributor III

Hi All,

I'm looking for implementation ideas: I want to use information from the system.compute.node_timeline table to catch memory spikes and, if usage is above a given threshold, restart the cluster through an API call.

Have any of you implemented a similar solution? Any references would help.

1 ACCEPTED SOLUTION

NandiniN
Databricks Employee

Hey @pranaav93 

Building alerting and remediation on top of the system table system.compute.node_timeline is a very common use case.

Check this KB, whose use case is programmatically getting a cluster's memory usage: https://kb.databricks.com/en_US/clusters/getting-node-specific-instead-of-cluster-wide-memory-usage-...

To check memory usage (in MB), join the table node_timeline with the table node_types. Run the following query in a notebook, through a job, or with Databricks SQL.

select cluster_id, instance_id, start_time, end_time,
  -- percent of node memory used x total node memory (MB) = MB used per node
  round(mem_used_percent / 100 * node_types.memory_mb, 0) as mem_used_mb
from system.compute.node_timeline
join system.compute.node_types using (node_type)
order by start_time desc;

Next, add a threshold (e.g., mem_used_percent > 90) and a sustained-duration condition (e.g., at least 3 consecutive one-minute samples above the threshold) to filter out transient spikes, and store the matching rows in a table, for example a Delta table such as monitoring.alert_cluster_restart. A sketch follows.
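
A minimal sketch of the detection step (untested), assuming a Databricks notebook where spark is the active session; the 90% threshold, the 3-sample window, the 15-minute lookback, and the monitoring.alert_cluster_restart table name are all illustrative choices:

# Sketch (untested): flag nodes whose memory stayed above 90% for 3
# consecutive one-minute samples, then append them to a Delta alert table.
detection_sql = """
WITH recent AS (
  SELECT
    cluster_id,
    instance_id,
    start_time,
    -- rolling window over the current sample and the two before it, per node
    COUNT(*) OVER (
      PARTITION BY cluster_id, instance_id
      ORDER BY start_time
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS samples_in_window,
    SUM(CASE WHEN mem_used_percent > 90 THEN 1 ELSE 0 END) OVER (
      PARTITION BY cluster_id, instance_id
      ORDER BY start_time
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS breaches_in_window
  FROM system.compute.node_timeline
  WHERE start_time >= current_timestamp() - INTERVAL 15 MINUTES
)
SELECT DISTINCT cluster_id, instance_id, start_time AS detected_at
FROM recent
WHERE samples_in_window = 3 AND breaches_in_window = 3
"""

alerts = spark.sql(detection_sql)  # `spark` is the notebook's active session
if alerts.count() > 0:
    alerts.write.mode("append").saveAsTable("monitoring.alert_cluster_restart")

Because node_timeline emits one row per node per minute, three consecutive breaches in the rolling window mean the spike has lasted at least three minutes.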

Then schedule the query or notebook to run as a Databricks Job every 5-10 minutes. Whenever the query returns rows (or the table has entries that have not yet been addressed), restart the affected cluster through the REST API (https://docs.databricks.com/api/workspace/clusters/restart); see the sketch below.
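
A minimal sketch of the remediation step (untested); the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are illustrative, so adapt authentication to your workspace, and the alert table name matches the assumption above:

import os
import requests

# Sketch (untested): restart each flagged cluster via the Clusters REST API.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token (illustrative)

flagged = spark.sql(
    "SELECT DISTINCT cluster_id FROM monitoring.alert_cluster_restart"
).collect()

for row in flagged:
    resp = requests.post(
        f"{host}/api/2.1/clusters/restart",
        headers={"Authorization": f"Bearer {token}"},
        json={"cluster_id": row.cluster_id},
    )
    resp.raise_for_status()  # surface failures instead of silently continuing

After a successful restart you would also want to mark the corresponding alert rows as addressed, so the next job run does not restart the same cluster again.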

Disclaimer: not tested, but this should help. Let me know if you face issues while implementing it and I can help.

Thanks!
