Hey @pranaav93
Building alerting and remediation on top of the system table system.compute.node_timeline is a very common use case.
Check this KB: https://kb.databricks.com/en_US/clusters/getting-node-specific-instead-of-cluster-wide-memory-usage-... The use case there is to programmatically get a cluster's memory usage.
To check memory usage (in MB), join the table node_timeline with the table node_types. Run the following query in a notebook, through a job, or with Databricks SQL.
SELECT
  cluster_id,
  instance_id,
  start_time,
  end_time,
  ROUND(mem_used_percent / 100 * node_types.memory_mb, 0) AS mem_used_mb
FROM system.compute.node_timeline
JOIN system.compute.node_types USING (node_type)
ORDER BY start_time DESC;
Next, add a threshold (e.g., mem_used_percent > 90) and a sustained duration (e.g., COUNT(*) >= 3 consecutive minutes) to filter out transient spikes. Store the results in a table, for example a Delta table such as monitoring.alert_cluster_restart, as sketched below.
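Here is a minimal sketch of that detection step as a Python notebook cell (not tested). The threshold, lookback window, and the monitoring.alert_cluster_restart table name are assumptions you can adjust; the column names come from the query above.

# Flag nodes whose memory usage stayed above the threshold for at least
# 3 of the last 10 one-minute samples, then append them to a Delta table.
THRESHOLD_PCT = 90       # memory usage threshold (assumption, tune as needed)
MIN_BREACHES = 3         # minimum samples above the threshold to count as sustained
LOOKBACK_MINUTES = 10    # how far back to scan on each scheduled run

alerts = spark.sql(f"""
    SELECT
        cluster_id,
        instance_id,
        COUNT(*)              AS breach_count,
        MAX(mem_used_percent) AS max_mem_used_percent,
        MAX(end_time)         AS last_seen
    FROM system.compute.node_timeline
    WHERE mem_used_percent > {THRESHOLD_PCT}
      AND start_time >= current_timestamp() - INTERVAL {LOOKBACK_MINUTES} MINUTES
    GROUP BY cluster_id, instance_id
    HAVING COUNT(*) >= {MIN_BREACHES}
""")

# Persist the findings so the remediation job (and humans) can act on them.
# Assumes the monitoring schema already exists.
alerts.write.format("delta").mode("append").saveAsTable("monitoring.alert_cluster_restart")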
Then schedule the query or notebook to run as a Databricks Job every 5-10 minutes. When the query returns results, or the table has entries that have not yet been addressed, you can restart the affected cluster through the REST API (https://docs.databricks.com/api/workspace/clusters/restart), along the lines of the sketch below.
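For the remediation step, something along these lines should work (again, not tested). It reads the alert table and calls the clusters/restart endpoint for each flagged cluster; the workspace URL, the secret scope holding the token, and the alert-table layout are assumptions.

import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"      # assumption: your workspace URL
TOKEN = dbutils.secrets.get(scope="monitoring", key="databricks-pat")  # assumption: PAT stored in a secret scope

# Restart every cluster that currently has an alert row.
pending = spark.sql("SELECT DISTINCT cluster_id FROM monitoring.alert_cluster_restart").collect()

for row in pending:
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/restart",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"cluster_id": row.cluster_id},
        timeout=30,
    )
    resp.raise_for_status()  # surface failures in the job run

In practice you will also want to mark rows as addressed (or delete them) after a successful restart, so the job does not keep restarting the same cluster on every run.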
Disclaimer: I have not tested this end to end, but it should get you there. Let me know if you run into issues while implementing it and I can help.
Thanks!