<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks Compute Metrics Alerts in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-compute-metrics-alerts/m-p/135008#M50255</link>
    <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/183306"&gt;@pranaav93&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Using the system table&amp;nbsp;&lt;CODE&gt;system.compute.node_timeline&lt;/CODE&gt; to build alerting and remediation is a very common use case.&lt;/P&gt;
&lt;P&gt;Check this KB article:&amp;nbsp;&lt;A href="https://kb.databricks.com/en_US/clusters/getting-node-specific-instead-of-cluster-wide-memory-usage-data-from-system-compute-node_timeline" target="_blank"&gt;https://kb.databricks.com/en_US/clusters/getting-node-specific-instead-of-cluster-wide-memory-usage-data-from-system-compute-node_timeline&lt;/A&gt;. The use case there is&lt;SPAN&gt;&amp;nbsp;to programmatically get a cluster’s memory usage.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;To check memory usage (in megabytes), join the table&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;node_timeline&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with the table&amp;nbsp;&lt;CODE&gt;node_types&lt;/CODE&gt;. Run the following code in a notebook, through a job, or with Databricks SQL.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-plain"&gt;select cluster_id, instance_id, start_time, end_time, round(mem_used_percent / 100 * node_types.memory_mb, 0) as mem_used_mb
from system.compute.node_timeline
join system.compute.node_types using(node_type)
order by start_time desc;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Next, you can add a&amp;nbsp;threshold (e.g., &lt;EM&gt;mem_used_percent &amp;gt; 90%&lt;/EM&gt;) and a sustained duration (e.g., &lt;EM&gt;COUNT(*) &amp;gt;= 3&lt;/EM&gt; consecutive minutes) to filter out transient spikes. Store the results in a table, for example a Delta table such as&amp;nbsp;&lt;EM&gt;monitoring.alert_cluster_restart&lt;/EM&gt;.&lt;/P&gt;
&lt;P&gt;Then schedule the SQL query or notebook to run as a Databricks Job every 5-10 minutes. When the query returns results, or the table has entries that have not yet been addressed, you can restart the cluster using the REST API: &lt;A href="https://docs.databricks.com/api/workspace/clusters/restart" target="_blank"&gt;https://docs.databricks.com/api/workspace/clusters/restart&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Disclaimer: this is not tested, but it sounds like it should help. Let me know if you face issues while implementing it and I can help.&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Wed, 15 Oct 2025 14:06:39 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2025-10-15T14:06:39Z</dc:date>
    <item>
      <title>Databricks Compute Metrics Alerts</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-compute-metrics-alerts/m-p/134807#M50195</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;I'm looking for some implementation ideas where I can use information from the system.compute.node_timeline table to catch memory spikes and, if they exceed a given threshold, restart the cluster through an API call.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Have any of you implemented a similar solution? Any references would help.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Oct 2025 07:08:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-compute-metrics-alerts/m-p/134807#M50195</guid>
      <dc:creator>pranaav93</dc:creator>
      <dc:date>2025-10-14T07:08:37Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Compute Metrics Alerts</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-compute-metrics-alerts/m-p/135008#M50255</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/183306"&gt;@pranaav93&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Using the system table&amp;nbsp;&lt;CODE&gt;system.compute.node_timeline&lt;/CODE&gt; to build alerting and remediation is a very common use case.&lt;/P&gt;
&lt;P&gt;Check this KB article:&amp;nbsp;&lt;A href="https://kb.databricks.com/en_US/clusters/getting-node-specific-instead-of-cluster-wide-memory-usage-data-from-system-compute-node_timeline" target="_blank"&gt;https://kb.databricks.com/en_US/clusters/getting-node-specific-instead-of-cluster-wide-memory-usage-data-from-system-compute-node_timeline&lt;/A&gt;. The use case there is&lt;SPAN&gt;&amp;nbsp;to programmatically get a cluster’s memory usage.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;To check memory usage (in megabytes), join the table&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;node_timeline&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with the table&amp;nbsp;&lt;CODE&gt;node_types&lt;/CODE&gt;. Run the following code in a notebook, through a job, or with Databricks SQL.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-plain"&gt;select cluster_id, instance_id, start_time, end_time, round(mem_used_percent / 100 * node_types.memory_mb, 0) as mem_used_mb
from system.compute.node_timeline
join system.compute.node_types using(node_type)
order by start_time desc;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Next, you can add a&amp;nbsp;threshold (e.g., &lt;EM&gt;mem_used_percent &amp;gt; 90%&lt;/EM&gt;) and a sustained duration (e.g., &lt;EM&gt;COUNT(*) &amp;gt;= 3&lt;/EM&gt; consecutive minutes) to filter out transient spikes. Store the results in a table, for example a Delta table such as&amp;nbsp;&lt;EM&gt;monitoring.alert_cluster_restart&lt;/EM&gt;.&lt;/P&gt;
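&lt;P&gt;As a rough, untested sketch of that filter: the query below counts one-minute records above the threshold per node over a recent window, which approximates a sustained spike without checking strict consecutiveness. The 90% threshold, the 3-record minimum, and the 10-minute lookback are illustrative assumptions to tune for your workload.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-plain"&gt;-- Nodes with at least 3 one-minute records above 90% memory in the last 10 minutes.
-- Threshold, record count, and lookback window are assumed values, not recommendations.
select cluster_id, instance_id,
       count(*) as minutes_over_threshold,
       max(mem_used_percent) as peak_mem_used_percent,
       max(end_time) as last_seen
from system.compute.node_timeline
where start_time &amp;gt;= current_timestamp() - interval 10 minutes
  and mem_used_percent &amp;gt; 90
group by cluster_id, instance_id
having count(*) &amp;gt;= 3;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;System table data can arrive with some delay, so the lookback window may need to be wider than the job schedule alone would suggest.&lt;/P&gt;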
&lt;P&gt;Then schedule the SQL query or notebook to run as a Databricks Job every 5-10 minutes. When the query returns results, or the table has entries that have not yet been addressed, you can restart the cluster using the REST API: &lt;A href="https://docs.databricks.com/api/workspace/clusters/restart" target="_blank"&gt;https://docs.databricks.com/api/workspace/clusters/restart&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Disclaimer: this is not tested, but it sounds like it should help. Let me know if you face issues while implementing it and I can help.&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Wed, 15 Oct 2025 14:06:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-compute-metrics-alerts/m-p/135008#M50255</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-10-15T14:06:39Z</dc:date>
    </item>
  </channel>
</rss>

