instance_id in compute.node_timelines

pranitha
New Contributor II

I am trying to fetch active worker nodes from system tables using the code like below:

select count(distinct instance_id)

from system.compute.node_timelines where cluster_id = "xx"

groupy by instance_id,start_time,end_times

It gives an output like 20 but in the my cluster configuration I have set max worker nodes is 15. How is this possible. Does instance_id refers to the worker nodes ?

Alberto_Umana
Databricks Employee
Databricks Employee

Hi @pranitha,

The instance_id field refers to the specific instance of a node, which can include both driver and worker nodes. In your query, you are counting distinct instance_ids, which might include instances that are not currently active or are driver nodes, leading to a higher count than the maximum number of worker nodes set in your cluster configuration.

Additionally, the node_timeline table captures node-level resource utilization data at minute granularity, and each record contains data for a given minute of time per instance. This means that if nodes are frequently added and removed (e.g., due to autoscaling), the count of distinct instance_ids over a period of time can exceed the maximum number of worker nodes set for the cluster at any single point in time.

Alberto_Umana
Databricks Employee
Databricks Employee

To ensure you are only counting active worker nodes, you can use:

SELECT COUNT(DISTINCT instance_id)

FROM system.compute.node_timelines

WHERE cluster_id = 'xx'

  AND driver = false

  AND end_time IS NULL

pranitha
New Contributor II

Hi @Alberto_Umana , 

Thanks for replying.

Even if we add the driver node it should be around 16-17 right, not like 20. I checked for al the clusters, for every cluster there is a difference of 5-7 nodes between max_worker count and count(distinct instance_id) for a specific minute.