SashankKotta
Databricks Employee

In the world of cloud computing and big data processing, efficiency is key. Databricks, a popular platform for large-scale data analytics, has implemented a clever mechanism to optimize cluster startup times. This blog post delves into the intricacies of Databricks cluster creation and reveals an often-overlooked feature that can significantly reduce wait times.

Understanding Databricks Cluster Creation:

When a Databricks cluster is initiated, it typically follows these steps:

  1. Virtual Machine (VM) Provisioning: New VMs are created in the underlying cloud infrastructure for each worker node.
  2. Spark Image Installation: The appropriate Spark images are installed on these instances.
  3. Cluster Initialization: The cluster state changes from "Cluster Starting" to "Driver Healthy" (the point at which the cluster can be used).

This process usually takes 4-5 minutes. However, many users have noticed that it can sometimes be as quick as 2-3 minutes. What's behind this variation in startup times?

The Secret Sauce: VM Reuse

The key to these faster startup times lies in Databricks' intelligent VM reuse mechanism. Here's how it works:

  1. VM Retention: When a Databricks cluster is terminated, the underlying VMs don't immediately disappear. Instead, they remain available for a short period (they are no longer billed to the cluster but are deliberately kept alive for reuse).
  2. Reuse Window: These VMs stay alive for approximately 5 minutes after cluster termination.
  3. Efficient Allocation: If a new cluster is started in the same workspace while those VMs are still being kept alive, and it requires the same VM type (i.e., the same worker/driver type configuration), Databricks reuses the existing VMs.
  4. Fresh Start: If more than 5 minutes have elapsed since the previous cluster's termination and the older VMs are no longer available, fresh ones are provisioned.

This reuse strategy can cut cluster startup times nearly in half (from the usual 4-5 minutes to 2-3 minutes), significantly increasing productivity and resource efficiency.
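The reuse rules listed above can be summarized in a few lines of Python. This is only an illustrative model of the conditions described in this post, not Databricks' actual implementation; the function name, the `same_workspace` flag, the example node type, and the timestamps are all assumptions made for the sketch.

```python
from datetime import datetime, timedelta

# Assumption: the ~5-minute window is taken from the description above; the
# real retention period is managed by Databricks and may differ.
REUSE_WINDOW = timedelta(minutes=5)


def is_reuse_candidate(terminated_at, requested_at, old_node_type, new_node_type,
                       same_workspace=True, window=REUSE_WINDOW):
    """Return True if a retained VM could plausibly be reused under the rules above."""
    if not same_workspace:
        return False  # reuse is described only within the same workspace
    if old_node_type != new_node_type:
        return False  # the new cluster must request the same VM type
    elapsed = requested_at - terminated_at
    # The new cluster must start after termination and inside the window.
    return timedelta(0) <= elapsed <= window


# Hypothetical timestamps and node type, for illustration only.
terminated = datetime(2024, 1, 1, 10, 0)
print(is_reuse_candidate(terminated, terminated + timedelta(minutes=2),
                         "Standard_DS3_v2", "Standard_DS3_v2"))  # True
print(is_reuse_candidate(terminated, terminated + timedelta(minutes=8),
                         "Standard_DS3_v2", "Standard_DS3_v2"))  # False
```

Even when all of these conditions hold, reuse depends on the retained VMs still being available, so treat the result as "eligible" rather than "guaranteed".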

Monitoring VMs associated with Databricks Clusters:

Consider the following example: a cluster with 4 workers and 1 driver (Figure 1).

Figure: 1

SashankKotta_0-1728641262845.png

To monitor the VMs in Azure, navigate to the Virtual Machines section in the Azure portal and search for the workspace name. You can see the VMs created (4 workers and 1 driver), as shown below (Figure 2).

Figure: 2

SashankKotta_2-1731995218678.png

Observing VM Reuse Through System Tables: 

For those interested in the technical details of this VM reuse, it can be observed in Databricks system tables. Specifically, the system.compute.node_timeline table captures both cluster and VM details. Here's what to look for:

  • cluster_id: This represents the Databricks cluster ID.
  • instance_id: This corresponds to the Azure VM name.

By examining this table, you may observe multiple cluster_id entries associated with a single instance_id, clearly indicating that the same VM has been reused across different Databricks clusters.
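One way to look for this pattern is to group the table by instance_id and keep the instances that appear under more than one cluster_id. The sketch below holds such a query as a string and mirrors its logic in a small pure-Python helper so the idea can be tried locally. Only the table name and the cluster_id and instance_id columns come from this post; the helper name, the variable names, and the sample rows are made up for illustration. In a Databricks notebook, the query could be run with `spark.sql(REUSED_INSTANCES_SQL)`, assuming you have access to the system tables.

```python
from collections import defaultdict

# Query sketch: instance_ids that appear under more than one cluster_id.
# Not executed here; it needs a Databricks workspace with system tables enabled.
REUSED_INSTANCES_SQL = """
SELECT instance_id,
       COUNT(DISTINCT cluster_id) AS cluster_count,
       COLLECT_SET(cluster_id)    AS cluster_ids
FROM system.compute.node_timeline
GROUP BY instance_id
HAVING COUNT(DISTINCT cluster_id) > 1
"""


def find_reused_instances(rows):
    """Local equivalent of the query: instance_id -> sorted cluster_ids, when more than one."""
    clusters_by_instance = defaultdict(set)
    for row in rows:
        clusters_by_instance[row["instance_id"]].add(row["cluster_id"])
    return {
        instance: sorted(clusters)
        for instance, clusters in clusters_by_instance.items()
        if len(clusters) > 1
    }


# Hypothetical rows; these IDs are placeholders, not real VM or cluster names.
sample_rows = [
    {"instance_id": "vm-aaa", "cluster_id": "cluster-1"},
    {"instance_id": "vm-aaa", "cluster_id": "cluster-2"},
    {"instance_id": "vm-bbb", "cluster_id": "cluster-2"},
]
print(find_reused_instances(sample_rows))  # {'vm-aaa': ['cluster-1', 'cluster-2']}
```

On a busy workspace you would probably also want to restrict the query to a recent time range, using whatever timestamp columns your version of the table exposes.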

Consider the following scenario (shown in Figure 3), observed in the system.compute.node_timeline table:

1. Cluster A (cluster_id: 0910-223227-j5z0hdm4)

  • Termination time: 22:43

2. Cluster B (cluster_id: 0910-224002-il9wylgl)

  • Start time: 22:44
  • Node type: Same as Cluster A
  • Instance ID: [specific VM identifier]

In this case, Cluster A terminated at 22:43, and just one minute later, at 22:44, Cluster B was started with the same node-type configuration. The crucial detail is that the same VM instance (same instance ID) was reallocated to the new cluster (Cluster B). This reallocation occurred because:

  1. The new cluster was started within the 5-minute window after the previous cluster's termination.
  2. The new cluster requested the same node type as a recently terminated cluster.

This example demonstrates Databricks' VM reuse strategy and the resulting reduction in cluster startup times.
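The timing in this scenario can be checked with a few lines of Python. This is a small worked example, not part of any Databricks API: the variable names are illustrative, and because only the clock times are given, both timestamps are parsed onto the same default date.

```python
from datetime import datetime, timedelta

# Approximate reuse window described in this post (an assumption, not an API value).
REUSE_WINDOW = timedelta(minutes=5)

# Clock times from the scenario; strptime places both on the same default date.
cluster_a_terminated = datetime.strptime("22:43", "%H:%M")
cluster_b_started = datetime.strptime("22:44", "%H:%M")

gap = cluster_b_started - cluster_a_terminated
same_node_type = True  # stated in the scenario: Cluster B uses Cluster A's node type
within_window = timedelta(0) <= gap <= REUSE_WINDOW

print(gap)                               # 0:01:00
print(within_window and same_node_type)  # True
```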

Figure: 3

SashankKotta_3-1731995280489.png

 

Conclusion:

Understanding this VM reuse feature can help data engineers and analysts optimize their workflows. By timing cluster creations to take advantage of this reuse window, teams can significantly reduce wait times and improve overall efficiency. This is a prime example of how cloud platforms like Databricks continue to innovate, finding clever ways to maximize resources and minimize downtime. As we continue to push the boundaries of big data processing, these seemingly small optimizations can add up to substantial time and cost savings. Keep this feature in mind the next time you're working with Databricks clusters – it might just give your data processing pipeline the speed boost it needs!