Run driver on spot instance

Erik
Valued Contributor II

The traditional advice seems to be to run the driver on-demand and, optionally, the workers on spot. And this is indeed what happens if one chooses to run with spot instances in Databricks. But I am interested in what happens if we run the driver on spot and it gets evicted. Can we end up with corrupt data?

We have some batch jobs which run as structured streaming every night. They seem like prime candidates to run on 100% spot with retries, but I want to understand why this is not a more common pattern first.

3 REPLIES

Kaniz
Community Manager

Hi @Erik, Certainly! Let’s delve into the behaviour of driver and worker nodes in Databricks, especially when it comes to spot instances:

 

Driver Node Failure:

  • If the driver node fails, your entire cluster will fail.
  • The driver node is critical because it orchestrates the execution of tasks across worker nodes.
  • Therefore, it’s recommended to use on-demand instances for the driver to ensure cluster stability.

Worker Node Failure:

  • When a worker node fails, Databricks automatically spawns a new worker node to replace the failed one.
  • The workload resumes seamlessly on the new worker node.
  • This behaviour applies to open-source Spark clusters as well as Databricks clusters.

Spot Instances for Workers:

  • Using spot instances for workers is a common pattern to optimize costs.
  • Since worker nodes handle data processing and computation, their failure is less critical.
  • Databricks efficiently manages worker node replacements, ensuring minimal impact on your workload (see the configuration sketch below).
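
As a rough illustration, here is how such a cluster might be created on AWS via the Clusters API, with the driver on-demand and the workers on spot. The host, token, runtime version and node type are placeholders, and the field names (first_on_demand, availability) should be double-checked against the current API reference:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                          # placeholder

cluster_spec = {
    "cluster_name": "spot-workers-demo",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "aws_attributes": {
        # The first N nodes (the driver counts as one) are on-demand;
        # first_on_demand = 1 keeps only the driver on-demand.
        "first_on_demand": 1,
        # Remaining nodes are spot, falling back to on-demand when
        # spot capacity is unavailable.
        "availability": "SPOT_WITH_FALLBACK",
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # {"cluster_id": "..."}
```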

Structured Streaming Jobs:

  • Your batch jobs running as structured streaming are indeed prime candidates for spot instances.
  • Structured streaming handles data incrementally, and its fault tolerance mechanisms can cope with worker node failures.
  • By using spot instances, you can achieve cost savings without compromising data integrity, as the sketch below illustrates.
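
As a rough sketch, a nightly "batch via Structured Streaming" job of the kind you describe might look like this. The Delta source/sink, paths and transformation are assumptions, not taken from your setup; the checkpoint location is what lets a restarted run pick up where the previous one left off:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("delta")
    .load("/mnt/raw/events")                    # placeholder source path
)

cleaned = events.where("event_id IS NOT NULL")  # hypothetical transformation

query = (
    cleaned.writeStream
    .format("delta")
    .outputMode("append")
    # The checkpoint records which offsets have been committed; on restart
    # (including after a reclaimed spot driver) the query resumes from here.
    .option("checkpointLocation", "/mnt/checkpoints/nightly_events")
    # Process everything that is new, then stop -- suited to a nightly run.
    # On older runtimes without availableNow, use .trigger(once=True).
    .trigger(availableNow=True)
    .start("/mnt/bronze/events")                # placeholder sink path
)
query.awaitTermination()
```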

Why Not More Common?:

  • While spot instances offer cost benefits, some caution arises from the criticality of the driver node.
  • Organizations often prioritize stability and reliability over cost optimization.
  • However, as Databricks continues to enhance its features, spot instances may become more prevalent for various workloads.

In summary, consider using spot instances for workers while ensuring the driver runs on demand. This approach strikes a balance between cost efficiency and reliability. 

 

Happy streaming! 🌟

Erik
Valued Contributor II

Thanks for your answer @Kaniz! Good overview, and I understand that "driver on-demand and the rest on spot" is good general advice. But I am still considering using spot instances for both, and I am left with two concrete questions:

1: Can we end up in a corrupt state if the driver is reclaimed? There are many other scenarios in which a driver can crash, get turned off, etc., so I assume Spark is written to handle this without eating our data. Is this correct? (I understand that software can have bugs; my question is whether Spark is **intended** to handle a driver failure without corrupting data, not whether you can guarantee that it will actually work in all cases.)

2: If we use databricks workflows with retries on the job, and a driver gets reclaimed, will the job get retried? And does it count towards the max retries?

Kaniz
Community Manager

Hi @Erik

 

Certainly! Let’s delve into your questions about Spark and Databricks workflows:

 

Driver Reclamation and Data Integrity:

  • When the driver node running your Spark application fails, the Spark session details are lost, along with all the in-memory data held by the executors. However, Spark is designed to handle this scenario gracefully.
  • If you restart your application, the getOrCreate() pattern will reinitialize the streaming context from the checkpoint directory and resume processing, so your data remains intact (see the sketch after this list).
  • On most cluster managers, Spark does not automatically relaunch the driver if it crashes. Therefore, it’s essential to monitor the driver using tools like monit and manually restart it.
  • In the Standalone cluster manager, you can use the --supervise flag when submitting the driver to allow Spark to automatically restart it.
  • The best approach may vary based on your environment and specific requirements.
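
The getOrCreate() recovery behaviour mentioned above comes from the legacy DStreams API; a minimal sketch is below, with the checkpoint path and socket source purely hypothetical. For Structured Streaming (as in your nightly jobs), the equivalent is simply restarting the same query with the same checkpointLocation:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "/mnt/checkpoints/dstream_job"  # placeholder path

def create_context():
    # Only runs when no usable checkpoint exists yet.
    sc = SparkContext.getOrCreate()
    ssc = StreamingContext(sc, batchDuration=60)
    ssc.checkpoint(CHECKPOINT_DIR)
    lines = ssc.socketTextStream("stream-host", 9999)  # hypothetical source
    lines.count().pprint()
    return ssc

# After a driver failure, rerunning this recovers the streaming context
# (offsets, pending batches, state) from the checkpoint directory instead
# of building a fresh one.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```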

Databricks Workflows and Retries:

  • Databricks workflows support an automatic retry policy for failed runs. When a task fails, Databricks will automatically retry it according to the configured policy.
  • If a driver gets reclaimed during a workflow, the job will be retried. The retry interval is calculated between the start of the failed run and the subsequent retry run.
  • These retries are essential for maintaining reliability and ensuring that transient issues do not disrupt your data processing.
  • The retries do count towards the maximum retries specified in your workflow configuration (see the job-settings sketch below).
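
As a rough sketch, the relevant part of a Jobs API 2.1 task payload (e.g. sent to POST /api/2.1/jobs/create, or set through the Workflows UI) might look like this. Field names follow the public docs at the time of writing (max_retries, min_retry_interval_millis, retry_on_timeout), but verify them against the current reference; the runtime, node type and notebook path are placeholders:

```python
task = {
    "task_key": "nightly_ingest",
    "notebook_task": {"notebook_path": "/Jobs/nightly_events"},
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 4,
        "aws_attributes": {
            "first_on_demand": 0,            # 0 = driver on spot as well
            "availability": "SPOT_WITH_FALLBACK",
        },
    },
    # A reclaimed driver surfaces as a failed run; these settings control
    # how often it is retried (each retry counts toward max_retries).
    "max_retries": 3,
    "min_retry_interval_millis": 10 * 60 * 1000,
    "retry_on_timeout": False,
}
```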

Remember that while Spark and Databricks provide robust mechanisms for handling failures, it’s essential to design your workflows carefully and consider factors like checkpointing, data durability, and fault tolerance to ensure data integrity and reliability. 🚀.