06-28-2022 06:23 AM
06-28-2022 07:34 AM
If the driver node fails, your cluster will fail. If a worker node fails, Databricks will spawn a new worker node to replace it and resume the workload. Generally it is recommended to use an on-demand instance for your driver and spot instances for your worker nodes.
As for a comparison between Spark and Databricks, please visit our comparison page (https://databricks.com/spark/comparing-databricks-to-apache-spark).
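The on-demand-driver/spot-worker setup above can be expressed in a cluster definition. Below is a minimal sketch of a Clusters API payload (AWS) as a Python dict; the cluster name, runtime version, and instance type are placeholder examples, so substitute values valid for your workspace.

```python
# Sketch of a Databricks Clusters API payload (AWS): the driver stays on an
# on-demand instance while the workers run on spot capacity.
cluster_spec = {
    "cluster_name": "spot-workers-example",   # hypothetical name
    "spark_version": "10.4.x-scala2.12",      # example runtime version
    "node_type_id": "i3.xlarge",              # example instance type
    "num_workers": 4,
    "aws_attributes": {
        # first_on_demand = 1 keeps the first node launched (the driver)
        # on-demand; the remaining nodes use spot instances, falling back
        # to on-demand if spot capacity is unavailable.
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
    },
}
```

This way a spot reclamation can only take out workers (which Databricks replaces), never the driver.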
06-28-2022 08:46 AM
So the data is copied on other worker nodes?
Or the data on that worker node is lost?
06-28-2022 07:43 AM
Good one @Cedric Law Hing Ping
06-28-2022 08:53 AM
So even if a worker node fails mid-job, it will resume the job?
And what about the data on that worker node?
Is it lost?
06-28-2022 09:15 AM
Yes, the cluster will treat it as a lost worker and schedule the workload on a different worker. Temporary data on the failed worker is lost and has to be recomputed by another worker node.
06-28-2022 09:27 AM
Alright, thanks!