Differences between Spark Cluster Manager and Databricks Cluster Manager?

jwilliam — Fri, 30 Sep 2022 07:56:28 GMT

I couldn't find any documentation on the Databricks cluster manager. Could anyone point me to some resources on this topic?

Re: Differences between Spark Cluster Manager and Databricks Cluster Manager?

User16752242622 — Fri, 30 Sep 2022 12:58:22 GMT

Hi @John William

Databricks clusters use Spark's Standalone cluster manager. Each Databricks cluster has its own standalone Master and Worker processes, which run inside LXC containers and share a lifecycle with the cluster. Each cluster has a single Driver process, which acts as the sole Spark application for the standalone cluster.

Here is the official Spark Standalone cluster mode doc: https://spark.apache.org/docs/latest/spark-standalone.html
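
If you want to check which cluster manager a given Spark session is talking to, the master URL is a reasonable hint. Below is a minimal sketch in plain Python that maps the master URL formats listed in the Spark docs to a cluster manager name. The helper name `classify_master` and the example host are my own, for illustration only. I'd expect a standalone cluster to report a `spark://` URL, but I haven't verified that on every Databricks runtime.

```python
import re


def classify_master(master_url: str) -> str:
    """Map a Spark master URL to the cluster manager it implies.

    Hypothetical helper for illustration; it only inspects the URL string
    and does not contact any cluster.
    """
    # local, local[K], local[*], local[K,F], local[*,F]
    if master_url == "local" or re.fullmatch(
        r"local\[(\*|\d+)(,\s*\d+)?\]", master_url
    ):
        return "local"
    if master_url.startswith("spark://"):
        return "standalone"
    if master_url.startswith("k8s://"):
        return "kubernetes"
    if master_url.startswith("mesos://"):
        return "mesos"
    if master_url == "yarn":
        return "yarn"
    return "unknown"


# Example with a made-up standalone master address.
print(classify_master("spark://10.0.0.1:7077"))  # prints "standalone"
```

In a notebook where a `spark` session already exists, you could pass `spark.sparkContext.master` to the helper instead of a hard-coded string.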

Re: Differences between Spark Cluster Manager and Databricks Cluster Manager?

jwilliam — Wed, 05 Oct 2022 07:35:23 GMT

Hi @Akash Bhat, thank you for your reply. I'm really surprised that Databricks clusters use Spark's Standalone cluster manager, because if I read this post correctly, Databricks uses Kubernetes as its cluster manager: https://www.databricks.com/blog/2021/08/06/how-we-built-databricks-on-google-kubernetes-engine-gke.html

Re: Differences between Spark Cluster Manager and Databricks Cluster Manager?

User16752242622 — Thu, 06 Oct 2022 18:32:15 GMT

Hi @John William

The cluster manager launches worker instances and starts worker services.

On AWS and Azure, the cluster manager issues API calls to the cloud provider to obtain these instances for a cluster.

On GCP, by contrast, Databricks maintains Google Kubernetes Engine (GKE) node pools for provisioning the driver node and the executor nodes.