<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks Spark Vs Spark on Yarn in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/108219#M42995</link>
    <description>&lt;P&gt;But isn’t that a serious disadvantage compared to YARN clusters?&lt;/P&gt;&lt;P&gt;And the way I understood Workflows (and the team behind the UI component, among other things), we are clearly meant to reuse the same compute cluster and run tasks in parallel.&lt;/P&gt;&lt;P&gt;If I ran spark-submit jobs instead, would the logs be separated, since each submission would finally spawn its own session?&lt;/P&gt;</description>
    <pubDate>Fri, 31 Jan 2025 22:37:17 GMT</pubDate>
    <dc:creator>de-qrosh</dc:creator>
    <dc:date>2025-01-31T22:37:17Z</dc:date>
    <item>
      <title>Databricks Spark Vs Spark on Yarn</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/21383#M14562</link>
      <description>&lt;P&gt;I am moving my Spark workloads from an EMR/on-premises Spark cluster to Databricks. I understand that Databricks Spark is different from YARN. How is the Databricks architecture different from YARN's?&lt;/P&gt;</description>
      <pubDate>Wed, 23 Jun 2021 15:25:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/21383#M14562</guid>
      <dc:creator>brickster_2018</dc:creator>
      <dc:date>2021-06-23T15:25:02Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Spark Vs Spark on Yarn</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/21384#M14563</link>
      <description>&lt;P&gt;Users often compare a Databricks cluster to a YARN cluster. It's not an apples-to-apples comparison.&lt;/P&gt;&lt;P&gt;A Databricks cluster should instead be compared to a single Spark application submitted on YARN. A Spark application on YARN has a driver container and executor containers launched on the cluster nodes, and the Application Master runs inside the driver container (yarn-cluster mode).&lt;/P&gt;&lt;P&gt;A Databricks cluster likewise has a driver container and executor containers launched on the cluster nodes. Unlike YARN, Databricks launches only one executor per virtual machine. The Application Master in YARN can be compared to the Chauffeur service in Databricks.&lt;/P&gt;&lt;P&gt;In this comparison, Databricks has several benefits over YARN:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Support for multiple languages/sessions within the same cluster.&lt;/LI&gt;&lt;LI&gt;Optimized and improved auto-scaling. The auto-scaling algorithm used in Databricks is considerably more efficient than YARN's dynamic allocation feature.&lt;/LI&gt;&lt;LI&gt;Faster and more reliable scheduling with Spark's standalone scheduler.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 23 Jun 2021 22:48:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/21384#M14563</guid>
      <dc:creator>brickster_2018</dc:creator>
      <dc:date>2021-06-23T22:48:33Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Spark Vs Spark on Yarn</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/107638#M42876</link>
      <description>&lt;P&gt;What about the disadvantages?&lt;/P&gt;&lt;P&gt;How can I cleanly separate multiple jobs running on the same cluster in the logs, and likewise in the Spark UI?&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jan 2025 16:47:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/107638#M42876</guid>
      <dc:creator>de-qrosh</dc:creator>
      <dc:date>2025-01-29T16:47:59Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Spark Vs Spark on Yarn</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/108183#M42992</link>
      <description>&lt;P&gt;Ideally, you don't want to run multiple jobs on the same cluster; there is no clean way to separate the driver logs for each job. In the Spark UI, however, you can use the run IDs and job IDs to separate out the Spark jobs belonging to a particular job.&lt;/P&gt;</description>
      <pubDate>Fri, 31 Jan 2025 19:02:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/108183#M42992</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2025-01-31T19:02:53Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Spark Vs Spark on Yarn</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/108219#M42995</link>
      <description>&lt;P&gt;But isn’t that a serious disadvantage compared to YARN clusters?&lt;/P&gt;&lt;P&gt;And the way I understood Workflows (and the team behind the UI component, among other things), we are clearly meant to reuse the same compute cluster and run tasks in parallel.&lt;/P&gt;&lt;P&gt;If I ran spark-submit jobs instead, would the logs be separated, since each submission would finally spawn its own session?&lt;/P&gt;</description>
      <pubDate>Fri, 31 Jan 2025 22:37:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-spark-vs-spark-on-yarn/m-p/108219#M42995</guid>
      <dc:creator>de-qrosh</dc:creator>
      <dc:date>2025-01-31T22:37:17Z</dc:date>
    </item>
  </channel>
</rss>

