<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Slow running Spark job issue - due to the unknown Spark stages created by Databricks Compute cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/slow-running-spark-job-issue-due-to-the-unknown-spark-stages/m-p/121118#M46342</link>
    <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;We recently migrated our Spark jobs from a self-hosted Spark (YARN) cluster to Databricks.&lt;/P&gt;&lt;P&gt;We are currently using Databricks Workflows with Job Compute clusters and the Spark JAR job type. When we run the job in Databricks, we observed that it creates extra job stages, as shown in the image below. These stages take a significant amount of time, which delays the total job runtime.&amp;nbsp;&lt;BR /&gt;Databricks Runtime: 16.1&lt;BR /&gt;Instance type -&amp;nbsp;&lt;SPAN class=""&gt;&lt;SPAN&gt;Standard_E16ds_v4&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;BR /&gt;Can you please add your suggestions?&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="databricks_new_stages.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17386i11DE724E11328A53/image-size/medium?v=v2&amp;amp;px=400" role="button" title="databricks_new_stages.png" alt="databricks_new_stages.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt; &lt;/P&gt;</description>
    <pubDate>Fri, 06 Jun 2025 05:13:38 GMT</pubDate>
    <dc:creator>anil_reddaboina</dc:creator>
    <dc:date>2025-06-06T05:13:38Z</dc:date>
    <item>
      <title>Slow running Spark job issue - due to the unknown Spark stages created by Databricks Compute cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-running-spark-job-issue-due-to-the-unknown-spark-stages/m-p/121118#M46342</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;We recently migrated our Spark jobs from a self-hosted Spark (YARN) cluster to Databricks.&lt;/P&gt;&lt;P&gt;We are currently using Databricks Workflows with Job Compute clusters and the Spark JAR job type. When we run the job in Databricks, we observed that it creates extra job stages, as shown in the image below. These stages take a significant amount of time, which delays the total job runtime.&amp;nbsp;&lt;BR /&gt;Databricks Runtime: 16.1&lt;BR /&gt;Instance type -&amp;nbsp;&lt;SPAN class=""&gt;&lt;SPAN&gt;Standard_E16ds_v4&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;BR /&gt;Can you please add your suggestions?&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="databricks_new_stages.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17386i11DE724E11328A53/image-size/medium?v=v2&amp;amp;px=400" role="button" title="databricks_new_stages.png" alt="databricks_new_stages.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt; &lt;/P&gt;</description>
      <pubDate>Fri, 06 Jun 2025 05:13:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-running-spark-job-issue-due-to-the-unknown-spark-stages/m-p/121118#M46342</guid>
      <dc:creator>anil_reddaboina</dc:creator>
      <dc:date>2025-06-06T05:13:38Z</dc:date>
    </item>
    <item>
      <title>Re: Slow running Spark job issue - due to the unknown Spark stages created by Databricks Compute clu</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-running-spark-job-issue-due-to-the-unknown-spark-stages/m-p/121131#M46346</link>
      <description>&lt;P&gt;Hi Anil,&lt;/P&gt;&lt;P&gt;How are you doing today? As per my understanding, when you move Spark jobs from a self-hosted YARN cluster to Databricks and run them as Spark JARs on job compute clusters, it's normal to see a few extra stages added to the job execution plan. These stages are usually due to Databricks’ built-in features like adaptive query execution (AQE), automatic optimizations, or internal tracking. While these help with performance tuning, they can sometimes increase the total runtime if not tuned well. I’d suggest temporarily disabling AQE (set spark.sql.adaptive.enabled to false) and reviewing the job stages in the Spark UI to see what’s taking time. Also, double-check whether broadcast joins or data skew might be causing shuffle delays. Using compute pools can also reduce cold-start delays if you're launching new clusters for each run. A bit of tuning here can make a big difference. Happy to help further if you share a specific job plan or logs!&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
      <pubDate>Fri, 06 Jun 2025 11:24:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-running-spark-job-issue-due-to-the-unknown-spark-stages/m-p/121131#M46346</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2025-06-06T11:24:30Z</dc:date>
    </item>
    <item>
      <title>Re: Slow running Spark job issue - due to the unknown Spark stages created by Databricks Compute clu</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-running-spark-job-issue-due-to-the-unknown-spark-stages/m-p/121136#M46348</link>
      <description>&lt;P&gt;Hey Brahma,&lt;BR /&gt;Thanks for your reply. As a first step I will disable the AQE config and test it.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;We are using node pools with the job_compute cluster type, so it's not spinning up a new cluster for each job.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm also setting the two configs below; do you think they could cause any side effects?&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"spark.databricks.io.cache.enabled"&lt;/SPAN&gt;&lt;SPAN&gt;: &lt;/SPAN&gt;&lt;SPAN&gt;"true"&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"spark.databricks.io.cache.maxDiskUsage"&lt;/SPAN&gt;&lt;SPAN&gt;: &lt;/SPAN&gt;&lt;SPAN&gt;"50g"&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Anil&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 06 Jun 2025 12:33:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-running-spark-job-issue-due-to-the-unknown-spark-stages/m-p/121136#M46348</guid>
      <dc:creator>anil_reddaboina</dc:creator>
      <dc:date>2025-06-06T12:33:43Z</dc:date>
    </item>
  </channel>
</rss>