
All applications stuck in Waiting State on Standalone Spark Cluster

ShivamRunthala
New Contributor

Spark Standalone Cluster Configuration (Spark 3.0.0):

  • 1 Master
  • 2 Workers (4 cores each)

I am using the Airflow SparkSubmitOperator to submit jobs to the Spark master in cluster mode. There are multiple (~20) DAGs on Airflow submitting jobs to Spark, each scheduled by a cron expression. All Spark jobs have the same configuration: 1 driver core and 1 executor with 1 core.
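
For reference, each DAG's submit task looks roughly like the sketch below. This is a minimal reconstruction, not my exact code: the task_id, conn_id, and application path are placeholders, and deploy mode "cluster" is assumed to come from the Spark connection's "deploy-mode" extra.

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# One of the ~20 submit tasks (placeholder names and paths). The Spark
# connection "spark_standalone" is assumed to point at spark://<master>:7077
# with extra {"deploy-mode": "cluster"}.
submit_job = SparkSubmitOperator(
    task_id="submit_spark_job",
    conn_id="spark_standalone",
    application="/path/to/job.py",
    num_executors=1,                    # 1 executor per application
    executor_cores=1,                   # 1 core per executor
    conf={"spark.driver.cores": "1"},   # 1 core for the driver
)
```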

At some point Airflow triggers 8 jobs simultaneously, and 8 corresponding drivers are spawned on the Spark workers. Since my workers have only 8 cores in total, the drivers take all of them and no executor can spawn, leaving all 8 applications in the WAITING state. [Spark Master UI snapshot] As the snapshot shows, the cluster is effectively deadlocked - nothing moves. Airflow keeps submitting more applications, which queue up; when I kill one of the WAITING applications, a driver from a queued app spawns in its place.

I can't change the cron expressions on the DAGs to spread these applications out, as they are configured by the users. I tried to solve this problem with the following approaches:

  1. Limit the number of concurrent applications that can run on the cluster - I couldn't find any such config for a Standalone Spark cluster. (The closest thing I can think of is throttling on the Airflow side; see the sketch after this list.)
  2. Associate a dedicated worker with drivers - this way one worker would spawn drivers and the other would spawn executors (which doesn't make much sense for a distributed system, since if that node goes down the whole cluster becomes unusable - but in the current situation I would take that). After a lot of research, the only suggestion I could find was this: https://stackoverflow.com/a/45962256, but it didn't work as suggested - the driver still spawned on both workers.
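
To illustrate what I mean by approach 1: the only lever I have found is throttling on the Airflow side with a pool, which caps how many submit tasks run at once but doesn't actually limit the Spark cluster itself. A hypothetical sketch (the pool name and slot count are made up, and the pool must be created first, e.g. `airflow pools set spark_submit_pool 3 "cap concurrent Spark apps"`):

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical Airflow-side throttle: at most 3 submit tasks (and hence
# at most 3 Spark applications) run concurrently. This limits submissions,
# not the Spark scheduler itself, so it is only a workaround.
submit_job = SparkSubmitOperator(
    task_id="submit_spark_job",
    conn_id="spark_standalone",
    application="/path/to/job.py",
    pool="spark_submit_pool",   # pre-created pool with 3 slots
    pool_slots=1,               # each submission occupies one slot
)
```

This would avoid the deadlock only if the slot count stays at or below the number of applications the cluster can actually run at once (here, with 8 cores and 2 cores per app, at most 4).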

Is there any other approach or configuration that I can try to resolve this issue?

