<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>All applications stuck in Waiting State on Standalone Spark Cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/all-applications-stuck-in-waiting-state-on-standalone-spark/m-p/17441#M11462</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Spark Standalone Cluster Configuration (Spark 3.0.0)-&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;1 Master&lt;/LI&gt;&lt;LI&gt;2 Workers (4 cores each)&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;I am using the Airflow SparkSubmitOperator to submit jobs to the Spark Master in &lt;I&gt;&lt;B&gt;Cluster&lt;/B&gt;&lt;/I&gt; mode. There are multiple (~20) DAGs on Airflow submitting jobs to Spark, scheduled based on cron expressions. All Spark jobs have the same configuration - 1 driver core, and 1 executor with 1 core.&lt;/P&gt;
&lt;P&gt;At some point Airflow triggers 8 jobs simultaneously, and for these 8 jobs the corresponding 8 drivers are spawned on the Spark Workers. But as I have only 8 cores across my workers, these drivers end up taking all the cores and don't allow any executor to spawn, leaving all 8 applications in the &lt;I&gt;&lt;B&gt;WAITING&lt;/B&gt;&lt;/I&gt; state. [Spark Master UI snapshot] As the snapshot shows, the cluster is effectively deadlocked - nothing moves. Airflow keeps submitting more applications, which queue up; when I kill any of the WAITING applications, a driver from a queued app spawns in its place.&lt;/P&gt;
&lt;P&gt;I can't change the cron expressions on the DAGs to spread these applications out, as they are configured by the users. I tried to solve this problem with the following approaches -&lt;/P&gt;
&lt;OL&gt;&lt;LI&gt;&lt;I&gt;Limit the number of concurrent applications that can run on the Spark cluster&lt;/I&gt; - I couldn't find any such config for a Standalone Spark cluster.&lt;/LI&gt;&lt;LI&gt;&lt;I&gt;Dedicate a worker to drivers&lt;/I&gt; - This way I could use one worker to spawn drivers and the other to spawn executors exclusively. (This doesn't make much sense for a distributed system - if that node goes down, the whole cluster is unusable - but in the current situation I would take that.) After lots of research, the only suggestion I could find was this - &lt;A href="https://stackoverflow.com/a/45962256" target="_blank"&gt;https://stackoverflow.com/a/45962256&lt;/A&gt;, but it didn't work as suggested - drivers were still spawning on both workers.&lt;/LI&gt;&lt;/OL&gt;
&lt;P&gt;Is there any other approach or configuration that I can try to resolve this issue?&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 22 Jul 2021 16:13:14 GMT</pubDate>
    <dc:creator>ShivamRunthala</dc:creator>
    <dc:date>2021-07-22T16:13:14Z</dc:date>
    <item>
      <title>All applications stuck in Waiting State on Standalone Spark Cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/all-applications-stuck-in-waiting-state-on-standalone-spark/m-p/17441#M11462</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Spark Standalone Cluster Configuration (Spark 3.0.0)-&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;1 Master&lt;/LI&gt;&lt;LI&gt;2 Workers (4 cores each)&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;I am using the Airflow SparkSubmitOperator to submit jobs to the Spark Master in &lt;I&gt;&lt;B&gt;Cluster&lt;/B&gt;&lt;/I&gt; mode. There are multiple (~20) DAGs on Airflow submitting jobs to Spark, scheduled based on cron expressions. All Spark jobs have the same configuration - 1 driver core, and 1 executor with 1 core.&lt;/P&gt;
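&lt;P&gt;For reference, a minimal sketch of the per-DAG submission described above, expressed as the keyword arguments passed to SparkSubmitOperator (task id, application path, and connection id are placeholders, not taken from my actual setup):&lt;/P&gt;

```python
# Sketch only: kwargs for SparkSubmitOperator (from the
# apache-airflow-providers-apache-spark package), matching the setup
# above: cluster deploy mode, 1 driver core, 1 executor with 1 core.
# All names and paths are placeholders.
spark_submit_kwargs = dict(
    task_id="submit_spark_job",               # placeholder task id
    application="/jobs/example_job.py",       # placeholder job path
    conn_id="spark_default",                  # spark://master-host:7077
    executor_cores=1,                         # cores per executor
    total_executor_cores=1,                   # i.e. 1 executor with 1 core
    conf={
        "spark.submit.deployMode": "cluster", # driver runs on a worker
        "spark.driver.cores": "1",            # 1 driver core
    },
)
```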
&lt;P&gt;At some point Airflow triggers 8 jobs simultaneously, and for these 8 jobs the corresponding 8 drivers are spawned on the Spark Workers. But as I have only 8 cores across my workers, these drivers end up taking all the cores and don't allow any executor to spawn, leaving all 8 applications in the &lt;I&gt;&lt;B&gt;WAITING&lt;/B&gt;&lt;/I&gt; state. [Spark Master UI snapshot] As the snapshot shows, the cluster is effectively deadlocked - nothing moves. Airflow keeps submitting more applications, which queue up; when I kill any of the WAITING applications, a driver from a queued app spawns in its place.&lt;/P&gt;
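&lt;P&gt;The core accounting behind the deadlock can be sketched as follows (numbers taken from the cluster setup above):&lt;/P&gt;

```python
# Core accounting for the deadlock: 2 workers x 4 cores = 8 cores total.
# In cluster mode, each application's driver is placed first and holds
# 1 core before any executor for that application can be scheduled.
TOTAL_CORES = 2 * 4
DRIVER_CORES = 1

def cores_left_for_executors(n_apps):
    # Cores remaining once every submitted app's driver is scheduled.
    return TOTAL_CORES - n_apps * DRIVER_CORES

# With 8 simultaneous submissions, drivers hold every core, so no
# executor can start and all 8 applications stay WAITING.
assert cores_left_for_executors(8) == 0
```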
&lt;P&gt;I can't change the cron expressions on the DAGs to spread these applications out, as they are configured by the users. I tried to solve this problem with the following approaches -&lt;/P&gt;
&lt;OL&gt;&lt;LI&gt;&lt;I&gt;Limit the number of concurrent applications that can run on the Spark cluster&lt;/I&gt; - I couldn't find any such config for a Standalone Spark cluster.&lt;/LI&gt;&lt;LI&gt;&lt;I&gt;Dedicate a worker to drivers&lt;/I&gt; - This way I could use one worker to spawn drivers and the other to spawn executors exclusively. (This doesn't make much sense for a distributed system - if that node goes down, the whole cluster is unusable - but in the current situation I would take that.) After lots of research, the only suggestion I could find was this - &lt;A href="https://stackoverflow.com/a/45962256" target="_blank"&gt;https://stackoverflow.com/a/45962256&lt;/A&gt;, but it didn't work as suggested - drivers were still spawning on both workers.&lt;/LI&gt;&lt;/OL&gt;
&lt;P&gt;Is there any other approach or configuration that I can try to resolve this issue?&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 22 Jul 2021 16:13:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/all-applications-stuck-in-waiting-state-on-standalone-spark/m-p/17441#M11462</guid>
      <dc:creator>ShivamRunthala</dc:creator>
      <dc:date>2021-07-22T16:13:14Z</dc:date>
    </item>
  </channel>
</rss>