<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Why Databricks spawns multiple jobs in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12744#M7509</link>
    <description>&lt;P&gt;I have a Delta table spark101.airlines (sourced from `/databricks-datasets/airlines/`) partitioned by `Year`. My `spark.sql.shuffle.partitions` is set to the default of 200. I run a simple query:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;select Origin, count(*)
from spark101.airlines
group by Origin&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;B&gt;Stage 1&lt;/B&gt;: Data is read into 17 partitions, which corresponds to my `spark.sql.files.maxPartitionBytes` setting. This stage also pre-aggregates the data within the scope of each executor and writes it into 200 partitions.&lt;/P&gt;&lt;P&gt;What I would expect:&lt;/P&gt;&lt;P&gt;&lt;B&gt;Stage 2:&lt;/B&gt; It should spawn 200 tasks to read and aggregate the partitions from the previous stage.&lt;/P&gt;&lt;P&gt;What I've got instead:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1703i0E45B5ED02E5BCFD/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;All the other stages add up to 200, but why are separate jobs spawned?&lt;/P&gt;</description>
    <pubDate>Sun, 24 Jul 2022 12:31:17 GMT</pubDate>
    <dc:creator>pawelmitrus</dc:creator>
    <dc:date>2022-07-24T12:31:17Z</dc:date>
    <item>
      <title>Why Databricks spawns multiple jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12744#M7509</link>
      <description>&lt;P&gt;I have a Delta table spark101.airlines (sourced from `/databricks-datasets/airlines/`) partitioned by `Year`. My `spark.sql.shuffle.partitions` is set to the default of 200. I run a simple query:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;select Origin, count(*)
from spark101.airlines
group by Origin&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;B&gt;Stage 1&lt;/B&gt;: Data is read into 17 partitions, which corresponds to my `spark.sql.files.maxPartitionBytes` setting. This stage also pre-aggregates the data within the scope of each executor and writes it into 200 partitions.&lt;/P&gt;&lt;P&gt;What I would expect:&lt;/P&gt;&lt;P&gt;&lt;B&gt;Stage 2:&lt;/B&gt; It should spawn 200 tasks to read and aggregate the partitions from the previous stage.&lt;/P&gt;&lt;P&gt;What I've got instead:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1703i0E45B5ED02E5BCFD/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;All the other stages add up to 200, but why are separate jobs spawned?&lt;/P&gt;</description>
      <pubDate>Sun, 24 Jul 2022 12:31:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12744#M7509</guid>
      <dc:creator>pawelmitrus</dc:creator>
      <dc:date>2022-07-24T12:31:17Z</dc:date>
    </item>
    <item>
      <title>Re: Why Databricks spawns multiple jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12745#M7510</link>
      <description>&lt;P&gt;Jobs get spawned by actions.&lt;/P&gt;&lt;P&gt;So it seems you have multiple actions in your code.&lt;/P&gt;&lt;P&gt;Is the code snippet you posted the whole notebook?&lt;/P&gt;</description>
      <pubDate>Mon, 25 Jul 2022 07:26:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12745#M7510</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-07-25T07:26:37Z</dc:date>
    </item>
    <item>
      <title>Re: Why Databricks spawns multiple jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12746#M7511</link>
      <description>&lt;P&gt;Yeah, this is all I've got. Some things I should also mention:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Databricks Runtime 10.4 LTS&lt;/LI&gt;&lt;LI&gt;I have disabled AQE&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;It looks like Databricks has some approach to creating jobs/stages along these lines:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;start with 1&lt;/LI&gt;&lt;LI&gt;multiply by 4; if not enough then...&lt;/LI&gt;&lt;LI&gt;multiply by 5; if not enough then...&lt;/LI&gt;&lt;LI&gt;multiply by 5; if not enough then...&lt;/LI&gt;&lt;LI&gt;take the rest&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;so eventually it is 1 + 4 + 20 + 100 + 75 = 200&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2022 17:09:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12746#M7511</guid>
      <dc:creator>pawelmitrus</dc:creator>
      <dc:date>2022-07-26T17:09:24Z</dc:date>
    </item>
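The incremental pattern described in the reply above (1, 4, 20, 100, then the remainder) can be sketched in plain Python. This is an illustrative guess at the observed behaviour, not Databricks' actual implementation; the function name and the scale-up factors `(4, 5, 5)` are assumptions inferred purely from the numbers reported in the thread:

```python
def job_partition_batches(total_partitions, start=1, factors=(4, 5, 5)):
    """Sketch of the observed incremental job sizing: start with one
    partition, scale the batch size up for each following job, then
    finish with whatever partitions remain in a single last job."""
    batches = []
    size = start
    remaining = total_partitions
    step = 0
    while remaining > 0:
        take = min(size, remaining)
        batches.append(take)
        remaining -= take
        if step < len(factors):
            size *= factors[step]  # scale up: x4, then x5, then x5
            step += 1
        else:
            size = remaining  # final job takes the rest
    return batches

# With the default 200 shuffle partitions this reproduces the
# batches reported in the thread: [1, 4, 20, 100, 75].
```

This mirrors how Spark's `executeTake` path collects results incrementally (controlled by `spark.sql.limit.scaleUpFactor`), which is consistent with the later observation that writing the result to storage instead of returning it to the notebook produces a single job.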
    <item>
      <title>Re: Why Databricks spawns multiple jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12747#M7512</link>
      <description>&lt;P&gt;I think it is something that Databricks does when running a query whose result is returned to the notebook. When I write the result of this SQL statement to storage instead, it's only 1 job with 2 stages, as expected.&lt;/P&gt;</description>
      <pubDate>Sun, 31 Jul 2022 13:05:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12747#M7512</guid>
      <dc:creator>pawelmitrus</dc:creator>
      <dc:date>2022-07-31T13:05:24Z</dc:date>
    </item>
    <item>
      <title>Re: Why Databricks spawns multiple jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12748#M7513</link>
      <description>&lt;P&gt;Could you please paste the query plan here so we can analyse the issue?&lt;/P&gt;</description>
      <pubDate>Thu, 01 Sep 2022 07:01:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-databricks-spawns-multiple-jobs/m-p/12748#M7513</guid>
      <dc:creator>User16753725469</dc:creator>
      <dc:date>2022-09-01T07:01:18Z</dc:date>
    </item>
  </channel>
</rss>

