<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a che in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/44063#M27599</link>
    <description>&lt;P&gt;I agree with you 100%. How can you find a proper value when the workload is completely different between the first run and the following regular runs?&lt;/P&gt;</description>
    <pubDate>Fri, 08 Sep 2023 07:52:48 GMT</pubDate>
    <dc:creator>Thor</dc:creator>
    <dc:date>2023-09-08T07:52:48Z</dc:date>
    <item>
      <title>Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a checkpoint</title>
      <link>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27389#M19263</link>
      <description>&lt;P&gt;Question about Spark checkpoints and offsets in a running stream.&lt;/P&gt;&lt;P&gt;When the stream started I needed a large number of shuffle partitions, so we set it to 5000 with spark.conf.&lt;/P&gt;&lt;P&gt;As expected, the offsets in the checkpoint contain this value and the job used it.&lt;/P&gt;&lt;P&gt;Then we stopped the job and changed the number of partitions to 400, again with spark.conf.&lt;/P&gt;&lt;P&gt;I expected the next batch to still use the previous value (because it is in the offset), but the new offset, calculated when the current batch ends, to use the new value. Instead I still see the 5000 value in newly created offsets.&lt;/P&gt;&lt;P&gt;While some tasks in the job now use the new value of 400, other tasks still use 5000, which is basically killing us now.&lt;/P&gt;&lt;P&gt;I'm quite sure that this worked as expected with a previous version of Spark (this job is Spark 3.2 on Databricks Runtime 10.2), but it no longer does with this job. Any idea what we're doing wrong? I'd be glad for help with this, or for any clue about how I can move the job back to 400 partitions.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Feb 2022 14:55:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27389#M19263</guid>
      <dc:creator>alonisser</dc:creator>
      <dc:date>2022-02-22T14:55:46Z</dc:date>
    </item>
    <item>
      <title>Re: Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a checkpoint</title>
      <link>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27390#M19264</link>
      <description>&lt;P&gt;Hello, @Alon Nisser - My name is Piper, and I'm a moderator for Databricks. Thank you for coming to us with this question. We will give the members a chance to respond before we come back to this if we need to. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Thanks in advance for your patience.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Feb 2022 18:19:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27390#M19264</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-22T18:19:31Z</dc:date>
    </item>
    <item>
      <title>Re: Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a checkpoint</title>
      <link>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27391#M19265</link>
      <description>&lt;P&gt;Hi @Alon Nisser,&lt;/P&gt;&lt;P&gt;The shuffle partitions configuration is persisted in the checkpoint, so even if you change it, the stream itself will continue to use the old value for stateful aggregations. If you want to use a new value, you will need to use a new checkpoint.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Mar 2022 01:32:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27391#M19265</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-03-02T01:32:43Z</dc:date>
    </item>
    <item>
      <title>Re: Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a checkpoint</title>
      <link>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27392#M19266</link>
      <description>&lt;P&gt;This is strange behavior. When a new checkpoint is being calculated (at the end of a batch), why wouldn't the stream use the new spark.conf shuffle.partitions value, given that it is for a new microbatch?&lt;/P&gt;&lt;P&gt;Just removing the checkpoints is a poor solution for a stream that has been running for a long time and where a full backfill doesn't make sense.&lt;/P&gt;&lt;P&gt;I've found that I can edit the checkpoint and change the number, and it works, but it's an ugly workaround.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Mar 2022 07:12:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27392#M19266</guid>
      <dc:creator>alonisser</dc:creator>
      <dc:date>2022-03-02T07:12:08Z</dc:date>
    </item>
    <item>
      <title>Re: Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a checkpoint</title>
      <link>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27393#M19267</link>
      <description>&lt;P&gt;Hi @Alon Nisser,&lt;/P&gt;&lt;P&gt;I understand your point. Modifying the checkpoint folder or its files could produce other issues, so it is recommended to use a new checkpoint instead.&lt;/P&gt;</description>
      <pubDate>Mon, 07 Mar 2022 22:14:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27393#M19267</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-03-07T22:14:55Z</dc:date>
    </item>
    <item>
      <title>Re: Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a checkpoint</title>
      <link>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27394#M19268</link>
      <description>&lt;P&gt;@Jose Gonzalez thanks for that information! This is super useful. I was struggling to understand why my stream was still using 200 partitions. This is quite a pain for me because changing the checkpoint will re-insert all data from the source. Do you know where this can be reported so it can be fixed sometime in the future?&lt;/P&gt;</description>
      <pubDate>Thu, 08 Sep 2022 11:33:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/27394#M19268</guid>
      <dc:creator>Leszek</dc:creator>
      <dc:date>2022-09-08T11:33:51Z</dc:date>
    </item>
    <item>
      <title>Re: Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a che</title>
      <link>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/44063#M27599</link>
      <description>&lt;P&gt;I agree with you 100%. How can you find a proper value when the workload is completely different between the first run and the following regular runs?&lt;/P&gt;</description>
      <pubDate>Fri, 08 Sep 2023 07:52:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/44063#M27599</guid>
      <dc:creator>Thor</dc:creator>
      <dc:date>2023-09-08T07:52:48Z</dc:date>
    </item>
    <item>
      <title>Re: Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a che</title>
      <link>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/92643#M38489</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/29880"&gt;@jose_gonzalez&lt;/a&gt; &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/25768"&gt;@alonisser&lt;/a&gt;, we are facing a similar issue in our pipelines, which use a wide transformation (groupBy) that runs with the default 200 partitions. We want to change it to 20 or 40 partitions. We made that change in the asset bundle and deployed the update to the pipeline, but it is not taking effect.&lt;/P&gt;&lt;P&gt;Can you clarify how to change the checkpoint location so this change can take effect?&lt;/P&gt;</description>
      <pubDate>Thu, 03 Oct 2024 12:18:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/changing-shuffle-partitions-with-spark-conf-in-a-spark-stream/m-p/92643#M38489</guid>
      <dc:creator>PushkarDeole</dc:creator>
      <dc:date>2024-10-03T12:18:09Z</dc:date>
    </item>
  </channel>
</rss>

