<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Spark Optimization in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99077#M39901</link>
    <description>&lt;P&gt;&lt;STRONG&gt;Optimizing Shuffle Partition Size in Spark for Large Joins&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;I am working on a Spark join between two tables of sizes 300 GB and 5 GB, respectively. After analyzing the Spark UI, I noticed the following:&lt;BR /&gt;- The &lt;STRONG&gt;average shuffle write partition size&lt;/STRONG&gt; for the larger table (300 GB) is around &lt;STRONG&gt;800 MB&lt;/STRONG&gt;.&lt;BR /&gt;- The &lt;STRONG&gt;average shuffle write partition size&lt;/STRONG&gt; for the smaller table (5 GB) is just &lt;STRONG&gt;1 MB&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;I've learned that a &lt;STRONG&gt;shuffle write partition size of around 200 MB&lt;/STRONG&gt; is ideal for my use case, but I’m not sure how to achieve this in Spark.&lt;/P&gt;
&lt;P&gt;I've tried the following configurations:&lt;BR /&gt;1. `spark.conf.set("spark.sql.shuffle.partitions", 1000)` — to set the number of shuffle partitions.&lt;BR /&gt;2. `spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "150MB")` — to adjust post-shuffle input size.&lt;/P&gt;
&lt;P&gt;Despite these changes, the partition sizes are still not as expected.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;How can I tune the shuffle partition size to around 200 MB in Spark, specifically for the larger table, to optimize join performance?&lt;/STRONG&gt;&lt;/P&gt;
</description>
    <pubDate>Mon, 18 Nov 2024 07:04:01 GMT</pubDate>
    <dc:creator>genevive_mdonça</dc:creator>
    <dc:date>2024-11-18T07:04:01Z</dc:date>
    <item>
      <title>Spark Optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99077#M39901</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Optimizing Shuffle Partition Size in Spark for Large Joins&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;I am working on a Spark join between two tables of sizes 300 GB and 5 GB, respectively. After analyzing the Spark UI, I noticed the following:&lt;BR /&gt;- The &lt;STRONG&gt;average shuffle write partition size&lt;/STRONG&gt; for the larger table (300 GB) is around &lt;STRONG&gt;800 MB&lt;/STRONG&gt;.&lt;BR /&gt;- The &lt;STRONG&gt;average shuffle write partition size&lt;/STRONG&gt; for the smaller table (5 GB) is just &lt;STRONG&gt;1 MB&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;I've learned that a &lt;STRONG&gt;shuffle write partition size of around 200 MB&lt;/STRONG&gt; is ideal for my use case, but I’m not sure how to achieve this in Spark.&lt;/P&gt;
&lt;P&gt;I've tried the following configurations:&lt;BR /&gt;1. `spark.conf.set("spark.sql.shuffle.partitions", 1000)` — to set the number of shuffle partitions.&lt;BR /&gt;2. `spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "150MB")` — to adjust post-shuffle input size.&lt;/P&gt;
&lt;P&gt;Despite these changes, the partition sizes are still not as expected.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;How can I tune the shuffle partition size to around 200 MB in Spark, specifically for the larger table, to optimize join performance?&lt;/STRONG&gt;&lt;/P&gt;
</description>
      <pubDate>Mon, 18 Nov 2024 07:04:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99077#M39901</guid>
      <dc:creator>genevive_mdonça</dc:creator>
      <dc:date>2024-11-18T07:04:01Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99127#M39909</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/47151"&gt;@genevive_mdonça&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;You have to calculate the correct number of shuffle partitions for your case, taking the cluster configuration into account.&lt;/P&gt;
&lt;P&gt;Please follow this doc to calculate it:&amp;nbsp;&lt;A href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide" target="_blank"&gt;https://www.databricks.com/discover/pages/optimize-data-workloads-guide&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2024 10:49:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99127#M39909</guid>
      <dc:creator>MuthuLakshmi</dc:creator>
      <dc:date>2024-11-18T10:49:03Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99223#M39929</link>
      <description>&lt;P&gt;Have you tried using&amp;nbsp;&lt;SPAN data-sheets-root="1"&gt;spark.sql.files.maxPartitionBytes=209715200&lt;/SPAN&gt;?&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2024 17:05:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99223#M39929</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2024-11-18T17:05:05Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99286#M39947</link>
      <description>&lt;P&gt;Thanks, will go through this.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Nov 2024 05:41:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99286#M39947</guid>
      <dc:creator>genevive_mdonça</dc:creator>
      <dc:date>2024-11-19T05:41:28Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99301#M39956</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/47151"&gt;@genevive_mdonça&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;You can use the following formula to calculate the optimal number of partitions from the size of the input data and the target partition size:&lt;/P&gt;&lt;P&gt;Input stage data = 300 GB&lt;BR /&gt;Target size = 200 MB&lt;BR /&gt;Optimal number of partitions = 300,000 MB / 200 MB = 1,500 partitions&lt;BR /&gt;spark.conf.set("spark.sql.shuffle.partitions", 1500)&lt;BR /&gt;Remember that the number of partitions should usually not be less than the number of cores.&lt;/P&gt;&lt;P&gt;That said, Adaptive Query Execution (AQE) should be enabled by default, in which case Spark can dynamically optimize the partition size based on runtime statistics.&lt;/P&gt;</description>
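The arithmetic in this reply can be sketched in plain Python. This is only a rough illustration: the helper name shuffle_partition_count and the 64-core cluster are assumptions made for the example, not values from the thread, and sizes use decimal megabytes so that 300 GB equals 300,000 MB as in the reply.

```python
MB = 1000 ** 2  # decimal megabytes, matching 300 GB = 300,000 MB in the reply


def shuffle_partition_count(shuffle_bytes, target_bytes, total_cores):
    """Estimate a value for spark.sql.shuffle.partitions (hypothetical helper)."""
    # Integer ceiling division, so the average partition does not exceed the target.
    by_size = (shuffle_bytes + target_bytes - 1) // target_bytes
    # Rule of thumb from the reply: do not go below the number of cores.
    return max(by_size, total_cores)


# total_cores=64 is an assumed cluster size, used only for illustration.
partitions = shuffle_partition_count(300_000 * MB, 200 * MB, total_cores=64)
print(partitions)
```

The result would then be passed to spark.conf.set("spark.sql.shuffle.partitions", partitions). With AQE enabled, Spark may still coalesce or split partitions at runtime, so the observed sizes can differ from the target.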
      <pubDate>Tue, 19 Nov 2024 08:54:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-optimization/m-p/99301#M39956</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-11-19T08:54:35Z</dc:date>
    </item>
  </channel>
</rss>

