<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Why is Delta Lake creating a 238.0TiB shuffle on merge? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8778#M4319</link>
    <description>&lt;P&gt;Hi @Jordan Yaker,&lt;/P&gt;&lt;P&gt;Hope all is well!&lt;/P&gt;&lt;P&gt;Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Tue, 25 Apr 2023 11:01:19 GMT</pubDate>
    <dc:creator>Vartika</dc:creator>
    <dc:date>2023-04-25T11:01:19Z</dc:date>
    <item>
      <title>Why is Delta Lake creating a 238.0TiB shuffle on merge?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8771#M4312</link>
      <description>&lt;P&gt;I'm frankly at a loss here. I have a task that consistently performs terribly. I took some time this morning to debug it, and the physical plan shows a 238 TiB shuffle:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;== Physical Plan ==
AdaptiveSparkPlan (40)
+- == Current Plan ==
   SerializeFromObject (22)
   +- MapPartitions (21)
      +- DeserializeToObject (20)
         +- Project (19)
            +- ObjectHashAggregate (18)
               +- Exchange (17)
                  +- ObjectHashAggregate (16)
                     +- ObjectHashAggregate (15)
                        +- ShuffleQueryStage (14), Statistics(sizeInBytes=238.0 TiB)
                           +- Exchange (13)
                              +- ObjectHashAggregate (12)
                                 +- * Project (11)
                                    +- CartesianProduct Inner (10)
                                       :- * Project (5)
                                       :  +- * Filter (4)
                                       :     +- * Project (3)
                                       :        +- * ColumnarToRow (2)
                                       :           +- Scan parquet  (1)
                                       +- * Project (9)
                                          +- * Project (8)
                                             +- * ColumnarToRow (7)
                                                +- Scan parquet  (6)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I could understand this number if I were working with a lot of data, but I'm not. The Cartesian product in this query produces 125 rows, as shown below, so it's not my merge logic.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/610i32FE27639AACE346/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Additionally, the output table isn't very big either; it's 15 files, none larger than 10 MB (&lt;B&gt;NOTE: &lt;/B&gt;I could definitely do some repartitioning here for a better setup, but that's another story).&lt;/P&gt;&lt;P&gt;I feel like I'm at my wits' end with this problem. Any ideas would be appreciated.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Feb 2023 16:28:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8771#M4312</guid>
      <dc:creator>JordanYaker</dc:creator>
      <dc:date>2023-02-24T16:28:53Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Delta Lake creating a 238.0TiB shuffle on merge?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8772#M4313</link>
      <description>&lt;P&gt;I'm not sure what the problem is, but I'll walk you through my thinking and ideas.&lt;/P&gt;&lt;P&gt;The deserialize/map/serialize steps in the plan: is that a case class in Scala?&lt;/P&gt;&lt;P&gt;How big are the two tables you're joining?&lt;/P&gt;</description>
      <pubDate>Fri, 24 Feb 2023 18:48:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8772#M4313</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-02-24T18:48:25Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Delta Lake creating a 238.0TiB shuffle on merge?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8773#M4314</link>
      <description>&lt;P&gt;@Joseph Kambourakis One table is 1.5 MB. The other is about 80 MB.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Feb 2023 18:50:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8773#M4314</guid>
      <dc:creator>JordanYaker</dc:creator>
      <dc:date>2023-02-24T18:50:28Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Delta Lake creating a 238.0TiB shuffle on merge?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8774#M4315</link>
      <description>&lt;P&gt;Hmm, then it doesn't make sense that it would create much data in a shuffle or anywhere else. What does the shuffle look like in the plan? It should show the data written and read for that part.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Feb 2023 18:54:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8774#M4315</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-02-24T18:54:41Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Delta Lake creating a 238.0TiB shuffle on merge?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8775#M4316</link>
      <description>&lt;P&gt;Not very big.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/607i5D2D42A99B831507/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;What's interesting is that this stage ran for 7 hours, and most of that was scheduler delay.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Feb 2023 18:58:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8775#M4316</guid>
      <dc:creator>JordanYaker</dc:creator>
      <dc:date>2023-02-24T18:58:47Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Delta Lake creating a 238.0TiB shuffle on merge?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8776#M4317</link>
      <description>&lt;P&gt;The input size and record count look like what you'd expect from the table sizes, and thankfully it's not actually creating 238 TiB. That said, I'm not exactly sure what the problem is in that stage, but there is definitely something going on with that length of time.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Feb 2023 19:00:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8776#M4317</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-02-24T19:00:42Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Delta Lake creating a 238.0TiB shuffle on merge?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8777#M4318</link>
      <description>&lt;P&gt;At this point, I'm honestly wondering if it isn't just a quirk of the merge logic.&lt;/P&gt;&lt;P&gt;I tried running a join between the output files and what would be the input to my MERGE statement. I ran an explain on that query, and it ends up creating a BroadcastNestedLoopJoin. More often than not, nested loop joins have bedeviled my performance. I'm going to try splitting the merge into two separate calls and see if that does the trick.&lt;/P&gt;&lt;P&gt;It might just be that the explain on a MERGE doesn't show this because of how merges are executed.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Feb 2023 19:02:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8777#M4318</guid>
      <dc:creator>JordanYaker</dc:creator>
      <dc:date>2023-02-24T19:02:34Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Delta Lake creating a 238.0TiB shuffle on merge?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8778#M4319</link>
      <description>&lt;P&gt;Hi @Jordan Yaker,&lt;/P&gt;&lt;P&gt;Hope all is well!&lt;/P&gt;&lt;P&gt;Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 25 Apr 2023 11:01:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8778#M4319</guid>
      <dc:creator>Vartika</dc:creator>
      <dc:date>2023-04-25T11:01:19Z</dc:date>
    </item>
    <item>
      <title>Re: Why is Delta Lake creating a 238.0TiB shuffle on merge?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8779#M4320</link>
      <description>&lt;P&gt;It turned out to be the BroadcastNestedLoopJoin. Once I reworked my logic to remove that, the performance cleared up.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Apr 2023 18:42:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-is-delta-lake-creating-a-238-0tib-shuffle-on-merge/m-p/8779#M4320</guid>
      <dc:creator>JordanYaker</dc:creator>
      <dc:date>2023-04-25T18:42:03Z</dc:date>
    </item>
  </channel>
</rss>

