<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Joining huge delta tables in Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93187#M38644</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/79040"&gt;@AlokThampi&lt;/a&gt;&amp;nbsp;please use specific columns to gather stats. The command you used just gathers high level info. Check documentation for syntax to incorporate columns.&lt;/P&gt;</description>
    <pubDate>Tue, 08 Oct 2024 19:03:14 GMT</pubDate>
    <dc:creator>noorbasha534</dc:creator>
    <dc:date>2024-10-08T19:03:14Z</dc:date>
    <item>
      <title>Joining huge delta tables in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93121#M38619</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am trying to join few delta tables as per the code below.&lt;/P&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;SQL&lt;/SPAN&gt;&lt;SPAN&gt;Copy&lt;/SPAN&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;PRE&gt;&lt;SPAN&gt;&lt;SPAN class=""&gt;select&lt;/SPAN&gt; &amp;lt;applicable &lt;SPAN class=""&gt;columns&lt;/SPAN&gt;&amp;gt;
&lt;SPAN class=""&gt;FROM&lt;/SPAN&gt; ReportTable G
&lt;SPAN class=""&gt;LEFT&lt;/SPAN&gt; &lt;SPAN class=""&gt;JOIN&lt;/SPAN&gt; EKBETable EKBE &lt;SPAN class=""&gt;ON&lt;/SPAN&gt; EKBE.BELNR = G.ORDER_ID
&lt;SPAN class=""&gt;LEFT&lt;/SPAN&gt; &lt;SPAN class=""&gt;JOIN&lt;/SPAN&gt; PurchaseOrder POL &lt;SPAN class=""&gt;ON&lt;/SPAN&gt; EKBE.EBELN = POL.PO_NO&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;The PurchaseOrder table contains approximately 2 Billion records and the EKBE table contains ~500 million records. The last join (LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO) has a huge performance hit and the code keeps running for ever. There are duplicate EBELN and PO_NO values in both tables adding more heaviness to the join.&lt;/P&gt;&lt;P&gt;I have run the optimize / zorder on both the tables based on the joining keys as below but still it does't seem to work. Paritioning table isnt ideal as the joining keys are high cardinality columns.&lt;/P&gt;&lt;P&gt;EKBETable : OPTIMIZE EKBETable ZORDER BY (BELNR)&lt;/P&gt;&lt;P&gt;PurchaseOrder : OPTIMIZE PurchaseOrder ZORDER BY (PO_NO)&lt;/P&gt;&lt;P&gt;What would be the best way to optmize this join? I am using the below cluster configuration.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="AlokThampi_0-1728392939237.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11751i419FBE536CAD420F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="AlokThampi_0-1728392939237.png" alt="AlokThampi_0-1728392939237.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Alok&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 08 Oct 2024 13:10:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93121#M38619</guid>
      <dc:creator>AlokThampi</dc:creator>
      <dc:date>2024-10-08T13:10:19Z</dc:date>
    </item>
    <item>
      <title>Re: Joining huge delta tables in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93122#M38620</link>
      <description>&lt;P&gt;have you tried liquid clustering on the source tables?&lt;/P&gt;</description>
      <pubDate>Tue, 08 Oct 2024 13:27:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93122#M38620</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-10-08T13:27:13Z</dc:date>
    </item>
    <item>
      <title>Re: Joining huge delta tables in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93127#M38623</link>
      <description>&lt;P&gt;hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/79040"&gt;@AlokThampi&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;it's difficult to understand what's going on here (not having access to spark UI, query profile or any idea about the dimensions of these tables) but you can give these a try:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;make sure the keys have the same data type (i.e. all three are Long or INT)&lt;/LI&gt;
&lt;LI&gt;you are using 3 tables, while I see you optimized only two of them, and reorganized the data layout using zorder.I highly suggest to use liquid clustering on all three
&lt;OL&gt;
&lt;LI&gt;the ReportTable should be clustered by ORDER_ID.&lt;/LI&gt;
&lt;LI&gt;the EKBETable should be clustered by&amp;nbsp;&lt;SPAN&gt;BELNR&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp;the&amp;nbsp;&lt;SPAN&gt;PurchaseOrder&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;should be clustered by&amp;nbsp;PO_NO.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;SPAN&gt;Let me know if these have any effect or please provide more details to be able to find the bottleneck of this join &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 08 Oct 2024 13:55:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93127#M38623</guid>
      <dc:creator>Mo</dc:creator>
      <dc:date>2024-10-08T13:55:56Z</dc:date>
    </item>
    <item>
      <title>Re: Joining huge delta tables in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93132#M38625</link>
      <description>&lt;P&gt;Not yet, but I will try that now&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 08 Oct 2024 14:20:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93132#M38625</guid>
      <dc:creator>AlokThampi</dc:creator>
      <dc:date>2024-10-08T14:20:59Z</dc:date>
    </item>
    <item>
      <title>Re: Joining huge delta tables in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93135#M38626</link>
      <description>&lt;P&gt;Thanks Mo, I am yet to try liquid clustering, will do that now.&lt;/P&gt;&lt;P&gt;Also, can you please advise on what other details would you require to help me out (&lt;EM&gt;I have not used query profiler very extensively yet&lt;/EM&gt; &lt;span class="lia-unicode-emoji" title=":expressionless_face:"&gt;😑&lt;/span&gt;).&lt;/P&gt;&lt;P&gt;What cluster size would you suggest to handle this worklaod? I am using a standard 16GB, 4 Core worker nodes which scales from 4 to 8. I feel that that might be a little under powered to do the job.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Oct 2024 14:24:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93135#M38626</guid>
      <dc:creator>AlokThampi</dc:creator>
      <dc:date>2024-10-08T14:24:48Z</dc:date>
    </item>
    <item>
      <title>Re: Joining huge delta tables in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93152#M38630</link>
      <description>&lt;P&gt;Hi Alok, try to gather statistics for the important columns. Databricks gathers stats for the first 32 columns of the table by default. This helps in data skipping. Check all your important/frequently used columns are in first 32 positions of the delta table. If not, gather stats of all important columns manually and see if it helps.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Oct 2024 15:02:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93152#M38630</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2024-10-08T15:02:23Z</dc:date>
    </item>
    <item>
      <title>Re: Joining huge delta tables in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93186#M38643</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/28727"&gt;@Mo&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;I tried the liquid clustering option as suggested but it still doesn't seem to work.&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":face_without_mouth:"&gt;😶&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I am assuming it to be an issue with the small cluster size that I am using.Or do you suggest any other options?&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;&amp;nbsp;, the columns that I use are in the first 32 columns and also checked for statistics but didn't work unfortunately&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;ANALYZE&lt;/SPAN&gt; &lt;SPAN&gt;TABLE&lt;/SPAN&gt;&amp;nbsp;&amp;lt;table_name&amp;gt;&amp;nbsp;&lt;SPAN&gt;COMPUTE&lt;/SPAN&gt; &lt;SPAN&gt;STATISTICS&lt;/SPAN&gt;&lt;SPAN&gt;;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 08 Oct 2024 18:57:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93186#M38643</guid>
      <dc:creator>AlokThampi</dc:creator>
      <dc:date>2024-10-08T18:57:47Z</dc:date>
    </item>
    <item>
      <title>Re: Joining huge delta tables in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93187#M38644</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/79040"&gt;@AlokThampi&lt;/a&gt;&amp;nbsp;please use specific columns to gather stats. The command you used just gathers high level info. Check documentation for syntax to incorporate columns.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Oct 2024 19:03:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/joining-huge-delta-tables-in-databricks/m-p/93187#M38644</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2024-10-08T19:03:14Z</dc:date>
    </item>
  </channel>
</rss>

