<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to avoid iteration/loop in databricks in the given scenario in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/89148#M37704</link>
    <description>&lt;P&gt;Thank you so much for the link.&lt;/P&gt;</description>
    <pubDate>Mon, 09 Sep 2024 07:51:14 GMT</pubDate>
    <dc:creator>AnnieWhite</dc:creator>
    <dc:date>2024-09-09T07:51:14Z</dc:date>
    <item>
      <title>How to avoid iteration/loop in databricks in the given scenario</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81308#M36321</link>
      <description>&lt;P&gt;Hi all, I need your input.&lt;/P&gt;&lt;P&gt;I am new to Databricks and working with a dataset that consists of around 10,000 systems, each containing approximately 100 to 150 parts. These parts have attributes such as name, version, and serial number. The dataset size is around 10 million records.&lt;/P&gt;&lt;P&gt;The task is to retrieve the parts associated with each system for a particular day and compare them with the parts from the previous day. If there are any changes, such as modifications, deletions, or additions, I need to create a union of all these changes and store the result in Databricks. This updated dataset will then be used for comparison with the parts available on the next day, and the process repeats.&lt;/P&gt;&lt;P&gt;Below, I've provided an example of the input and expected output for a particular system.&lt;/P&gt;&lt;P&gt;Currently, I'm iterating over days for each system and calculating the deltas using left anti and inner joins in each iteration. However, this process is taking a considerable amount of time in Databricks.&lt;/P&gt;&lt;P&gt;Is there a more efficient way to handle this without the iterative approach? I would appreciate any suggestions.&lt;/P&gt;&lt;P&gt;Input:&lt;/P&gt;&lt;TABLE border="0" width="435" cellspacing="0" cellpadding="0"&gt;&lt;COLGROUP&gt;&lt;COL width="87" /&gt;&lt;/COLGROUP&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="87" height="21"&gt;system&lt;/TD&gt;&lt;TD width="87"&gt;logDate&lt;/TD&gt;&lt;TD width="87"&gt;name&lt;/TD&gt;&lt;TD width="87"&gt;version&lt;/TD&gt;&lt;TD width="87"&gt;serial&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/24/24&lt;/TD&gt;&lt;TD&gt;name1&lt;/TD&gt;&lt;TD&gt;version1&lt;/TD&gt;&lt;TD&gt;number1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/24/24&lt;/TD&gt;&lt;TD&gt;name2&lt;/TD&gt;&lt;TD&gt;version2&lt;/TD&gt;&lt;TD&gt;number2&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/25/24&lt;/TD&gt;&lt;TD&gt;name1&lt;/TD&gt;&lt;TD&gt;version1&lt;/TD&gt;&lt;TD&gt;number1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/25/24&lt;/TD&gt;&lt;TD&gt;name2&lt;/TD&gt;&lt;TD&gt;version2&lt;/TD&gt;&lt;TD&gt;number2&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/26/24&lt;/TD&gt;&lt;TD&gt;name1&lt;/TD&gt;&lt;TD&gt;version1&lt;/TD&gt;&lt;TD&gt;number1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/26/24&lt;/TD&gt;&lt;TD&gt;name2&lt;/TD&gt;&lt;TD&gt;version2&lt;/TD&gt;&lt;TD&gt;number2&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/26/24&lt;/TD&gt;&lt;TD&gt;name3&lt;/TD&gt;&lt;TD&gt;version3&lt;/TD&gt;&lt;TD&gt;number3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/27/24&lt;/TD&gt;&lt;TD&gt;name1&lt;/TD&gt;&lt;TD&gt;version1&lt;/TD&gt;&lt;TD&gt;number1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/27/24&lt;/TD&gt;&lt;TD&gt;name2&lt;/TD&gt;&lt;TD&gt;version2.1&lt;/TD&gt;&lt;TD&gt;number2&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/27/24&lt;/TD&gt;&lt;TD&gt;name3&lt;/TD&gt;&lt;TD&gt;version3&lt;/TD&gt;&lt;TD&gt;number3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/30/24&lt;/TD&gt;&lt;TD&gt;name2&lt;/TD&gt;&lt;TD&gt;version2.1&lt;/TD&gt;&lt;TD&gt;number2&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/30/24&lt;/TD&gt;&lt;TD&gt;name3&lt;/TD&gt;&lt;TD&gt;version3&lt;/TD&gt;&lt;TD&gt;number3.1&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;Output;&lt;/P&gt;&lt;TABLE border="0" width="435" cellspacing="0" cellpadding="0"&gt;&lt;COLGROUP&gt;&lt;COL width="87" /&gt;&lt;/COLGROUP&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="87" height="21"&gt;system&lt;/TD&gt;&lt;TD width="87"&gt;logDate&lt;/TD&gt;&lt;TD width="87"&gt;name&lt;/TD&gt;&lt;TD width="87"&gt;version&lt;/TD&gt;&lt;TD width="87"&gt;serial&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/24/24&lt;/TD&gt;&lt;TD&gt;name1&lt;/TD&gt;&lt;TD&gt;version1&lt;/TD&gt;&lt;TD&gt;number1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/24/24&lt;/TD&gt;&lt;TD&gt;name2&lt;/TD&gt;&lt;TD&gt;version2&lt;/TD&gt;&lt;TD&gt;number2&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/26/24&lt;/TD&gt;&lt;TD&gt;name1&lt;/TD&gt;&lt;TD&gt;version1&lt;/TD&gt;&lt;TD&gt;number1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/26/24&lt;/TD&gt;&lt;TD&gt;name2&lt;/TD&gt;&lt;TD&gt;version2&lt;/TD&gt;&lt;TD&gt;number2&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/26/24&lt;/TD&gt;&lt;TD&gt;name3&lt;/TD&gt;&lt;TD&gt;version3&lt;/TD&gt;&lt;TD&gt;number3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/27/24&lt;/TD&gt;&lt;TD&gt;name1&lt;/TD&gt;&lt;TD&gt;version1&lt;/TD&gt;&lt;TD&gt;number1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/27/24&lt;/TD&gt;&lt;TD&gt;name2&lt;/TD&gt;&lt;TD&gt;version2.1&lt;/TD&gt;&lt;TD&gt;number2&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/27/24&lt;/TD&gt;&lt;TD&gt;name3&lt;/TD&gt;&lt;TD&gt;version3&lt;/TD&gt;&lt;TD&gt;number3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/30/24&lt;/TD&gt;&lt;TD&gt;name1&lt;/TD&gt;&lt;TD&gt;version1&lt;/TD&gt;&lt;TD&gt;number1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/30/24&lt;/TD&gt;&lt;TD&gt;name2&lt;/TD&gt;&lt;TD&gt;version2.1&lt;/TD&gt;&lt;TD&gt;number2&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="21"&gt;12345678&lt;/TD&gt;&lt;TD&gt;6/30/24&lt;/TD&gt;&lt;TD&gt;name3&lt;/TD&gt;&lt;TD&gt;version3&lt;/TD&gt;&lt;TD&gt;number3.1&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
      <pubDate>Wed, 31 Jul 2024 12:21:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81308#M36321</guid>
      <dc:creator>shri0509</dc:creator>
      <dc:date>2024-07-31T12:21:12Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid iteration/loop in databricks in the given scenario</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81409#M36322</link>
      <description>&lt;P&gt;HI &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/27"&gt;@ogdendc&lt;/a&gt; ,&lt;/P&gt;&lt;P&gt;I have read your your beware of for loop, but in the scenario described above how to tackle it.&lt;/P&gt;&lt;P&gt;Please guide.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Aug 2024 05:14:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81409#M36322</guid>
      <dc:creator>shri0509</dc:creator>
      <dc:date>2024-08-01T05:14:23Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid iteration/loop in databricks in the given scenario</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81410#M36323</link>
      <description>&lt;P&gt;Have tried Window function with lag, however the data i have in not sequencial, what i mean is if i have on Jan 1, then next data be on Jan 5, then Jan 20 and so on. With window function i am able to get modified, common however data deleted is getting missed.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Aug 2024 05:20:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81410#M36323</guid>
      <dc:creator>shri0509</dc:creator>
      <dc:date>2024-08-01T05:20:16Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid iteration/loop in databricks in the given scenario</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81477#M36324</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;any suggestion..&lt;/P&gt;</description>
      <pubDate>Thu, 01 Aug 2024 11:00:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81477#M36324</guid>
      <dc:creator>shri0509</dc:creator>
      <dc:date>2024-08-01T11:00:47Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid iteration/loop in databricks in the given scenario</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81482#M36325</link>
      <description>&lt;P&gt;I'm not sure I fully understand the question, you are right to avoid loops as they are very inefficient in spark. Some techniques that might help you:&lt;/P&gt;
&lt;H4&gt;Self join using date_add&lt;/H4&gt;
&lt;P&gt;Join the table to itself on system identifiers, but the date-1, ie&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;select T1.*, T2.log_date as log_date_yest,&amp;nbsp;T2.serial as serial_yest&lt;BR /&gt;from yourTable T1&lt;BR /&gt;inner join yourTable T2&lt;BR /&gt;on T1.system = T2.system&lt;BR /&gt;and T1.log_date = date_add(T2.log_date, -1)&amp;nbsp;&lt;/PRE&gt;
&lt;P&gt;date add docs:&amp;nbsp;&lt;A href="https://docs.databricks.com/en/sql/language-manual/functions/date_add.html" target="_blank"&gt;https://docs.databricks.com/en/sql/language-manual/functions/date_add.html&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;H4&gt;Merge into&lt;/H4&gt;
&lt;P&gt;If you wanted a table with only the latest values in it, you could use MERGE INTO. Say you wanted a table that only reflected the latest values, the code might look something like:&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN class="n"&gt;MERGE&lt;/SPAN&gt; &lt;SPAN class="k"&gt;INTO&lt;/SPAN&gt; yourTable &lt;SPAN class="k"&gt;USING&lt;/SPAN&gt; &lt;SPAN class="k"&gt;newData&lt;/SPAN&gt;
  &lt;SPAN class="k"&gt;ON&lt;/SPAN&gt; yourTable&lt;SPAN class="p"&gt;.&lt;SPAN&gt;system&lt;/SPAN&gt;&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; &lt;SPAN class="k"&gt;newData&lt;/SPAN&gt;&lt;SPAN class="p"&gt;.&lt;SPAN&gt;system &lt;/SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN class="k"&gt;  AND yourTable&lt;SPAN class="p"&gt;.&lt;SPAN&gt;logDate&lt;/SPAN&gt;&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; newData&lt;SPAN class="p"&gt;.&lt;SPAN&gt;logDate&lt;BR /&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;  &lt;SPAN class="k"&gt;AND yourTable&lt;SPAN class="p"&gt;.&lt;SPAN&gt;name&lt;/SPAN&gt;&lt;/SPAN&gt; &lt;SPAN class="o"&gt;=&lt;/SPAN&gt; newData&lt;SPAN class="p"&gt;.&lt;SPAN&gt;name&lt;/SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;  WHEN NOT MATCHED THEN INSERT *
  &lt;SPAN class="k"&gt;WHEN&lt;/SPAN&gt; &lt;SPAN class="n"&gt;MATCHED&lt;/SPAN&gt; &lt;SPAN class="k"&gt;AND&lt;/SPAN&gt; &lt;SPAN class="k"&gt;yourTable&lt;/SPAN&gt;&lt;SPAN class="p"&gt;.&lt;/SPAN&gt;&lt;SPAN class="n"&gt;version&lt;/SPAN&gt; &amp;gt; &lt;SPAN class="k"&gt;newData&lt;/SPAN&gt;&lt;SPAN class="p"&gt;.&lt;SPAN class="n"&gt;version&lt;/SPAN&gt;&lt;/SPAN&gt; &lt;SPAN class="k"&gt;THEN&lt;/SPAN&gt; &lt;SPAN class="k"&gt;UPDATE&lt;/SPAN&gt; &lt;SPAN class="k"&gt;SET&lt;/SPAN&gt; &lt;SPAN class="o"&gt;* --you'd have to create logic to define what data is new  &lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;Much better examples here:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html" target="_blank"&gt;https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html&lt;/A&gt;&lt;/P&gt;
&lt;H4&gt;Language&lt;/H4&gt;
&lt;P&gt;These are in SQL, you could rewrite them in python it won't have much impact on the performance.&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;Tuning&lt;/H4&gt;
&lt;P&gt;Hopefully, this will run much faster. As for right sizing your cluster you have two approaches:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;learn spark internals. learn detailed VM infrastructure. follow this giant guide &lt;A href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide" target="_blank"&gt;https://www.databricks.com/discover/pages/optimize-data-workloads-guide&lt;/A&gt;. Time estimate: 100+ hours&lt;/LI&gt;
&lt;LI&gt;use serverless. Time estimate: 5 seconds.&amp;nbsp;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Hope this is useful! Let us know how you get on. Holly&lt;/P&gt;</description>
      <pubDate>Thu, 01 Aug 2024 12:23:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/81482#M36325</guid>
      <dc:creator>holly</dc:creator>
      <dc:date>2024-08-01T12:23:24Z</dc:date>
    </item>
    <item>
      <title>Re: How to avoid iteration/loop in databricks in the given scenario</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/89148#M37704</link>
      <description>&lt;P&gt;Thank you so much for the link.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Sep 2024 07:51:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-iteration-loop-in-databricks-in-the-given-scenario/m-p/89148#M37704</guid>
      <dc:creator>AnnieWhite</dc:creator>
      <dc:date>2024-09-09T07:51:14Z</dc:date>
    </item>
  </channel>
</rss>

