<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Row_Num function in spark-sql in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34260#M25030</link>
    <description>&lt;P&gt;I have a doubt: does row_number with ORDER BY in Spark SQL give a different result (non-deterministic output) every time I execute it?&lt;/P&gt;&lt;P&gt;Is it due to parallelism in Spark?&lt;/P&gt;&lt;P&gt;Is there any approach to tackle it?&lt;/P&gt;&lt;P&gt;I order by a date column and an integer column, and take the minimum row_number in the downstream process.&lt;/P&gt;</description>
    <pubDate>Thu, 18 Aug 2022 18:25:47 GMT</pubDate>
    <dc:creator>Hemant</dc:creator>
    <dc:date>2022-08-18T18:25:47Z</dc:date>
    <item>
      <title>Row_Num function in spark-sql</title>
      <link>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34260#M25030</link>
      <description>&lt;P&gt;I have a doubt: does row_number with ORDER BY in Spark SQL give a different result (non-deterministic output) every time I execute it?&lt;/P&gt;&lt;P&gt;Is it due to parallelism in Spark?&lt;/P&gt;&lt;P&gt;Is there any approach to tackle it?&lt;/P&gt;&lt;P&gt;I order by a date column and an integer column, and take the minimum row_number in the downstream process.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Aug 2022 18:25:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34260#M25030</guid>
      <dc:creator>Hemant</dc:creator>
      <dc:date>2022-08-18T18:25:47Z</dc:date>
    </item>
    <item>
      <title>Re: Row_Num function in spark-sql</title>
      <link>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34261#M25031</link>
      <description>&lt;P&gt;@Hemant Kumar&amp;nbsp;- could you please try to coalesce to a single partition in order to generate a continuous, repeatable output?&lt;/P&gt;</description>
      <pubDate>Fri, 19 Aug 2022 22:58:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34261#M25031</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2022-08-19T22:58:30Z</dc:date>
    </item>
    <item>
      <title>Re: Row_Num function in spark-sql</title>
      <link>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/37504#M26375</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/21470"&gt;@Hemant&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If the ORDER BY clause yields a unique ordering, the output is deterministic.&lt;/P&gt;&lt;P&gt;For example:&lt;/P&gt;&lt;P&gt;If we create a row ID for the dataset below with only CustomerID in the ORDER BY clause, then depending on the run, we may get non-deterministic output for CustomerID 100, because either of its two records may be picked as row number 1.&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="33.333333333333336%"&gt;CustomerID&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;SalesID&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;OrderDateTime&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="33.333333333333336%"&gt;100&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;900&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;2023-07-08 10:00:00&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="33.333333333333336%"&gt;100&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;901&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;2023-07-09 10:00:00&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;101&lt;/TD&gt;&lt;TD&gt;902&lt;/TD&gt;&lt;TD&gt;2023-07-09 10:00:00&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;In cases like this, it is recommended to use a column that defines the ordering deterministically. Using the OrderDateTime column in the ORDER BY clause will help achieve a deterministic output.&lt;/P&gt;&lt;P&gt;This is also noted in the official documentation:&amp;nbsp;&lt;A href="https://docs.databricks.com/sql/language-manual/functions/row_number.html#:~:text=If%20the%20order%20is%20not%20unique%2C%20the%20result%20is%20non%2Ddeterministic" target="_blank"&gt;https://docs.databricks.com/sql/language-manual/functions/row_number.html#:~:text=If%20the%20order%20is%20not%20unique%2C%20the%20result%20is%20non%2Ddeterministic&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Jul 2023 12:25:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/37504#M26375</guid>
      <dc:creator>Tharun-Kumar</dc:creator>
      <dc:date>2023-07-12T12:25:09Z</dc:date>
    </item>
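    <item>
      <title>Editor's note: deterministic row_number sketch</title>
      <description>&lt;P&gt;A minimal Spark SQL sketch of the fix discussed in this thread, assuming a hypothetical table named sales with the columns from the example above. Appending the unique SalesID column as a tie-breaker makes the ordering, and therefore the row_number output, deterministic across runs:&lt;/P&gt;&lt;PRE&gt;-- SalesID is unique, so (OrderDateTime, SalesID) is a total order
-- within each CustomerID partition and row_number is repeatable.
SELECT CustomerID,
       SalesID,
       OrderDateTime,
       ROW_NUMBER() OVER (
         PARTITION BY CustomerID
         ORDER BY OrderDateTime, SalesID
       ) AS rn
FROM sales;&lt;/PRE&gt;&lt;P&gt;Downstream, filtering on rn = 1 then always selects the same row per customer.&lt;/P&gt;</description>
    </item>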
  </channel>
</rss>

