<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Incorrect results of row_number() function in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/incorrect-results-of-row-number-function/m-p/65731#M1270</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;In my opinion the result is correct&lt;BR /&gt;What needs to be noted in the result is that it is sorted by the "Onboarding_External_LakehouseId" column so if there is "BK_AccountApplicationId" with the same code, it will be partitioned into 2 row_numbers&lt;/P&gt;&lt;P&gt;Just like in the example below:&lt;BR /&gt;Here there are 2 BK_AccountApplicationId, equal, then there are 2 row_number, the most recent (or greatest) row_number being. "Onboarding_External_LakehouseId" is equal to 7, which is why its row_number is 1&lt;/P&gt;&lt;P&gt;|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7 |1 |&lt;BR /&gt;|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6 |2 |&lt;/P&gt;</description>
    <pubDate>Sun, 07 Apr 2024 17:46:16 GMT</pubDate>
    <dc:creator>ThomazRossito</dc:creator>
    <dc:date>2024-04-07T17:46:16Z</dc:date>
    <item>
      <title>Incorrect results of row_number() function</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/incorrect-results-of-row-number-function/m-p/65695#M1267</link>
      <description>&lt;P&gt;I wrote simple code:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, max
import pyspark.sql.functions as F

streaming_data = spark.read.table("x")
window = Window.partitionBy("BK_AccountApplicationId").orderBy(F.col("Onboarding_External_LakehouseId").desc())
test = streaming_data.select("BK_AccountApplicationId", "Onboarding_External_LakehouseId").distinct()
test1 = test.select("BK_AccountApplicationId", "Onboarding_External_LakehouseId",
                     F.row_number().over(window).alias("row_num")).show(20,truncate=False)&lt;/LI-CODE&gt;&lt;P&gt;I am surprised when I can see below results:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;+------------------------------------+-------------------------------+-------+
|BK_AccountApplicationId             |Onboarding_External_LakehouseId|row_num|
+------------------------------------+-------------------------------+-------+
|abcd0001-5775-4f93-a39a-eefb29cd8ffe|2                              |1      |
|abcd0002-5775-4f93-a39a-eefb29cd8ffe|3                              |1      |
|abcd0003-5775-4f93-a39a-eefb29cd8ffe|4                              |1      |
|abcd0004-5775-4f93-a39a-eefb29cd8ffe|5                              |1      |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7                              |1      |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6                              |2      |
|abcd0006-5775-4f93-a39a-eefb29cd8ffe|8                              |1      |
|abcd0007-5775-4f93-a39a-eefb29cd8ffe|9                              |1      |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|12                             |1      |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|11                             |2      |
+------------------------------------+-------------------------------+-------+&lt;/LI-CODE&gt;&lt;P&gt;so, for one&amp;nbsp;BK_AccountApplicationId&amp;nbsp; you can see multiple LakehouseIDs - why multiple rows with LakehouseId lower than 12 has row_num=1 - does anyone is able to explain mi that?&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2024 10:28:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/incorrect-results-of-row-number-function/m-p/65695#M1267</guid>
      <dc:creator>rocky5</dc:creator>
      <dc:date>2024-04-06T10:28:59Z</dc:date>
    </item>
    <item>
      <title>Re: Incorrect results of row_number() function</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/incorrect-results-of-row-number-function/m-p/65731#M1270</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;In my opinion the result is correct&lt;BR /&gt;What needs to be noted in the result is that it is sorted by the "Onboarding_External_LakehouseId" column so if there is "BK_AccountApplicationId" with the same code, it will be partitioned into 2 row_numbers&lt;/P&gt;&lt;P&gt;Just like in the example below:&lt;BR /&gt;Here there are 2 BK_AccountApplicationId, equal, then there are 2 row_number, the most recent (or greatest) row_number being. "Onboarding_External_LakehouseId" is equal to 7, which is why its row_number is 1&lt;/P&gt;&lt;P&gt;|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7 |1 |&lt;BR /&gt;|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6 |2 |&lt;/P&gt;</description>
      <pubDate>Sun, 07 Apr 2024 17:46:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/incorrect-results-of-row-number-function/m-p/65731#M1270</guid>
      <dc:creator>ThomazRossito</dc:creator>
      <dc:date>2024-04-07T17:46:16Z</dc:date>
    </item>
  </channel>
</rss>

