cancel
Showing results for 
Search instead for 
Did you mean: 
Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.
cancel
Showing results for 
Search instead for 
Did you mean: 

Incorrect results of row_number() function

rocky5
New Contributor III

I wrote simple code:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, max
import pyspark.sql.functions as F

streaming_data = spark.read.table("x")
window = Window.partitionBy("BK_AccountApplicationId").orderBy(F.col("Onboarding_External_LakehouseId").desc())
test = streaming_data.select("BK_AccountApplicationId", "Onboarding_External_LakehouseId").distinct()
test1 = test.select("BK_AccountApplicationId", "Onboarding_External_LakehouseId",
                     F.row_number().over(window).alias("row_num")).show(20,truncate=False)

I am surprised when I can see below results:

+------------------------------------+-------------------------------+-------+
|BK_AccountApplicationId             |Onboarding_External_LakehouseId|row_num|
+------------------------------------+-------------------------------+-------+
|abcd0001-5775-4f93-a39a-eefb29cd8ffe|2                              |1      |
|abcd0002-5775-4f93-a39a-eefb29cd8ffe|3                              |1      |
|abcd0003-5775-4f93-a39a-eefb29cd8ffe|4                              |1      |
|abcd0004-5775-4f93-a39a-eefb29cd8ffe|5                              |1      |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7                              |1      |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6                              |2      |
|abcd0006-5775-4f93-a39a-eefb29cd8ffe|8                              |1      |
|abcd0007-5775-4f93-a39a-eefb29cd8ffe|9                              |1      |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|12                             |1      |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|11                             |2      |
+------------------------------------+-------------------------------+-------+

so, for one BK_AccountApplicationId  you can see multiple LakehouseIDs - why multiple rows with LakehouseId lower than 12 has row_num=1 - does anyone is able to explain mi that?

1 ACCEPTED SOLUTION

Accepted Solutions

ThomazRossito
Contributor

Hi,

In my opinion the result is correct
What needs to be noted in the result is that it is sorted by the "Onboarding_External_LakehouseId" column so if there is "BK_AccountApplicationId" with the same code, it will be partitioned into 2 row_numbers

Just like in the example below:
Here there are 2 BK_AccountApplicationId, equal, then there are 2 row_number, the most recent (or greatest) row_number being. "Onboarding_External_LakehouseId" is equal to 7, which is why its row_number is 1

|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7 |1 |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6 |2 |

Att.
Thomaz Antonio Rossito Neto
Data Technical Lead / Data Engineer-Architect

View solution in original post

1 REPLY 1

ThomazRossito
Contributor

Hi,

In my opinion the result is correct
What needs to be noted in the result is that it is sorted by the "Onboarding_External_LakehouseId" column so if there is "BK_AccountApplicationId" with the same code, it will be partitioned into 2 row_numbers

Just like in the example below:
Here there are 2 BK_AccountApplicationId, equal, then there are 2 row_number, the most recent (or greatest) row_number being. "Onboarding_External_LakehouseId" is equal to 7, which is why its row_number is 1

|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7 |1 |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6 |2 |

Att.
Thomaz Antonio Rossito Neto
Data Technical Lead / Data Engineer-Architect

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group