Incorrect results of row_number() function

rocky5 — Sat, 06 Apr 2024 10:28:59 GMT

I wrote some simple code:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, max
import pyspark.sql.functions as F

streaming_data = spark.read.table("x")
window = Window.partitionBy("BK_AccountApplicationId").orderBy(F.col("Onboarding_External_LakehouseId").desc())
test = streaming_data.select("BK_AccountApplicationId", "Onboarding_External_LakehouseId").distinct()
test1 = test.select(
    "BK_AccountApplicationId",
    "Onboarding_External_LakehouseId",
    F.row_number().over(window).alias("row_num"),
).show(20, truncate=False)

I was surprised to see the results below:

+------------------------------------+-------------------------------+-------+
|BK_AccountApplicationId             |Onboarding_External_LakehouseId|row_num|
+------------------------------------+-------------------------------+-------+
|abcd0001-5775-4f93-a39a-eefb29cd8ffe|2                              |1      |
|abcd0002-5775-4f93-a39a-eefb29cd8ffe|3                              |1      |
|abcd0003-5775-4f93-a39a-eefb29cd8ffe|4                              |1      |
|abcd0004-5775-4f93-a39a-eefb29cd8ffe|5                              |1      |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7                              |1      |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6                              |2      |
|abcd0006-5775-4f93-a39a-eefb29cd8ffe|8                              |1      |
|abcd0007-5775-4f93-a39a-eefb29cd8ffe|9                              |1      |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|12                             |1      |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|11                             |2      |
+------------------------------------+-------------------------------+-------+

So one BK_AccountApplicationId can have multiple LakehouseIds. Why do multiple rows with a LakehouseId lower than 12 have row_num=1? Can anyone explain that to me?

Re: Incorrect results of row_number() function

ThomazRossito — Sun, 07 Apr 2024 17:46:16 GMT

Hi,

In my opinion the result is correct.
What needs to be noted is that the window is partitioned by "BK_AccountApplicationId" and sorted by the "Onboarding_External_LakehouseId" column in descending order, so row_number restarts at 1 for each "BK_AccountApplicationId". If the same "BK_AccountApplicationId" appears in 2 rows, those rows get row_numbers 1 and 2.

Just like in the example below:
Here there are 2 rows with the same BK_AccountApplicationId, so there are 2 row_numbers. The most recent (or greatest) "Onboarding_External_LakehouseId" is equal to 7, which is why its row_number is 1.

|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7 |1 |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6 |2 |
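
To make the partitioning behaviour concrete, here is a minimal plain-Python sketch that mimics what the window in the question computes. It is not Spark code: the helper names row_number_per_partition and row_number_global are hypothetical, and the rows are simply copied from the output in the question. It only illustrates that the counter restarts for each BK_AccountApplicationId, in contrast to a single ranking over all rows.

```python
from itertools import groupby

# Rows copied from the output in the question:
# (BK_AccountApplicationId, Onboarding_External_LakehouseId)
SUFFIX = "-5775-4f93-a39a-eefb29cd8ffe"
rows = [
    (f"abcd000{n}{SUFFIX}", lakehouse_id)
    for n, lakehouse_id in [
        (1, 2), (2, 3), (3, 4), (4, 5), (5, 7),
        (5, 6), (6, 8), (7, 9), (8, 12), (8, 11),
    ]
]


def row_number_per_partition(rows):
    """Hypothetical helper: number rows inside each account id,
    highest LakehouseId first (partitionBy + orderBy desc)."""
    # Sort by account id, then by LakehouseId descending within it.
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        # The counter restarts at 1 for every new account id.
        for row_num, (account_id, lakehouse_id) in enumerate(group, start=1):
            result.append((account_id, lakehouse_id, row_num))
    return result


def row_number_global(rows):
    """Hypothetical helper: one ranking over all rows, no partitioning."""
    ordered = sorted(rows, key=lambda r: -r[1])
    return [(a, l, i) for i, (a, l) in enumerate(ordered, start=1)]


numbered = row_number_per_partition(rows)
for row in numbered:
    print(row)
```

If a single ranking across all accounts was what you expected (only LakehouseId 12 getting row_num=1), that corresponds to row_number_global above. In PySpark this would presumably mean a window built with Window.orderBy(...) alone, without partitionBy, keeping in mind that Spark then has to move all rows into one partition.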
