Databricks Community

rocky5 · ‎04-06-2024

I wrote simple code:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, max
import pyspark.sql.functions as F

streaming_data = spark.read.table("x")
window = Window.partitionBy("BK_AccountApplicationId").orderBy(F.col("Onboarding_External_LakehouseId").desc())
test = streaming_data.select("BK_AccountApplicationId", "Onboarding_External_LakehouseId").distinct()
test1 = test.select("BK_AccountApplicationId", "Onboarding_External_LakehouseId",
                     F.row_number().over(window).alias("row_num")).show(20,truncate=False)

I am surprised when I can see below results:

+------------------------------------+-------------------------------+-------+
|BK_AccountApplicationId             |Onboarding_External_LakehouseId|row_num|
+------------------------------------+-------------------------------+-------+
|abcd0001-5775-4f93-a39a-eefb29cd8ffe|2                              |1      |
|abcd0002-5775-4f93-a39a-eefb29cd8ffe|3                              |1      |
|abcd0003-5775-4f93-a39a-eefb29cd8ffe|4                              |1      |
|abcd0004-5775-4f93-a39a-eefb29cd8ffe|5                              |1      |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7                              |1      |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6                              |2      |
|abcd0006-5775-4f93-a39a-eefb29cd8ffe|8                              |1      |
|abcd0007-5775-4f93-a39a-eefb29cd8ffe|9                              |1      |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|12                             |1      |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|11                             |2      |
+------------------------------------+-------------------------------+-------+

so, for one BK_AccountApplicationId you can see multiple LakehouseIDs - why multiple rows with LakehouseId lower than 12 has row_num=1 - does anyone is able to explain mi that?

ThomazRossito · ‎04-07-2024

Hi,

In my opinion the result is correct
What needs to be noted in the result is that it is sorted by the "Onboarding_External_LakehouseId" column so if there is "BK_AccountApplicationId" with the same code, it will be partitioned into 2 row_numbers

Just like in the example below:
Here there are 2 BK_AccountApplicationId, equal, then there are 2 row_number, the most recent (or greatest) row_number being. "Onboarding_External_LakehouseId" is equal to 7, which is why its row_number is 1

|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7 |1 |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6 |2 |

Att.
Thomaz Antonio Rossito Neto
Master Data Specialist - Data Architect | Data Engineer @ CI&T

View solution in original post

ThomazRossito · ‎04-07-2024

Hi,

In my opinion the result is correct
What needs to be noted in the result is that it is sorted by the "Onboarding_External_LakehouseId" column so if there is "BK_AccountApplicationId" with the same code, it will be partitioned into 2 row_numbers

Just like in the example below:
Here there are 2 BK_AccountApplicationId, equal, then there are 2 row_number, the most recent (or greatest) row_number being. "Onboarding_External_LakehouseId" is equal to 7, which is why its row_number is 1

|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7 |1 |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6 |2 |

Att.
Thomaz Antonio Rossito Neto
Master Data Specialist - Data Architect | Data Engineer @ CI&T