I wrote some simple code:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

streaming_data = spark.read.table("x")

# Rank rows within each BK_AccountApplicationId, highest LakehouseId first
window = Window.partitionBy("BK_AccountApplicationId").orderBy(F.col("Onboarding_External_LakehouseId").desc())

# Distinct (application id, lakehouse id) pairs
test = streaming_data.select("BK_AccountApplicationId", "Onboarding_External_LakehouseId").distinct()

test1 = test.select("BK_AccountApplicationId", "Onboarding_External_LakehouseId",
                    F.row_number().over(window).alias("row_num"))
test1.show(20, truncate=False)  # note: .show() returns None, so don't assign its result
I am surprised to see the results below:
+------------------------------------+-------------------------------+-------+
|BK_AccountApplicationId |Onboarding_External_LakehouseId|row_num|
+------------------------------------+-------------------------------+-------+
|abcd0001-5775-4f93-a39a-eefb29cd8ffe|2 |1 |
|abcd0002-5775-4f93-a39a-eefb29cd8ffe|3 |1 |
|abcd0003-5775-4f93-a39a-eefb29cd8ffe|4 |1 |
|abcd0004-5775-4f93-a39a-eefb29cd8ffe|5 |1 |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|7 |1 |
|abcd0005-5775-4f93-a39a-eefb29cd8ffe|6 |2 |
|abcd0006-5775-4f93-a39a-eefb29cd8ffe|8 |1 |
|abcd0007-5775-4f93-a39a-eefb29cd8ffe|9 |1 |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|12 |1 |
|abcd0008-5775-4f93-a39a-eefb29cd8ffe|11 |2 |
+------------------------------------+-------------------------------+-------+
So for one BK_AccountApplicationId you can see multiple LakehouseIds. Why do multiple rows with a LakehouseId lower than 12 get row_num = 1? Is anyone able to explain that to me?
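What I expected was a single ranking across all rows, ordered by Onboarding_External_LakehouseId descending, so that only the row with LakehouseId 12 would get row_num = 1. A minimal sketch of that version, reusing the DataFrame and imports from the snippet above (my assumption of what a global ranking would look like, using an unpartitioned window):

global_window = Window.orderBy(F.col("Onboarding_External_LakehouseId").desc())  # no partitionBy: one ranking over all rows
test.select(
    "BK_AccountApplicationId",
    "Onboarding_External_LakehouseId",
    F.row_number().over(global_window).alias("row_num"),
).show(20, truncate=False)

Is the difference just that partitionBy("BK_AccountApplicationId") makes row_number restart at 1 for every application id, rather than numbering all rows once?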