cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Row_Num function in spark-sql

Hemant
Valued Contributor II

I have a doubt row_num with order by in spark-sql gives different result(non-deterministic output) every time i execute it?

It's due to parallelism in spark ?​

Any approach how to takle it?

I order by with a date column and a integer column and take minimum row_num in the further down stream process.

Hemant Soni
2 REPLIES 2

shan_chandra
Esteemed Contributor
Esteemed Contributor

@Hemant Kumar​ - could you please try to coalesce to a single partition in order to generate continuous output?

Tharun-Kumar
Honored Contributor II
Honored Contributor II

@Hemant 

If the order by clause provided yields a unique result, then we would get deterministic output. 

For ex:

If we create a rowID for this dataset, with CustomerID used in OrderBy clause, then depending upon the runtime, we may get non-deterministic output for CustomerID 100, because any record may get picked up for Row Number 1.

CustomerIDSalesIDOrderDateTime
1009002023-07-08 10:00:00
1009012023-07-09 10:00:00
1019022023-07-09 10:00:00

In cases like this, it is recommended to use a column to define the ordering in a deterministic way. The usage of OrderDateTime column in the OrderBy clause will help us to achieve a deterministic output.

The same is also documented in the official documentation - https://docs.databricks.com/sql/language-manual/functions/row_number.html#:~:text=If%20the%20order%2....