Databricks Community

Hemant · ‎08-18-2022

I have a doubt row_num with order by in spark-sql gives different result(non-deterministic output) every time i execute it?

It's due to parallelism in spark ?

Any approach how to takle it?

I order by with a date column and a integer column and take minimum row_num in the further down stream process.

Hemant Soni

shan_chandra · ‎08-19-2022

@Hemant Kumar - could you please try to coalesce to a single partition in order to generate continuous output?

Tharun-Kumar · ‎07-12-2023

@Hemant

If the order by clause provided yields a unique result, then we would get deterministic output.

For ex:

If we create a rowID for this dataset, with CustomerID used in OrderBy clause, then depending upon the runtime, we may get non-deterministic output for CustomerID 100, because any record may get picked up for Row Number 1.

CustomerID	SalesID	OrderDateTime
100	900	2023-07-08 10:00:00
100	901	2023-07-09 10:00:00
101	902	2023-07-09 10:00:00

In cases like this, it is recommended to use a column to define the ordering in a deterministic way. The usage of OrderDateTime column in the OrderBy clause will help us to achieve a deterministic output.

The same is also documented in the official documentation - https://docs.databricks.com/sql/language-manual/functions/row_number.html#:~:text=If%20the%20order%2....