<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Row_Num function in spark-sql in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34260#M25030</link>
    <description>&lt;P&gt;I have a doubt: does row_number with ORDER BY in Spark SQL give a different result (non-deterministic output) every time I execute it?&lt;/P&gt;&lt;P&gt;Is it due to parallelism in Spark?&lt;/P&gt;&lt;P&gt;Is there any approach to tackle it?&lt;/P&gt;&lt;P&gt;I order by a date column and an integer column, and take the minimum row_number in the downstream process.&lt;/P&gt;</description>
    <pubDate>Thu, 18 Aug 2022 18:25:47 GMT</pubDate>
    <dc:creator>Hemant</dc:creator>
    <dc:date>2022-08-18T18:25:47Z</dc:date>
    <item>
      <title>Row_Num function in spark-sql</title>
      <link>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34260#M25030</link>
      <description>&lt;P&gt;I have a doubt: does row_number with ORDER BY in Spark SQL give a different result (non-deterministic output) every time I execute it?&lt;/P&gt;&lt;P&gt;Is it due to parallelism in Spark?&lt;/P&gt;&lt;P&gt;Is there any approach to tackle it?&lt;/P&gt;&lt;P&gt;I order by a date column and an integer column, and take the minimum row_number in the downstream process.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Aug 2022 18:25:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34260#M25030</guid>
      <dc:creator>Hemant</dc:creator>
      <dc:date>2022-08-18T18:25:47Z</dc:date>
    </item>
    <item>
      <title>Re: Row_Num function in spark-sql</title>
      <link>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34261#M25031</link>
      <description>&lt;P&gt;@Hemant Kumar&amp;nbsp;- could you please try to coalesce to a single partition in order to generate a continuous, repeatable output?&lt;/P&gt;</description>
      <pubDate>Fri, 19 Aug 2022 22:58:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/34261#M25031</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2022-08-19T22:58:30Z</dc:date>
    </item>
    <item>
      <title>Re: Row_Num function in spark-sql</title>
      <link>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/37504#M26375</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/21470"&gt;@Hemant&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If the ORDER BY clause yields a unique ordering, the output is deterministic.&lt;/P&gt;&lt;P&gt;For example:&lt;/P&gt;&lt;P&gt;If we create a row ID for the dataset below with only CustomerID in the ORDER BY clause, then depending on the run, we may get non-deterministic output for CustomerID 100, because either of its two records may be picked as row number 1.&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="33.333333333333336%"&gt;CustomerID&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;SalesID&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;OrderDateTime&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="33.333333333333336%"&gt;100&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;900&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;2023-07-08 10:00:00&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="33.333333333333336%"&gt;100&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;901&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;2023-07-09 10:00:00&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;101&lt;/TD&gt;&lt;TD&gt;902&lt;/TD&gt;&lt;TD&gt;2023-07-09 10:00:00&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;In cases like this, it is recommended to use a column that defines the ordering deterministically. Using the OrderDateTime column in the ORDER BY clause will help achieve a deterministic output.&lt;/P&gt;&lt;P&gt;This is also noted in the official documentation:&amp;nbsp;&lt;A href="https://docs.databricks.com/sql/language-manual/functions/row_number.html#:~:text=If%20the%20order%20is%20not%20unique%2C%20the%20result%20is%20non%2Ddeterministic" target="_blank"&gt;https://docs.databricks.com/sql/language-manual/functions/row_number.html#:~:text=If%20the%20order%20is%20not%20unique%2C%20the%20result%20is%20non%2Ddeterministic&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Jul 2023 12:25:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/row-num-function-in-spark-sql/m-p/37504#M26375</guid>
      <dc:creator>Tharun-Kumar</dc:creator>
      <dc:date>2023-07-12T12:25:09Z</dc:date>
    </item>
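    <item>
      <title>Editor's note: deterministic row_number sketch</title>
      <description>&lt;P&gt;A minimal Spark SQL sketch of the fix discussed in this thread, assuming a hypothetical table named sales with the columns from the example above. Appending the unique SalesID column as a tie-breaker makes the ordering, and therefore the row_number output, deterministic across runs:&lt;/P&gt;&lt;PRE&gt;-- SalesID is unique, so (OrderDateTime, SalesID) is a total order
-- within each CustomerID partition and row_number is repeatable.
SELECT CustomerID,
       SalesID,
       OrderDateTime,
       ROW_NUMBER() OVER (
         PARTITION BY CustomerID
         ORDER BY OrderDateTime, SalesID
       ) AS rn
FROM sales;&lt;/PRE&gt;&lt;P&gt;Downstream, filtering on rn = 1 then always selects the same row per customer.&lt;/P&gt;</description>
    </item>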
  </channel>
</rss>

