row_number function in Spark SQL
08-18-2022 11:25 AM
I have a doubt: row_number with ORDER BY in Spark SQL gives a different result (non-deterministic output) every time I execute it.
Is this due to parallelism in Spark?
Is there any approach to tackle it?
I order by a date column and an integer column, and take the minimum row_num in the further downstream process.
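For context, here is a minimal sketch of the pattern described above (all column names are hypothetical). Two rows that tie on the ORDER BY columns leave row_number() free to number them either way, so reruns can disagree:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Two rows that tie on both ordering columns (hypothetical names).
df = spark.createDataFrame(
    [(100, 900, "2023-07-08"), (100, 901, "2023-07-08")],
    ["customer_id", "sales_id", "order_date"],
)

# ORDER BY a date column and an integer column, as in the question.
# Both rows tie on (order_date, customer_id), so row_number() may
# assign 1 to either of them on any given run.
w = Window.orderBy("order_date", "customer_id")
ranked = df.withColumn("row_num", F.row_number().over(w))

# Downstream step: take the row with the minimum row_num. Which of
# the two tied rows this is can change between executions.
ranked.filter(F.col("row_num") == 1).show()
```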
08-19-2022 03:58 PM
@Hemant Kumar - could you please try to coalesce to a single partition in order to generate consistent output?
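A sketch of that suggestion, assuming an existing DataFrame `df` with the columns from the question (hypothetical names). Note that coalescing removes parallel task boundaries, but rows that tie on the ORDER BY columns can still be numbered differently across runs; a unique tie-breaker column is the more robust fix:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Collapse to a single partition before applying the window function
# (assumes `df` and the column names from the question; hypothetical).
single = df.coalesce(1)

w = Window.orderBy("order_date", "customer_id")
numbered = single.withColumn("row_num", F.row_number().over(w))
```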
07-12-2023 05:25 AM
If the ORDER BY clause provided yields a unique ordering, then we get deterministic output.
For example:
If we create a rowID for this dataset with only CustomerID in the ORDER BY clause, then, depending on the runtime, we may get non-deterministic output for CustomerID 100, because either of its records may be picked for row number 1.
| CustomerID | SalesID | OrderDateTime |
| --- | --- | --- |
| 100 | 900 | 2023-07-08 10:00:00 |
| 100 | 901 | 2023-07-09 10:00:00 |
| 101 | 902 | 2023-07-09 10:00:00 |
In cases like this, it is recommended to use a column that defines the ordering in a deterministic way. Adding the OrderDateTime column to the ORDER BY clause will help us achieve a deterministic output.
This is also noted in the official documentation: https://docs.databricks.com/sql/language-manual/functions/row_number.html
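A sketch of the recommendation, using the sample rows from the table above (the schema is inferred from the example, not from a real dataset):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (100, 900, "2023-07-08 10:00:00"),
        (100, 901, "2023-07-09 10:00:00"),
        (101, 902, "2023-07-09 10:00:00"),
    ],
    ["CustomerID", "SalesID", "OrderDateTime"],
)

# Ordering by CustomerID alone leaves a tie for CustomerID 100. Adding
# OrderDateTime breaks the tie, so the numbering is deterministic.
w = Window.orderBy("CustomerID", "OrderDateTime")
df.withColumn("rowID", F.row_number().over(w)).show()
```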

