Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @IM_01,

In the relational model, a table can be seen as a collection of rows, which are defined by a schema. It is important to note that collections or sets have no inherent order, meaning the rows are not arranged in a specific sequence. To obtain an ordered list for a particular query, SQL offers the ORDER BY operation, which is an extra step that converts the unordered collection into a sorted format.

On a distributed engine like Spark/Databricks....Data is spread across many files and partitions. Different tasks read different chunks in parallel. The engine can read partitions in any order, shuffle data for joins/aggregations, and repartition files on OPTIMIZE/compaction.

If Databricks had to preserve a specific global row order for every table, then it would have to serialise a lot of work (kill parallelism). The optimiser couldn’t freely reorder operations, and many physical optimisations (shuffles, repartitions, file compaction) would become impossible or extremely expensive.

This is no different to other relational databases, by the way. I come from an Oracle background. You may find this and this interesting.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***