Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @IM_01,

You’re not missing anything fundamental. You’re just looking at two different levels:

df.orderBy(col) is just a transformation that adds a global sort node on top of the plan. When you call display() (or show()), Spark plans a job whose final stage is a shuffle + sort that enforces the requested ordering across all partitions. The driver/notebook UI consumes the final, globally sorted result set and displays it to you.

Semantically, this is:

SELECT *
FROM df
ORDER BY col;
This is an outermost ORDER BY on a SELECT, so the engine is required to return rows in that order.

Parallelism isn’t killed. Spark still reads partitions in parallel, shuffles by sort key, sorts within partitions, and then streams the sorted partitions in key order to the client. You pay the cost of a distributed sort, but you still get a globally ordered result.

However, when you do something like the below... 

CREATE OR REPLACE MATERIALIZED VIEW my_mv AS
SELECT ...
FROM v1 JOIN v2 ...
ORDER BY some_col;
You’re not saying "always return results sorted". You’re just saying, "Define a table-like object whose contents come from this query."

In relational theory, tables (and MVs) are unordered collections. Physical file order is not part of the contract. The engine is free to drop the ORDER BY when planning the MV refresh, incrementally update only changed partitions, recluster or compact files later, and read/write partitions in any order.

Especially with incremental refresh, it’s basically impossible to maintain a single global sort order cheaply every time new data arrives. Enzyme’s job is to keep the MV logically correct, not physically sorted. So even if the initial full refresh happens to write files in sorted order, later different refresh strategies (ROW_BASED, PARTITION_OVERWRITE, APPEND_ONLY, etc.), compaction, and different execution plans can and will scramble physical order. The contract for the MV is SELECT * FROM my_mv returns the right rows, but in unspecified order unless you ORDER BY in that SELECT.

Thats why you should always sort at query time, never rely on storage order, whether it’s a Delta table, MV, or view.

Hope this helps.

 

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***