Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

OrderBy is not sorting the results

IM_01
Contributor II

Hi,

I am currently using Lakeflow SDP.

First I create two views, then join them to create a materialized view, using ORDER BY in the materialized view's defining query, but the results are not sorted.
Does ORDER BY not work on a materialized view? Could anyone please help with this 🙂

11 REPLIES

Ashwin_DSA
Databricks Employee

Hi @IM_01,

Unfortunately, no. An ORDER BY in the definition of a Lakeflow SDP materialized view does not guarantee that the stored rows will be ordered when you later query the MV. Materialized views are stored like regular Unity Catalog tables: they cache the result set of your query, but the engine is free to physically store and return rows in any order unless you use ORDER BY in the SELECT that reads from the MV. So, you'll have to use ORDER BY when querying the MV.
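As a toy illustration of that contract (plain Python, not Databricks or Spark APIs; all names here are made up): a sort applied when the MV is refreshed can be undone by later physical rewrites, while a sort in the reading query always holds.

```python
# Hypothetical sketch: a materialized view caches the *rows* of its
# defining query, not their order.

def refresh_mv(defining_query_rows):
    # The ORDER BY in the defining query may be applied at refresh time...
    materialized = sorted(defining_query_rows)
    # ...but later maintenance (compaction, incremental refresh) is free
    # to rewrite files, scrambling the physical order.
    compacted = materialized[::2] + materialized[1::2]
    return compacted

def query_mv(mv_rows, order_by=False):
    # Only an ORDER BY in the reading SELECT guarantees sorted output.
    return sorted(mv_rows) if order_by else list(mv_rows)

mv = refresh_mv([3, 1, 2])
print(query_mv(mv))                  # physical order: [1, 3, 2], not sorted
print(query_mv(mv, order_by=True))   # [1, 2, 3]
```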

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

 

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor II

Hi @Ashwin_DSA 

So the only approach is to create a view or a separate Delta table (after using ORDER BY) to get the sorted results?
Could you please confirm 🙂

Ashwin_DSA
Databricks Employee

Hi @IM_01,

Not exactly... You don't need another view/table to "bake in" the order, and even if you did, it still wouldn't guarantee ordered results.

As I mentioned in my earlier reply, tables and materialized views never guarantee row order. ORDER BY only guarantees order for the specific SELECT that contains it. SELECT * FROM some_mv; has no guaranteed order, no matter how the MV was created. The same applies to plain Delta tables, and wrapping the MV in another view doesn't fix it either.

For example, if you did something like the below...

CREATE VIEW sorted_v AS
SELECT * FROM my_mv ORDER BY some_col;

-- Later, querying the view:
SELECT * FROM sorted_v;   -- order NOT guaranteed

The outer query can reorder rows unless it has its own ORDER BY. So the correct pattern is to always order at query time:

SELECT *
FROM my_mv
ORDER BY some_col;

The reliable approach is not an extra MV/table. Itโ€™s ordering in every SELECT that needs sorting.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor II

Hi @Ashwin_DSA 

Even tables do not guarantee ordering? Could you please explain the reason, just curious 🙂 I was under the impression that using Delta tables would solve the problem.
And since a view is just a wrapper around a SELECT, I thought it would work.

Ashwin_DSA
Databricks Employee

Hi @IM_01,

In the relational model, a table can be seen as a collection of rows, which are defined by a schema. It is important to note that collections or sets have no inherent order, meaning the rows are not arranged in a specific sequence. To obtain an ordered list for a particular query, SQL offers the ORDER BY operation, which is an extra step that converts the unordered collection into a sorted format.

On a distributed engine like Spark/Databricks, data is spread across many files and partitions, and different tasks read different chunks in parallel. The engine can read partitions in any order, shuffle data for joins/aggregations, and repartition files on OPTIMIZE/compaction.

If Databricks had to preserve a specific global row order for every table, it would have to serialise a lot of work (killing parallelism). The optimiser couldn't freely reorder operations, and many physical optimisations (shuffles, repartitions, file compaction) would become impossible or extremely expensive.
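A toy sketch of that trade-off in plain Python (not Spark; the partitioning here is made up): parallel tasks can finish in any order, so the unordered union is cheap and correct, and a global order only exists after an explicit final sort, which is exactly the extra work ORDER BY asks for.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

partitions = [[9, 4], [7, 1], [8, 2]]   # hypothetical data chunks

def scan(part):
    # Per-partition work runs independently, in parallel.
    return [x * 10 for x in part]

results = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(scan, p) for p in partitions]
    for fut in as_completed(futures):    # completion order is nondeterministic
        results.extend(fut.result())

# 'results' is the correct set of rows, but in arbitrary order.
# Order is only a property of this final, explicit sort step.
ordered = sorted(results)
print(ordered)   # [10, 20, 40, 70, 80, 90]
```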

This is no different to other relational databases, by the way. I come from an Oracle background. You may find this and this interesting.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

-werners-
Esteemed Contributor III

If you are interested in this kind of topic, Designing Data Intensive Applications by Martin Kleppmann is a really good read (the 2nd edition has been available for a while)!

@-werners- Thanks for sharing, it seems interesting, I will go through the book.
@Ashwin_DSA I still have some confusion, sorry 🙂 Normally when we do df.orderBy().display(), it works, right? (As display() is an action, it also performs all the operations defined prior to orderBy.)

Is it that the query planner breaks down the operations and the engine performs them over each partition, so if orderBy had to be maintained, the optimizer would have to perform it for each partition, killing parallelism since the data is distributed? That part I was able to follow. My only confusion is that when we do df.orderBy().display(), it does work, so maybe there is some point I am missing, sorry...

Ashwin_DSA
Databricks Employee

Hi @IM_01,

You're not missing anything fundamental. You're just looking at two different levels:

df.orderBy(col) is just a transformation that adds a global sort node on top of the plan. When you call display() (or show()), Spark plans a job whose final stage is a shuffle + sort that enforces the requested ordering across all partitions. The driver/notebook UI consumes the final, globally sorted result set and displays it to you.

Semantically, this is:

SELECT *
FROM df
ORDER BY col;

This is an outermost ORDER BY on a SELECT, so the engine is required to return rows in that order.

Parallelism isn't killed. Spark still reads partitions in parallel, shuffles by sort key, sorts within partitions, and then streams the sorted partitions in key order to the client. You pay the cost of a distributed sort, but you still get a globally ordered result.
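That shuffle-then-merge pipeline can be sketched in plain Python (a simplified model, not Spark internals; the range boundaries are made up): range-partition rows by the sort key, sort each partition independently (in parallel in a real engine), then stream the sorted partitions in key order.

```python
import heapq

rows = [5, 9, 1, 7, 3, 8, 2, 6, 4]

# 1. "Shuffle": range-partition by sort key so each partition owns a key range.
boundaries = [4, 7]   # hypothetical range boundaries
partitions = [[], [], []]
for r in rows:
    idx = sum(r > b for b in boundaries)
    partitions[idx].append(r)

# 2. Sort within each partition (independent work, hence parallelizable).
sorted_parts = [sorted(p) for p in partitions]

# 3. Stream partitions in key order; heapq.merge yields a global order
#    without one worker ever materializing the whole sorted dataset.
globally_sorted = list(heapq.merge(*sorted_parts))
print(globally_sorted)   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```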

However, when you do something like the below... 

CREATE OR REPLACE MATERIALIZED VIEW my_mv AS
SELECT ...
FROM v1 JOIN v2 ...
ORDER BY some_col;

You're not saying "always return results sorted". You're just saying, "Define a table-like object whose contents come from this query."

In relational theory, tables (and MVs) are unordered collections. Physical file order is not part of the contract. The engine is free to drop the ORDER BY when planning the MV refresh, incrementally update only changed partitions, recluster or compact files later, and read/write partitions in any order.

Especially with incremental refresh, it's basically impossible to maintain a single global sort order cheaply every time new data arrives. Enzyme's job is to keep the MV logically correct, not physically sorted. So even if the initial full refresh happens to write files in sorted order, later refreshes with different strategies (ROW_BASED, PARTITION_OVERWRITE, APPEND_ONLY, etc.), compaction, and different execution plans can and will scramble the physical order. The contract for the MV is that SELECT * FROM my_mv returns the right rows, but in unspecified order unless you ORDER BY in that SELECT.

That's why you should always sort at query time and never rely on storage order, whether it's a Delta table, an MV, or a view.

Hope this helps.

 

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor II

@Ashwin_DSA Thank you so much for the explanation 🙂
I thought only row-based and append-only operations were possible in incremental refresh (based on row tracking); I was not aware that even a specific partition can be overwritten.

Ashwin_DSA
Databricks Employee

Hi @IM_01,

Yes, Databricks MVs can do more than just row-based and append-only incremental refresh. 😊

PARTITION_OVERWRITE is still incremental in the sense that Enzyme figures out which partitions changed since the last refresh, rebuilds just those partitions and overwrites them in the MV, and leaves unchanged partitions as-is, so you avoid a full recompute of the entire MV.
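A toy model of that refresh path in plain Python (not Enzyme's actual implementation; all names here are hypothetical): only partitions whose source data changed are recomputed and rewritten.

```python
def transform(r):
    return r * 2   # stand-in for the MV's defining query

def refresh_partition_overwrite(mv, source, changed_keys):
    for key in changed_keys:
        # Rebuild just the changed partition from source...
        mv[key] = [transform(r) for r in source[key]]
    # ...unchanged partitions are left as-is: no full recompute.
    return mv

source = {"2024-01": [1, 2], "2024-02": [3], "2024-03": [4]}
mv = {"2024-01": [2, 4], "2024-02": [6], "2024-03": [8]}

source["2024-02"].append(5)   # new data lands in one partition
mv = refresh_partition_overwrite(mv, source, changed_keys=["2024-02"])
print(mv["2024-02"])   # [6, 10] -- only this partition was rebuilt
```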

You can see which technique was used for a given refresh via the event log. If you see ... executed as PARTITION_OVERWRITE, that's the partition-level incremental path in action.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor II

Thanks Ashwin 🙂