Data Engineering

Does Z-ordering speed up reading of a single file?

Erik
Valued Contributor II

Situation: we have one partition per date, and it just so happens that each partition ends up (after OPTIMIZE) as *a single* 128 MB file. We partition on date and Z-order on userid, and our query is something like "find the max value of column A where userid = X and date >= somedate".

Does Z-ordering help in any way in this scenario? It is clear that we will have to read every partition from $somedate onward, but does the Z-ordering on userid somehow help Spark when reading inside each of those partitions (remember that each partition is a single file), or do we have to read *and scan* all 128 MB of each of the remaining partitions even when we Z-order?

Accepted Solutions

-werners-
Esteemed Contributor III

Z-ordering makes sure that, when you need to read multiple files, related data is co-located in the same set of files.

For a single file this does not matter, as a single file is always local to itself.

If you are certain that your Spark program will only read a single file, you do not need Z-ordering.

But your Delta Lake table might also be read by another program that does not use the partition filter, or you might have multiple files per partition. In those cases Z-ordering becomes interesting.

Z-ordering and partitioning are complementary techniques.

Z-ordering is especially interesting for columns on which you cannot or do not want to partition (high cardinality).
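To make the file-skipping argument above concrete, here is a minimal pure-Python sketch. It is a toy model, not Delta Lake's implementation: `FileStats`, `split_into_files`, and `files_to_read` are hypothetical names, and the only assumption carried over from the thread is that skipping decisions are made per file from min/max statistics on userid.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Min/max userid for one data file (hypothetical stand-in for per-file statistics)."""
    name: str
    min_userid: int
    max_userid: int


def split_into_files(userids, n_files, prefix):
    """Sort userids and cut them into n_files contiguous chunks, recording min/max per chunk."""
    ordered = sorted(userids)
    size = -(-len(ordered) // n_files)  # ceiling division
    chunks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    return [FileStats(f"{prefix}-{i}", c[0], c[-1]) for i, c in enumerate(chunks)]


def files_to_read(files, userid):
    """Return the files whose [min, max] userid range could contain `userid`."""
    return [f for f in files if f.min_userid <= userid <= f.max_userid]


userids = list(range(1, 1001))  # 1000 users inside one date partition

single = split_into_files(userids, 1, "single")  # one file per partition
multi = split_into_files(userids, 4, "multi")    # four files per partition

print(len(files_to_read(single, 42)), "of", len(single), "files read")
print(len(files_to_read(multi, 42)), "of", len(multi), "files read")
```

In this model the single-file partition always has to be read in full, because its min/max range covers every userid. With four files clustered on userid, three of them can be skipped for the same filter.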


6 REPLIES

Kaniz_Fatma
Community Manager

Hi @Erik! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise I will get back to you soon. Thanks.

Kaniz_Fatma
Community Manager

Hi @Erik,

It would be great if you could share the CSV file. However:

This page gives guidelines on which column can be used for ZORDER BY.

This page gives guidelines on how to choose the right partition column.

Erik
Valued Contributor II

@Kaniz Fatma I have read the documentation. The question is not about general guidelines regarding partitions and Z-ordering; it is very specifically about the (potential) benefit of Z-ordering when reading single files. To rephrase: is the only advantage of Z-ordering that it allows whole files to be skipped, or is there also some benefit after a file has been selected to be read? Does it allow faster searching inside the selected files, or maybe reading only chunks of the files?

Hubert-Dudek
Esteemed Contributor III

ZORDER BY

Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read. You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column.

This is from https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html

So for Delta files, once the partition disappears, Z-ordering becomes really important, as it will handle your query effectively. You would need:

OPTIMIZE data ZORDER BY (userid, date)
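The documentation quote above says that locality drops with each additional ZORDER BY column. A small pure-Python sketch can illustrate why. It uses classic Morton (bit-interleaving) order on a 4x4 grid as a stand-in; this is an assumption for illustration, not a description of how Delta Lake computes its clustering, and `interleave_bits`, `file_ranges`, and `count_matching` are hypothetical helper names.

```python
from itertools import product


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton code: bit i of x goes to position 2*i, bit i of y to position 2*i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def file_ranges(points, per_file):
    """Cut an ordered list of (x, y) points into files and record (min, max) per column."""
    files = []
    for i in range(0, len(points), per_file):
        chunk = points[i:i + per_file]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        files.append(((min(xs), max(xs)), (min(ys), max(ys))))
    return files


def count_matching(files, axis, value):
    """Count files whose min/max range on the given column (0 = x, 1 = y) contains value."""
    return sum(1 for f in files if f[axis][0] <= value <= f[axis][1])


points = list(product(range(4), range(4)))  # 16 rows, two columns x and y

# Clustered on both columns via Morton order, 4 rows per file.
z_files = file_ranges(sorted(points, key=lambda p: interleave_bits(*p)), 4)
# Sorted on x first (a plain linear sort), 4 rows per file.
linear_files = file_ranges(sorted(points), 4)

print("z-order: x=0 ->", count_matching(z_files, 0, 0), "files,",
      "y=0 ->", count_matching(z_files, 1, 0), "files")
print("linear : x=0 ->", count_matching(linear_files, 0, 0), "files,",
      "y=0 ->", count_matching(linear_files, 1, 0), "files")
```

In this toy setup, clustering on both columns lets a filter on either column skip half of the files, while sorting on x alone skips more for x filters and nothing for y filters. Spreading the clustering over more columns gives each column a weaker guarantee, which is the trade-off the quote describes.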

Erik
Valued Contributor II

@Hubert Dudek I don't know what you mean by "when the partition disappears". I clearly asked this question in a confusing way, but hopefully my answer to @Kaniz Fatma helped clarify.

