topic Query related to Z ordering in Data Engineering

Query related to Z ordering

kasiviss42 — Mon, 03 Feb 2025 10:52:10 GMT

I have join on two large tables.

If i apply Z ordering on 3 columns for both the tables ,

I am joining two tables on the basis on the same 3 columns used for Z ordering ,

Will i get any benefit of Z ordering on performance when i use Joins here .

So as per my current knowledge Z ordering is specifically useful for faster query data retrieval i.e. select *from table1 where zordercolumn1 =X

Does Z ordering also helps Joins ? if not how to improve join performance here

Re: Query related to Z ordering

kasiviss42 — Mon, 03 Feb 2025 10:57:32 GMT

I have asked this query because w.r.t Predicate query push down to storage .

So here as part of joins . Data needs to be loaded into memory first and then join is performed . So How does Z order help here if it can't skip the data being fetched from storage .

Re: Query related to Z ordering

saurabh18cs — Tue, 04 Feb 2025 08:30:11 GMT

hi @kasiviss42 indeed z-order helps in file level data skipping . Now answer is 2 fold.

1) if you're joining 2 dataframes then you can make use of data skipping a.k.a z-ordering by performing filters first on those dataframes which will be then used for joins. (predicate push down). your dataframes are already lighter and in-memory is optimized.

2) you can also specify merge predicate during your join if you're using delta native merge . This merge predicate consists of partition pruning predicate + data skipping predicate + join predicate. So the combination of predicate will be applied to filter down to relevant partitions , relevant files if they are z-ordered.

example:

join_predicate = " AND ".join([f"s.{i} <=> t.{i}" for i in record_keys])

merge_predicate = (

f"({target_prune_predicate}) AND ({time_filter_skipping_predicate}) AND ({join_predicate})"

if target_prune != ""

else join_predicate

)

target.alias("t")

.merge(source_df.alias("s"), merge_predicate)

.whenMatchedUpdate(set={f"t.`{i}`": f"s.`{i}`" for i in target_columns})

.whenNotMatchedInsert(values={i: f"s.`{i}`" for i in target_columns})

.execute()

also , good to know this:

https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html