Query related to Z ordering

kasiviss42
New Contributor III

I have join on two large tables.

If i apply Z ordering on 3 columns for both the tables ,

I am joining two tables on the basis on the same 3 columns used for Z ordering ,

Will i get any benefit of Z ordering on performance when i use Joins here .

So as per my current knowledge Z ordering is specifically useful for faster query data retrieval i.e. select *from table1 where zordercolumn1 =X

Does Z ordering also helps Joins ? if not how to improve join performance here 

kasiviss42
New Contributor III

I have asked this query because w.r.t Predicate query push down to storage .

So here as part of joins . Data needs to be loaded into memory first and then join is performed . So How does Z order help here if it can't skip the data being fetched from storage .

saurabh18cs
Honored Contributor III

hi @kasiviss42 indeed z-order helps in file level data skipping . Now answer is 2 fold.

1) if you're joining 2 dataframes then you can make use of data skipping a.k.a z-ordering by performing filters first on those dataframes which will be then used for joins. (predicate push down). your dataframes are already lighter and in-memory is optimized.

2) you can also specify merge predicate during your join if you're using delta native merge . This merge predicate consists of partition pruning predicate + data skipping predicate + join predicate. So the combination of predicate will be applied to filter down to relevant partitions , relevant files if they are z-ordered.

example:

 

        join_predicate = " AND ".join([f"s.{i} <=> t.{i}" for i in record_keys])
        merge_predicate = (
            f"({target_prune_predicate}) AND ({time_filter_skipping_predicate}) AND ({join_predicate})"
            if target_prune != ""
            else join_predicate
        )
        target.alias("t")
            .merge(source_df.alias("s"), merge_predicate)
            .whenMatchedUpdate(set={f"t.`{i}`": f"s.`{i}`" for i in target_columns})
            .whenNotMatchedInsert(values={i: f"s.`{i}`" for i in target_columns})
            .execute()
 
also , good to know this: