cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Query related to Z ordering

kasiviss42
New Contributor III

I have join on two large tables.

If i apply Z ordering on 3 columns for both the tables ,

I am joining two tables on the basis on the same 3 columns used for Z ordering ,

Will i get any benefit of Z ordering on performance when i use Joins here .

So as per my current knowledge Z ordering is specifically useful for faster query data retrieval i.e. select *from table1 where zordercolumn1 =X

Does Z ordering also helps Joins ? if not how to improve join performance here 

2 REPLIES 2

kasiviss42
New Contributor III

I have asked this query because w.r.t Predicate query push down to storage .

So here as part of joins . Data needs to be loaded into memory first and then join is performed . So How does Z order help here if it can't skip the data being fetched from storage .

saurabh18cs
Valued Contributor III

hi @kasiviss42 indeed z-order helps in file level data skipping . Now answer is 2 fold.

1) if you're joining 2 dataframes then you can make use of data skipping a.k.a z-ordering by performing filters first on those dataframes which will be then used for joins. (predicate push down). your dataframes are already lighter and in-memory is optimized.

2) you can also specify merge predicate during your join if you're using delta native merge . This merge predicate consists of partition pruning predicate + data skipping predicate + join predicate. So the combination of predicate will be applied to filter down to relevant partitions , relevant files if they are z-ordered.

example:

 

        join_predicate = " AND ".join([f"s.{i} <=> t.{i}" for i in record_keys])
        merge_predicate = (
            f"({target_prune_predicate}) AND ({time_filter_skipping_predicate}) AND ({join_predicate})"
            if target_prune != ""
            else join_predicate
        )
        target.alias("t")
            .merge(source_df.alias("s"), merge_predicate)
            .whenMatchedUpdate(set={f"t.`{i}`": f"s.`{i}`" for i in target_columns})
            .whenNotMatchedInsert(values={i: f"s.`{i}`" for i in target_columns})
            .execute()
 
also , good to know this:

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now