cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
AndreyMirskiy
Databricks Employee
Databricks Employee

Authors: Andrey Mirskiy (@AndreyMirskiy) and Marco Scagliola (@MarcoScagliola)

 

Introduction

Welcome to the fifth part of our blog series on 'Why Databricks SQL Serverless is the best fit for BI workloads'.

In the previous blog posts we have covered the following topics:

In this blog post, we will discuss query optimization capability within Databricks SQL which leverages primary key constraints to improve query performance by eliminating unnecessary operations, such as DISTINCT aggregations and unnecessary joins.

This blog explores how to implement and benefit from this optimisation technique, providing practical examples demonstrating how this capability improves query efficiency in Databricks SQL.

 

Query optimization using primary key constraints - Overview

Databricks SQL Warehouse allows users to specify informational PK and FK constraints. With the introduction of this new query optimization technique, users are now able to specify Primary Key (PK) constraints with the RELY option, allowing the Databricks query optimizer to utilize these constraints to optimize query execution plans and eliminate unnecessary operations. This technique can improve query performance, particularly in BI workloads..

The specification of primary keys with the RELY option in table alteration statements is shown below in Code 1. 

  • Code 1: Creating primary key (PK) constraint with RELY option.
    USE CATALOG catalo_name;
    ALTER TABLE schema_name.table_name ADD PRIMARY KEY (column_name) RELY;

Currently Databricks SQL leverages this information to optimize query execution in the following scenarios:

  1. A DISTINCT operation on a primary key column can be skipped. Because of the RELY option the engine knows that all values in that column are unique, therefore no need to perform additional operation.
  2. When using a star-schema like model a fact table can be joined with multiple dimension tables. When a query uses LEFT JOIN on the primary key column with the RELY option the engine may exclude that join operation from actual execution because this operation neither increases nor decreases the resultset.

These optimizations lead to more efficient query execution, making Databricks SQL faster and more effective.

 

Demo Scenario

In this demo scenario we are creating two synthetic environments, in order to create an A/B test on tables without any Primary Key (PK) defined and the same tables with Primary Key (PK)  defined with the RELY option.

Preparation

First, we created test tables by replicating some of the tables from the tpch schema contained within samples catalog.

  • Code 2: Create a Catalog and Schema in order to run the test.
    CREATE CATALOG IF NOT EXISTS join_optimization;
    CREATE SCHEMA IF NOT EXISTS join_optimization.tpch;
  • Code 3: Create the test tables as CTAS
    CREATE OR REPLACE TABLE lineitem AS SELECT * FROM samples.tpch.lineitem;
    CREATE OR REPLACE TABLE orders AS SELECT * FROM samples.tpch.orders;
    CREATE OR REPLACE TABLE part AS SELECT * FROM samples.tpch.part;
    CREATE OR REPLACE TABLE supplier AS SELECT * FROM samples.tpch.supplier;
    CREATE OR REPLACE TABLE customer AS SELECT * FROM samples.tpch.customer;
    CREATE OR REPLACE TABLE nation AS SELECT * FROM samples.tpch.nation;
    CREATE OR REPLACE TABLE region AS SELECT * FROM samples.tpch.region;

Executing Sample Query without PKs

Before executing sample queries we may need to drop any primary keys which may exist after previous tests. We used Code 4 for that purpose.

  • Code 4: Drop any primary keys within the test tables
    ALTER TABLE orders DROP PRIMARY KEY IF EXISTS;
    ALTER TABLE part DROP PRIMARY KEY IF EXISTS;
    ALTER TABLE supplier DROP PRIMARY KEY IF EXISTS;
    ALTER TABLE customer DROP PRIMARY KEY IF EXISTS;
    ALTER TABLE nation DROP PRIMARY KEY IF EXISTS;
    ALTER TABLE region DROP PRIMARY KEY IF EXISTS;

On Code 5, you can see the sample query which we used for testing.

  • Code 5: Test query
    set use_cached_result=false;
    select sum(l_quantity), min(l_tax)
    from lineitem
      left join orders on l_orderkey=o_orderkey
      left join part on l_partkey=p_partkey
      left join supplier on l_suppkey=s_suppkey
      left join customer on o_custkey=c_custkey
      left join nation on c_nationkey=n_nationkey
      left join region on n_regionkey=r_regionkey;

When executed for the very first time (cold execution) we observed the following query execution metrics - see Image 1 below.

  • Total wall-clock duration = 5 s 723 ms
  • Tasks total time = 54.84 s
  • Bytes read = 358.80 MB

AndreyMirskiy_0-1721135999369.png

Image 1: Cold execution without PK constraints

When executed immediately for the second time (warm execution) we observed the following query execution metrics - see Image 2 below.

  • Total wall-clock duration = 4 s 749 ms
  • Tasks total time = 18.96 s
  • Bytes read = 358.80 MB

AndreyMirskiy_2-1721136151105.png

Image 2: Warm execution without PK constraints

Next, we tested the same query using a view that encapsulates the logic of joining tables. In code 6, you can see the DDL we created to encapsulate the test logic  within a view.

  • Code 6: Test query wrapped within a view
    create or replace view v_join_optimization as
    select *
    from lineitem
      left join orders   on l_orderkey=o_orderkey
      left join part     on l_partkey=p_partkey
      left join supplier on l_suppkey=s_suppkey
      left join customer on o_custkey=c_custkey
      left join nation   on c_nationkey=n_nationkey
      left join region   on n_regionkey=r_regionkey;

After restarting the SQL Warehouse to reset the Disk Cache, we executed the test query using a view twice. See Code 7 below.

  • Code 7: Test query wrapped within a view
    set use_cached_result=false;
    select sum(l_quantity), min(l_tax) from v_join_optimization;

When executed for the first time (cold execution) we observed the following query execution metrics - see Image 3 below.

  • Total wall-clock duration = 7 s 850 ms
  • Tasks total time = 55.61 s
  • Bytes read = 358.80 MB

AndreyMirskiy_3-1721136205275.png

Image 3: Cold execution without PK constraints using a view

When executed for the second time (warm execution) we observed the following query execution metrics - see Image 4 below.

  • Total wall-clock duration = 2 s 987 ms
  • Tasks total time = 18.54 s
  • Bytes read = 358.80 MB

AndreyMirskiy_4-1721136258275.png

Image 4: Warm execution without PK constraints using a view

Creating PK Constraints

Next, in order to test the benefits of query optimizations we created primary keys with the RELY option - see Code 7 below.

  • Code 7: Alter the tables to define the primary key using the RELY option
    ALTER TABLE orders ALTER COLUMN o_orderkey SET NOT NULL;
    ALTER TABLE orders ADD PRIMARY KEY (o_orderkey) RELY;
    ALTER TABLE part ALTER COLUMN p_partkey SET NOT NULL;
    ALTER TABLE part ADD PRIMARY KEY (p_partkey) RELY;
    ALTER TABLE supplier ALTER COLUMN s_suppkey SET NOT NULL;
    ALTER TABLE supplier ADD PRIMARY KEY (s_suppkey) RELY;
    ALTER TABLE customer ALTER COLUMN c_custkey SET NOT NULL;
    ALTER TABLE customer ADD PRIMARY KEY (c_custkey) RELY;
    ALTER TABLE nation ALTER COLUMN n_nationkey SET NOT NULL;
    ALTER TABLE nation ADD PRIMARY KEY (n_nationkey) RELY;
    ALTER TABLE region ALTER COLUMN r_regionkey SET NOT NULL;
    ALTER TABLE region ADD PRIMARY KEY (r_regionkey) RELY;

Executing Sample Query with PK Constraints

After restarting SQL Warehouse in order to reset Disk Cache, we executed sample queries again.

When executed for the first time (cold execution) we observed the following query execution metrics - see Image 5.

  • Total wall-clock duration = 5 s 309 ms
  • Tasks total time = 16.03 s
  • Bytes read = 35.90 MB

AndreyMirskiy_0-1721136353262.png

Image 5: Cold query execution with PK RELY constraints

When executed for the second time (warm execution) we observed the following query execution metrics - see Image 6.

  • Total wall-clock duration = 1 s 357 ms
  • Tasks total time = 338 ms
  • Bytes read = 35.90 MB

AndreyMirskiy_1-1721136392831.png

Image 6: Warm query execution with PK RELY constraints

Based on the query metrics we can clearly see that query performance is much better and it consumed less CPU (Tasks total time) and scanned less data (Bytes read).

Finally, after restarting SQL Warehouse again in order to reset Disk Cache, we executed sample queries using views.

When executed for the first time (cold execution) we could see the following query execution metrics - see Image 7.

  • Total wall-clock duration = 5 s 821 ms
  • Tasks total time = 16.99 s
  • Bytes read = 35.90 MB

AndreyMirskiy_2-1721136428281.png

Image 7: Cold execution with PK RELY constraints using a view

When executed for the first time (cold execution) we could see the following query execution metrics - see Image 8.

  • Total wall-clock duration = 1 s 823 ms
  • Tasks total time = 336 ms
  • Bytes read = 35.90 MB

AndreyMirskiy_3-1721136467567.png

Image 8: Warm execution with PK RELY constraints using a view

Summary Results

We have summarized test results in the table below.

Table 1: Statistics on the test run with and without Primary Key RELY enabled 

Scenario

Scenario

Disk Cache

Total wall clock

Tasks total time

Bytes Read

Files read

Plain query

No PKs

Cold

5 s 723 ms

54.84 s

358.8 MB

19

Plain query

No PKs

Warm

4 s 749 ms

18.96 s

358.8 MB

19

View

No PKs

Cold

7 s 850 ms

55.61 s

358.8 MB

19

View

No PKs

Warm

2 s 987 ms

18.54 s

358.8 MB

19

Plain query

PK RELY

Cold

5 s 309 ms

16.03 s

35.9 MB

11

Plain query

PK RELY

Warm

1 s 357 ms

338 ms

35.9 MB

11

View

PK RELY

Cold

5 s 821 ms

16.99 s

35.9 MB

11

View

PK RELY

Warm

1 s 823 ms

336 ms

35.9 MB

11

You can also find total wall clock duration statistics on the chart below; lower is better.

AndreyMirskiy_4-1721136502581.png

We can clearly see that by leveraging primary key RELY constraints Databricks SQL builds a more efficient execution plan, scans less data from storage, and spends less CPU cycles. Which leads to better overall query performance.

The reason for this performance difference becomes obvious when we compare the query profiles. The image below demonstrates the two query profiles - without PK and with PK RELY. We can see that when the query engine leverages information about the primary key the query profile is much simpler. The query optimizer excluded unnecessary joins, hence the query profile is more efficient.

AndreyMirskiy_7-1721136692998.png

Image 9: Query profiles without PK and with PK RELY

JMeter test plan

To see the impact under high workload we implemented a JMeter test plan simulating 10 concurrent users each executing a sample query 100 times.

AndreyMirskiy_10-1721136794176.png

Image 10: JMeter test results - sample query when no PKs

AndreyMirskiy_8-1721136750075.png

Image 11: JMeter test results - sample query when using PK RELY

The results are self-explanatory. The query performance when using PK RELY is significantly better. Not only was the average query performance more than 5 times better, but also the standard deviation was much lower. Meaning that end users will experience more consistent performance. Which is very important in high concurrency BI workloads.

 

Conclusion

In conclusion, leveraging primary key constraints with the RELY option in Databricks SQL can significantly enhance query performance for BI workloads. By specifying these constraints, the Databricks query optimizer can eliminate unnecessary operations such as DISTINCT aggregations and redundant joins, leading to more efficient query execution plans. Our tests demonstrated that queries with primary key constraints executed faster, consumed less CPU, and scanned less data compared to those without constraints. This optimization is particularly beneficial in high concurrency BI workloads, providing consistent and improved performance for end users. Implementing these techniques can help organizations maximize the efficiency and effectiveness of their BI workloads on Databricks SQL.

For more information on Query optimization using primary key constraints, refer to the documentation (Azure | AWS | GCP).

 

Resources

The code artifacts featured in this blog are available in the following GitHub repository.