Databricks Community

AndreyMirskiy · ‎07-16-2024

Authors: Andrey Mirskiy (@AndreyMirskiy) and Marco Scagliola (@MarcoScagliola)

Introduction
Query optimization using primary key constraints - Overview
Demo Scenario
Preparation
Executing Sample Query without PKs
Creating PK Constraints
Executing Sample Query with PK Constraints
Summary Results
JMeter test plan
Conclusion
Resources

Introduction

Welcome to the fifth part of our blog series on 'Why Databricks SQL Serverless is the best fit for BI workloads'.

In the previous blog posts we have covered the following topics:

In this blog post, we will discuss query optimization capability within Databricks SQL which leverages primary key constraints to improve query performance by eliminating unnecessary operations, such as DISTINCT aggregations and unnecessary joins.

This blog explores how to implement and benefit from this optimisation technique, providing practical examples demonstrating how this capability improves query efficiency in Databricks SQL.

Query optimization using primary key constraints - Overview

Databricks SQL Warehouse allows users to specify informational PK and FK constraints. With the introduction of this new query optimization technique, users are now able to specify Primary Key (PK) constraints with the RELY option, allowing the Databricks query optimizer to utilize these constraints to optimize query execution plans and eliminate unnecessary operations. This technique can improve query performance, particularly in BI workloads..

The specification of primary keys with the RELY option in table alteration statements is shown below in Code 1.

Code 1: Creating primary key (PK) constraint with RELY option.

USE CATALOG catalo_name;
ALTER TABLE schema_name.table_name ADD PRIMARY KEY (column_name) RELY;

Currently Databricks SQL leverages this information to optimize query execution in the following scenarios:

A DISTINCT operation on a primary key column can be skipped. Because of the RELY option the engine knows that all values in that column are unique, therefore no need to perform additional operation.
When using a star-schema like model a fact table can be joined with multiple dimension tables. When a query uses LEFT JOIN on the primary key column with the RELY option the engine may exclude that join operation from actual execution because this operation neither increases nor decreases the resultset.

These optimizations lead to more efficient query execution, making Databricks SQL faster and more effective.

Demo Scenario

In this demo scenario we are creating two synthetic environments, in order to create an A/B test on tables without any Primary Key (PK) defined and the same tables with Primary Key (PK) defined with the RELY option.

Preparation

First, we created test tables by replicating some of the tables from the tpch schema contained within samples catalog.

Code 2: Create a Catalog and Schema in order to run the test.

CREATE CATALOG IF NOT EXISTS join_optimization;
CREATE SCHEMA IF NOT EXISTS join_optimization.tpch;

Code 3: Create the test tables as CTAS

CREATE OR REPLACE TABLE lineitem AS SELECT * FROM samples.tpch.lineitem;
CREATE OR REPLACE TABLE orders AS SELECT * FROM samples.tpch.orders;
CREATE OR REPLACE TABLE part AS SELECT * FROM samples.tpch.part;
CREATE OR REPLACE TABLE supplier AS SELECT * FROM samples.tpch.supplier;
CREATE OR REPLACE TABLE customer AS SELECT * FROM samples.tpch.customer;
CREATE OR REPLACE TABLE nation AS SELECT * FROM samples.tpch.nation;
CREATE OR REPLACE TABLE region AS SELECT * FROM samples.tpch.region;

Executing Sample Query without PKs

Before executing sample queries we may need to drop any primary keys which may exist after previous tests. We used Code 4 for that purpose.

Code 4: Drop any primary keys within the test tables

ALTER TABLE orders DROP PRIMARY KEY IF EXISTS;
ALTER TABLE part DROP PRIMARY KEY IF EXISTS;
ALTER TABLE supplier DROP PRIMARY KEY IF EXISTS;
ALTER TABLE customer DROP PRIMARY KEY IF EXISTS;
ALTER TABLE nation DROP PRIMARY KEY IF EXISTS;
ALTER TABLE region DROP PRIMARY KEY IF EXISTS;

On Code 5, you can see the sample query which we used for testing.

Code 5: Test query

set use_cached_result=false;
select sum(l_quantity), min(l_tax)
from lineitem
  left join orders on l_orderkey=o_orderkey
  left join part on l_partkey=p_partkey
  left join supplier on l_suppkey=s_suppkey
  left join customer on o_custkey=c_custkey
  left join nation on c_nationkey=n_nationkey
  left join region on n_regionkey=r_regionkey;

When executed for the very first time (cold execution) we observed the following query execution metrics - see Image 1 below.

Total wall-clock duration = 5 s 723 ms
Tasks total time = 54.84 s
Bytes read = 358.80 MB

Image 1: Cold execution without PK constraints

When executed immediately for the second time (warm execution) we observed the following query execution metrics - see Image 2 below.

Total wall-clock duration = 4 s 749 ms
Tasks total time = 18.96 s
Bytes read = 358.80 MB

Image 2: Warm execution without PK constraints

Next, we tested the same query using a view that encapsulates the logic of joining tables. In code 6, you can see the DDL we created to encapsulate the test logic within a view.

Code 6: Test query wrapped within a view

create or replace view v_join_optimization as
select *
from lineitem
  left join orders   on l_orderkey=o_orderkey
  left join part     on l_partkey=p_partkey
  left join supplier on l_suppkey=s_suppkey
  left join customer on o_custkey=c_custkey
  left join nation   on c_nationkey=n_nationkey
  left join region   on n_regionkey=r_regionkey;

After restarting the SQL Warehouse to reset the Disk Cache, we executed the test query using a view twice. See Code 7 below.

Code 7: Test query wrapped within a view

set use_cached_result=false;
select sum(l_quantity), min(l_tax) from v_join_optimization;

When executed for the first time (cold execution) we observed the following query execution metrics - see Image 3 below.

Total wall-clock duration = 7 s 850 ms
Tasks total time = 55.61 s
Bytes read = 358.80 MB

Image 3: Cold execution without PK constraints using a view

When executed for the second time (warm execution) we observed the following query execution metrics - see Image 4 below.

Total wall-clock duration = 2 s 987 ms
Tasks total time = 18.54 s
Bytes read = 358.80 MB

Image 4: Warm execution without PK constraints using a view

Creating PK Constraints

Next, in order to test the benefits of query optimizations we created primary keys with the RELY option - see Code 7 below.

Code 7: Alter the tables to define the primary key using the RELY option

ALTER TABLE orders ALTER COLUMN o_orderkey SET NOT NULL;
ALTER TABLE orders ADD PRIMARY KEY (o_orderkey) RELY;
ALTER TABLE part ALTER COLUMN p_partkey SET NOT NULL;
ALTER TABLE part ADD PRIMARY KEY (p_partkey) RELY;
ALTER TABLE supplier ALTER COLUMN s_suppkey SET NOT NULL;
ALTER TABLE supplier ADD PRIMARY KEY (s_suppkey) RELY;
ALTER TABLE customer ALTER COLUMN c_custkey SET NOT NULL;
ALTER TABLE customer ADD PRIMARY KEY (c_custkey) RELY;
ALTER TABLE nation ALTER COLUMN n_nationkey SET NOT NULL;
ALTER TABLE nation ADD PRIMARY KEY (n_nationkey) RELY;
ALTER TABLE region ALTER COLUMN r_regionkey SET NOT NULL;
ALTER TABLE region ADD PRIMARY KEY (r_regionkey) RELY;

Executing Sample Query with PK Constraints

After restarting SQL Warehouse in order to reset Disk Cache, we executed sample queries again.

When executed for the first time (cold execution) we observed the following query execution metrics - see Image 5.

Total wall-clock duration = 5 s 309 ms
Tasks total time = 16.03 s
Bytes read = 35.90 MB

Image 5: Cold query execution with PK RELY constraints

When executed for the second time (warm execution) we observed the following query execution metrics - see Image 6.

Total wall-clock duration = 1 s 357 ms
Tasks total time = 338 ms
Bytes read = 35.90 MB

Image 6: Warm query execution with PK RELY constraints

Based on the query metrics we can clearly see that query performance is much better and it consumed less CPU (Tasks total time) and scanned less data (Bytes read).

Finally, after restarting SQL Warehouse again in order to reset Disk Cache, we executed sample queries using views.

When executed for the first time (cold execution) we could see the following query execution metrics - see Image 7.

Total wall-clock duration = 5 s 821 ms
Tasks total time = 16.99 s
Bytes read = 35.90 MB

Image 7: Cold execution with PK RELY constraints using a view

When executed for the first time (cold execution) we could see the following query execution metrics - see Image 8.

Total wall-clock duration = 1 s 823 ms
Tasks total time = 336 ms
Bytes read = 35.90 MB

Image 8: Warm execution with PK RELY constraints using a view

Summary Results

We have summarized test results in the table below.

Table 1: Statistics on the test run with and without Primary Key RELY enabled

Scenario	Scenario	Disk Cache	Total wall clock	Tasks total time	Bytes Read	Files read
Plain query	No PKs	Cold	5 s 723 ms	54.84 s	358.8 MB	19
Plain query	No PKs	Warm	4 s 749 ms	18.96 s	358.8 MB	19
View	No PKs	Cold	7 s 850 ms	55.61 s	358.8 MB	19
View	No PKs	Warm	2 s 987 ms	18.54 s	358.8 MB	19
Plain query	PK RELY	Cold	5 s 309 ms	16.03 s	35.9 MB	11
Plain query	PK RELY	Warm	1 s 357 ms	338 ms	35.9 MB	11
View	PK RELY	Cold	5 s 821 ms	16.99 s	35.9 MB	11
View	PK RELY	Warm	1 s 823 ms	336 ms	35.9 MB	11

You can also find total wall clock duration statistics on the chart below; lower is better.

We can clearly see that by leveraging primary key RELY constraints Databricks SQL builds a more efficient execution plan, scans less data from storage, and spends less CPU cycles. Which leads to better overall query performance.

The reason for this performance difference becomes obvious when we compare the query profiles. The image below demonstrates the two query profiles - without PK and with PK RELY. We can see that when the query engine leverages information about the primary key the query profile is much simpler. The query optimizer excluded unnecessary joins, hence the query profile is more efficient.

Image 9: Query profiles without PK and with PK RELY

JMeter test plan

To see the impact under high workload we implemented a JMeter test plan simulating 10 concurrent users each executing a sample query 100 times.

Image 10: JMeter test results - sample query when no PKs

Image 11: JMeter test results - sample query when using PK RELY

The results are self-explanatory. The query performance when using PK RELY is significantly better. Not only was the average query performance more than 5 times better, but also the standard deviation was much lower. Meaning that end users will experience more consistent performance. Which is very important in high concurrency BI workloads.

Conclusion

In conclusion, leveraging primary key constraints with the RELY option in Databricks SQL can significantly enhance query performance for BI workloads. By specifying these constraints, the Databricks query optimizer can eliminate unnecessary operations such as DISTINCT aggregations and redundant joins, leading to more efficient query execution plans. Our tests demonstrated that queries with primary key constraints executed faster, consumed less CPU, and scanned less data compared to those without constraints. This optimization is particularly beneficial in high concurrency BI workloads, providing consistent and improved performance for end users. Implementing these techniques can help organizations maximize the efficiency and effectiveness of their BI workloads on Databricks SQL.

For more information on Query optimization using primary key constraints, refer to the documentation (Azure | AWS | GCP).

Resources

The code artifacts featured in this blog are available in the following GitHub repository.

Databricks Community

Why DBSQL is Best for BI Workloads - Part 5: Query Optimization with Primary Key Constraints

Introduction

Query optimization using primary key constraints - Overview

Demo Scenario

Preparation

Executing Sample Query without PKs

Creating PK Constraints

Executing Sample Query with PK Constraints

Summary Results

JMeter test plan

Conclusion

Resources

Best practices for safe data experimentation with Databricks

Top 10 query performance tuning tips for Databricks Serverless SQL

UC Catalog Cloning: An Automated Approach