Authors: Andrey Mirskiy (@AndreyMirskiy) and Marco Scagliola (@MarcoScagliola)
Welcome to the fifth part of our blog series on 'Why Databricks SQL Serverless is the best fit for BI workloads'.
In the previous blog posts we have covered the following topics:
In this blog post, we will discuss query optimization capability within Databricks SQL which leverages primary key constraints to improve query performance by eliminating unnecessary operations, such as DISTINCT aggregations and unnecessary joins.
This blog explores how to implement and benefit from this optimisation technique, providing practical examples demonstrating how this capability improves query efficiency in Databricks SQL.
Databricks SQL Warehouse allows users to specify informational PK and FK constraints. With the introduction of this new query optimization technique, users are now able to specify Primary Key (PK) constraints with the RELY option, allowing the Databricks query optimizer to utilize these constraints to optimize query execution plans and eliminate unnecessary operations. This technique can improve query performance, particularly in BI workloads..
The specification of primary keys with the RELY option in table alteration statements is shown below in Code 1.
USE CATALOG catalo_name;
ALTER TABLE schema_name.table_name ADD PRIMARY KEY (column_name) RELY;
Currently Databricks SQL leverages this information to optimize query execution in the following scenarios:
These optimizations lead to more efficient query execution, making Databricks SQL faster and more effective.
In this demo scenario we are creating two synthetic environments, in order to create an A/B test on tables without any Primary Key (PK) defined and the same tables with Primary Key (PK) defined with the RELY option.
First, we created test tables by replicating some of the tables from the tpch schema contained within samples catalog.
CREATE CATALOG IF NOT EXISTS join_optimization;
CREATE SCHEMA IF NOT EXISTS join_optimization.tpch;
CREATE OR REPLACE TABLE lineitem AS SELECT * FROM samples.tpch.lineitem;
CREATE OR REPLACE TABLE orders AS SELECT * FROM samples.tpch.orders;
CREATE OR REPLACE TABLE part AS SELECT * FROM samples.tpch.part;
CREATE OR REPLACE TABLE supplier AS SELECT * FROM samples.tpch.supplier;
CREATE OR REPLACE TABLE customer AS SELECT * FROM samples.tpch.customer;
CREATE OR REPLACE TABLE nation AS SELECT * FROM samples.tpch.nation;
CREATE OR REPLACE TABLE region AS SELECT * FROM samples.tpch.region;
Before executing sample queries we may need to drop any primary keys which may exist after previous tests. We used Code 4 for that purpose.
ALTER TABLE orders DROP PRIMARY KEY IF EXISTS;
ALTER TABLE part DROP PRIMARY KEY IF EXISTS;
ALTER TABLE supplier DROP PRIMARY KEY IF EXISTS;
ALTER TABLE customer DROP PRIMARY KEY IF EXISTS;
ALTER TABLE nation DROP PRIMARY KEY IF EXISTS;
ALTER TABLE region DROP PRIMARY KEY IF EXISTS;
On Code 5, you can see the sample query which we used for testing.
set use_cached_result=false;
select sum(l_quantity), min(l_tax)
from lineitem
left join orders on l_orderkey=o_orderkey
left join part on l_partkey=p_partkey
left join supplier on l_suppkey=s_suppkey
left join customer on o_custkey=c_custkey
left join nation on c_nationkey=n_nationkey
left join region on n_regionkey=r_regionkey;
When executed for the very first time (cold execution) we observed the following query execution metrics - see Image 1 below.
Image 1: Cold execution without PK constraints
When executed immediately for the second time (warm execution) we observed the following query execution metrics - see Image 2 below.
Image 2: Warm execution without PK constraints
Next, we tested the same query using a view that encapsulates the logic of joining tables. In code 6, you can see the DDL we created to encapsulate the test logic within a view.
create or replace view v_join_optimization as
select *
from lineitem
left join orders on l_orderkey=o_orderkey
left join part on l_partkey=p_partkey
left join supplier on l_suppkey=s_suppkey
left join customer on o_custkey=c_custkey
left join nation on c_nationkey=n_nationkey
left join region on n_regionkey=r_regionkey;
After restarting the SQL Warehouse to reset the Disk Cache, we executed the test query using a view twice. See Code 7 below.
set use_cached_result=false;
select sum(l_quantity), min(l_tax) from v_join_optimization;
When executed for the first time (cold execution) we observed the following query execution metrics - see Image 3 below.
Image 3: Cold execution without PK constraints using a view
When executed for the second time (warm execution) we observed the following query execution metrics - see Image 4 below.
Image 4: Warm execution without PK constraints using a view
Next, in order to test the benefits of query optimizations we created primary keys with the RELY option - see Code 7 below.
ALTER TABLE orders ALTER COLUMN o_orderkey SET NOT NULL;
ALTER TABLE orders ADD PRIMARY KEY (o_orderkey) RELY;
ALTER TABLE part ALTER COLUMN p_partkey SET NOT NULL;
ALTER TABLE part ADD PRIMARY KEY (p_partkey) RELY;
ALTER TABLE supplier ALTER COLUMN s_suppkey SET NOT NULL;
ALTER TABLE supplier ADD PRIMARY KEY (s_suppkey) RELY;
ALTER TABLE customer ALTER COLUMN c_custkey SET NOT NULL;
ALTER TABLE customer ADD PRIMARY KEY (c_custkey) RELY;
ALTER TABLE nation ALTER COLUMN n_nationkey SET NOT NULL;
ALTER TABLE nation ADD PRIMARY KEY (n_nationkey) RELY;
ALTER TABLE region ALTER COLUMN r_regionkey SET NOT NULL;
ALTER TABLE region ADD PRIMARY KEY (r_regionkey) RELY;
After restarting SQL Warehouse in order to reset Disk Cache, we executed sample queries again.
When executed for the first time (cold execution) we observed the following query execution metrics - see Image 5.
Image 5: Cold query execution with PK RELY constraints
When executed for the second time (warm execution) we observed the following query execution metrics - see Image 6.
Image 6: Warm query execution with PK RELY constraints
Based on the query metrics we can clearly see that query performance is much better and it consumed less CPU (Tasks total time) and scanned less data (Bytes read).
Finally, after restarting SQL Warehouse again in order to reset Disk Cache, we executed sample queries using views.
When executed for the first time (cold execution) we could see the following query execution metrics - see Image 7.
Image 7: Cold execution with PK RELY constraints using a view
When executed for the first time (cold execution) we could see the following query execution metrics - see Image 8.
Image 8: Warm execution with PK RELY constraints using a view
We have summarized test results in the table below.
Table 1: Statistics on the test run with and without Primary Key RELY enabled
Scenario |
Scenario |
Disk Cache |
Total wall clock |
Tasks total time |
Bytes Read |
Files read |
Plain query |
No PKs |
Cold |
5 s 723 ms |
54.84 s |
358.8 MB |
19 |
Plain query |
No PKs |
Warm |
4 s 749 ms |
18.96 s |
358.8 MB |
19 |
View |
No PKs |
Cold |
7 s 850 ms |
55.61 s |
358.8 MB |
19 |
View |
No PKs |
Warm |
2 s 987 ms |
18.54 s |
358.8 MB |
19 |
Plain query |
PK RELY |
Cold |
5 s 309 ms |
16.03 s |
35.9 MB |
11 |
Plain query |
PK RELY |
Warm |
1 s 357 ms |
338 ms |
35.9 MB |
11 |
View |
PK RELY |
Cold |
5 s 821 ms |
16.99 s |
35.9 MB |
11 |
View |
PK RELY |
Warm |
1 s 823 ms |
336 ms |
35.9 MB |
11 |
You can also find total wall clock duration statistics on the chart below; lower is better.
We can clearly see that by leveraging primary key RELY constraints Databricks SQL builds a more efficient execution plan, scans less data from storage, and spends less CPU cycles. Which leads to better overall query performance.
The reason for this performance difference becomes obvious when we compare the query profiles. The image below demonstrates the two query profiles - without PK and with PK RELY. We can see that when the query engine leverages information about the primary key the query profile is much simpler. The query optimizer excluded unnecessary joins, hence the query profile is more efficient.
Image 9: Query profiles without PK and with PK RELY
To see the impact under high workload we implemented a JMeter test plan simulating 10 concurrent users each executing a sample query 100 times.
Image 10: JMeter test results - sample query when no PKs
Image 11: JMeter test results - sample query when using PK RELY
The results are self-explanatory. The query performance when using PK RELY is significantly better. Not only was the average query performance more than 5 times better, but also the standard deviation was much lower. Meaning that end users will experience more consistent performance. Which is very important in high concurrency BI workloads.
In conclusion, leveraging primary key constraints with the RELY option in Databricks SQL can significantly enhance query performance for BI workloads. By specifying these constraints, the Databricks query optimizer can eliminate unnecessary operations such as DISTINCT aggregations and redundant joins, leading to more efficient query execution plans. Our tests demonstrated that queries with primary key constraints executed faster, consumed less CPU, and scanned less data compared to those without constraints. This optimization is particularly beneficial in high concurrency BI workloads, providing consistent and improved performance for end users. Implementing these techniques can help organizations maximize the efficiency and effectiveness of their BI workloads on Databricks SQL.
For more information on Query optimization using primary key constraints, refer to the documentation (Azure | AWS | GCP).
The code artifacts featured in this blog are available in the following GitHub repository.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.