Databricks Community

AndreyMirskiy · ‎01-29-2024

Authors: Andrey Mirskiy (@AndreyMirskiy) and Marco Scagliola (@MarcoScagliola)

Introduction

Welcome to the second part of our blog series on “Why Databricks SQL Serverless is the best fit for BI workloads”.

Previous posts in the series:

Part 1: Disk Cache

This blog post touches on best practices for implementing performance test cases on Databricks SQL Warehouse, leveraging Apache JMeter, a widely used open-source testing tool. Our goal is to showcase various techniques for creating modular test plans and metrics that will enhance your performance testing strategy, enabling a data-driven approach to ensure the efficient operation of workloads on your Databricks SQL Warehouse instance.

For performance and testing purposes in the context of BI workloads, it's essential to replicate the actions of various end-users, especially during high-concurrency workloads. This method is commonly referred to as a "Concurrency Test". Within BI workloads, this test verifies that the system can accommodate the anticipated number of concurrent users without affecting performance.

In this blog, we’ll show you how to use Apache JMeter, an open-source load testing and performance measurement, to develop robust concurrency tests for Databricks SQL.

Apache JMeter Overview

Apache JMeter™, developed by the Apache Software Foundation, is an open-source tool, primarily designed for load testing, performance measurement, and functional testing of various applications.

Personalizing the environment of Apache JMeter for testing BI workloads involves configuring its settings and extensions to cater specifically to the nuances of BI systems. Here are steps and suggestions on how to tailor the JMeter environment for BI testing:

Database Testing Configuration: Set up JMeter's JDBC Request to connect to your BI database systems. Save connection configurations for different databases you frequently test against.
Custom SQL Queries: Store and organize frequently used SQL queries, especially those that represent typical BI tasks like aggregation, joins, or window functions.
Report Simulation: If your BI system includes reporting tools, configure HTTP or Web Service samplers to simulate user interactions with these reports, mimicking real-world report generation and viewing.
Data Load Profiles: Create test plans that simulate typical BI data loading processes, whether ETL jobs or real-time data ingestion.
Plugins for BI Systems: Install plugins, if available, that enhance JMeter's capability to interact with specific BI tools or platforms you use.
Parameterization: Utilize Variables, Parameters, Scripting, CSV Data Set Config or other parameterization methods to feed varied datasets into your BI tests, simulating different user queries and data scenarios.
Assertions for Data Integrity: Configure assertions to validate the correctness of data processing, ensuring that BI transformations, calculations, and aggregations are accurate.
Extensible Listeners for BI Metrics: Integrate built-in or develop custom listeners to capture and visualize BI-specific metrics, such as query execution time, data processing latency, or cache hit rates.
Distributed Testing for Large number of users: Set up distributed testing for scenarios where a large number of users are involved, ensuring that the BI system can handle enterprise-scale data loads.

By tailoring JMeter's environment specifically for BI workloads, testers can ensure a more accurate simulation of real-world BI operations, capture relevant performance metrics, and ensure the reliability and efficiency of BI systems under various conditions.

JMeter Installation

To install Apache JMeter, download the binaries from its official download page and then unzip them on your local machine.

Please note that Apache JMeter requires Java 8 or higher. You can verify the version of Java on your machine by running the following command.

Code 1: Verify the Java version installed.
```
java -version
```

Additionally, for MacOS users, JMeter can be installed using Homebrew. Obviously this assumes that Homebrew is installed on the MacOS environment.

Code 2: Verify the Homebrew version installed.
```
brew -version
```

Code 3: Installation of JMeter on MacOS using Homebrew.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install jmeter

To connect to Databricks SQL, and execute SQL-queries, the Databricks JDBC-driver must be installed and configured in Apache JMeter.

Download the latest version of Databricks JDBC-driver here.
Unzip the archive.
Make DatabricksJDBC**.jar file available for JMeter following one of the options

Copy DatabricksJDBC**.jar file to JMeter lib directory to make it available for all JMeter test plans.
Add the path to DatabricksJDBC**.jar file to Library paths when configuring JMeter test plan. This will make JDBC-driver available to that specific test plan.

Image 1: Adding the path to Databricks JDBC-driver

JMeter Test Plans

In this section we are delving into the creation of a test plan in Apache JMeter. Essentially, a test plan describes a series of steps JMeter will execute when run. A complete test plan will consist of one or more Thread Groups, configuration elements, listeners, and other elements.

A test plan is the root element in JMeter, under which all other elements are nested. It defines the overall settings for a test, such as user-defined variables, loading and saving configurations, and more.

Following in this blog post, we are going to demonstrate how to create:

A Basic test plan: This encompasses a database connection followed by a direct query execution.
An Advanced test plan: This involves a database connection and the implementation of an exhaustive testing strategy.

Basic Test Plan

In this section, we delve into the process of creating a basic JMeter test plan, outlining each step necessary to set up a simple testing scenario effectively. The basic test plan in JMeter is structured as follows.

Image 2: JMeter Basic Test Plan

Let’s look at the individual components of this test plan.

The root node of the test plan is the test plan itself. On this JMeter object, we can specify common variables that will be used throughout the test plan, as well as the paths for the plan’s required jar-files. Here we usually specify the path to Databricks JDBC driver jar-file (see previous section for details).
The next item we need to configure is Databricks SQL connection definition.
The key properties to include are:
- Database URL - this is the actual JDBC connection string.
- Username - this must be set to “token” as JMeter supports only PAT authentication (Personal Access Token).
- Password - this should be set to the actual value of PAT.
  
  Image 3: JDBC Connection Configuration
  
  A sample connection string looks as follows.
  
  Code 4: Example of JDBC connection string.
```
jdbc:databricks://********.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/******;ConnCatalog=samples;ConnSchema=tpch
```
A Thread group in Apache JMeter defines the simulated workload, which includes the list of requests (such as SQL-queries) to be executed, the number of threads (representing simulated users), and the number of iterations.

Image 4: Thread Group configuration
In this example, the Thread Group is configured to simulate 10 users, with each user executing 2 sample queries sequentially. The loop count is set to 3, meaning that each simulated user will execute the sample query 3 times.

Sample queries, in this context, are based on the JDBC Request component in JMeter and basically include sample SQL-queries.

Code 5: Example of SQL query.

select
   sum(l_extendedprice * l_discount) as revenue
from
   lineitem
where
   l_shipdate >= date '1995-01-01'
   and l_shipdate < date '1995-01-01' + interval '1' year
   and l_discount between 0.08 - 0.01 and 0.08 + 0.01
   and l_quantity < 25;

Finally, JMeter includes several components known as listeners, which collect test execution statistics in various forms. In our example we have:
1. View Results Tree - this listener collects statistics for every executed request, including the request body (SQL-query), the response (result set), and the execution time.
2. Summary Report - this collects summarized statistics for every query, including minimum, maximum, and average execution times.
3. View Results in Table - this collects statistics into a straightforward tabular list.

Advanced Test Plan

While the Basic Test Plan enables test practitioners to conduct simple performance tests, it may not offer the extensive flexibility often required for comprehensive performance testing. Therefore, we recommend an Advanced Test Plan, which provides additional flexibility and more configuration options.

Disable Query Result Cache

In the Databricks SQL Query History we often find that queries hit the Query Result Cache, resulting in extremely fast responses. While this technique benefits end-user experience in production environments, it is typically not ideal for load/performance testing. In such tests we aim to test the actual query performance, not the performance of the cache. To eliminate the impact of Query Result Cache we can disable it. This can be achieved through two approaches listed below.

Option one: Disable Query Result Cache at query level

To disable the Query Result Cache at query level we can use the following command which is to be executed prior to any actual test query.

Code 6: Example of disabling the SQL query result cache.
```
SET use_cached_result = false;
```

To ensure this command is executed before actual test queries, we need to add a JDBC PreProcessor task in JMeter. This can be achieved by navigating from 'Thread Group', then selecting 'Add', followed by 'Pre Processors', and finally choosing 'JDBC PreProcessor'. As illustrated in image 5

Image 5: Disable Query Result Cache using JDBC PreProcessor

Option two: Disable Query Result Cache at connection level

Although the previously mentioned method is effective as it is self-explanatory and visual, it has a performance impact since every call requires a network hop and few extra milliseconds to execute the statement on the Databricks SQL end.

We can save this time by leveraging a connection string parameter to disable the Query Result Cache (QRC) at connection level.

Code 7: Example of disabling the SQL query result cache at the connection string level.

jdbc:databricks://******.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/******;ConnCatalog=samples;ConnSchema=tpch;use_cached_result=false;

By leveraging this technique, we can now disable or completely remove the “Disable QRC” JDBC PreProcessor step, as QRC is already disabled at the connection level for all queries which use that specific JDBC connection.

Disk Cache Warm-up

In the previous blog post we discussed Disk Cache and its impact on SQL query performance. From a performance testing perspective, Disk Cache is significant because during cold runs (when the cache is not yet populated) we may observe significantly lower performance, which would impact overall results.

Image 6: IO statistics when running a query on a cold cache

Therefore, we may want to warm up the Disk Cache before executing any test queries. This can be achieved by leveraging the so-called setUp Thread Group, which is executed only once before any other Thread Groups are run. The process involves the following steps:

Navigate to the ‘Test Plan’, then select ‘Add’, followed by 'Threads', and finally choose the ‘setUp Thread Group’.
On the ‘setUp Thread Group’, select ‘Add’, followed by ‘Sampler’, and finally choose the ‘JDBC Request’.
Configure the ‘JDBC Request’ with the following parameters:

Variable Name = use variable name for your JDBC connection
Query Type = Callable Statement
SQL Query
```
CACHE SELECT * FROM lineitem;
```

If needed, add more requests/queries for other relevant tables within your testing scope.

Image 7: Disk Cache warm-up

When running the entire test plan, we can observe the queries hitting the Disk Cache by checking the “Bytes read from cache” metric in the Query History pane of the query.

Image 8: IO statistics when running a query on a warm cache

Query Parameterization

Now, let’s examine the following query.

Code 8: Example of disabling the SQL query result cache at the connection string level.

select
   sum(l_extendedprice * l_discount) as revenue
from
   lineitem
where
   l_shipdate >= date '1995-01-01'
   and l_shipdate < date '1995-01-01' + interval '1' year
   and l_discount between 0.08 - 0.01 and 0.08 + 0.01
   and l_quantity < 25;

Note the filter on the ship date column (l_shipdate). Currently, the query uses a fixed internal. But what if we want to randomize the interval each time the query is executed?

To accomplish this, we can include a Random Variable configuration element to the respective JDBC request in our Test Plan.

Navigate to ‘JDBC Request’, then select ‘Add’, followed by ‘Config Element’, and finally choose ‘Random Variable’.
Configure the ‘Random Variable’ with the following parameters:
- Variable Name = YEAR
- Minimum Value = 1
- Maximum Value = 5
- Per Thread = True
  
  Image 9: Random Variable parameters configuration

Update the SQL-query by replacing the scalar value with a macro-variable.

Code 9: Example of SQL query having a macro-variable.

select
   sum(l_extendedprice * l_discount) as revenue
from
   lineitem
where
   l_shipdate >= date '1995-01-01'
   and l_shipdate < date '1995-01-01' + interval '${YEAR}' year
   and l_discount between 0.08 - 0.01 and 0.08 + 0.01
   and l_quantity < 25;

The query can be further enhanced by parameterizing the date using JMeter built-in scripting.

Code 10: Example of SQL query having a parameterized date.

select
   sum(l_extendedprice * l_discount) as revenue
from
   lineitem
where
   l_shipdate >= date '${__RandomDate(,1995-01-01,2002-01-01,,sampleDate)}'
   and l_shipdate < date '${sampleDate}' + interval '${YEAR}' year
   and l_discount between 0.08 - 0.01 and 0.08 + 0.01
   and l_quantity < 25;

Image 10: Test Plan with parameterized query

Test Plan Configuration

Variables

Now when we have implemented the entire workload in our Test Plan, it’s a good practice to parameterize it to enable quick changes of all parameters in a single location.

To do this, let’s introduce several User Defined Variables at the root Test Plan level and update the respective Test Plan items accordingly.

HostName, specifies the host name of the target Databricks SQL Warehouse instance on which you want to perform performance testing. It defines the location where JMeter will send JDBC requests during the test.
HttpPath, refers to the specific HTTP path of the target Databricks SQL Warehouse instance you want to access or test..
Catalog, represents a Catalog object within a Databricks Unity Catalog.
Schema, represents a Schema or Database within a Databricks Unity Catalog.
UseCachedResult, this configuration determines whether Databricks SQL should utilize Query Result Cache (previously cached resultsets).
NumberOfThreads, specifies the number of concurrent virtual users or threads that JMeter will simulate during the test.
LoopCount, sets the number of times each virtual user (thread) will repeat the test scenario.

Image 11: Test Plan with User Defined Variables

By applying the parameterization technique, the JDBC connection string should look like the following:

Code 11: Example of parameterized JDBC connection string.

jdbc:databricks://${HostName}:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=${HttpPath};ConnCatalog=${Catalog};ConnSchema=${Schema};use_cached_result=${UseCachedResult};

While the Thread Group settings should look like the following:

Image 12: Parameterized Thread Group

At this point, we can easily find and change all important configuration items at the root of the Test Plan. For example, we can easily:

Change HttpPath variable to run the test on another SQL Warehouse endpoint, e.g. different size or scaling options;
Change default Catalog or Schema to execute queries against a different scale dataset;
Change UseCachedResult to validate the impact of Query Result Cache on query performance;
Change NumberOfThreads and LoopCount to simulate different numbers of users and repeat the queries multiple times.

Parameters

The approach mentioned above with User Defined Variables works well when using JMeter’s GUI. However, in real testing scenarios, it is recommended to use the command line version of JMeter to run the tests while eliminating the impact of the GUI. Nevertheless, JMeter CLI supports what are known as ‘parameters’ which are not the same as the ‘variables’ we previously configured.

To utilize parameters, we need to change the way we configure User Defined Parameters. It should now follow the format below:

Code 12: Example of the User Defined parameters syntax.
```
${__P(parameter_name,default_value)}
```

Here:

parameter_name - is the name of a parameter used by JMeter CLI.
default_value - is the default value used by JMeter if the value is not provided via command line.

Therefore, the User Defined Variables for the Test Plan should appear as depicted in image 12.

Image 13: User Defined Variables using Parameters

Now we can specify configuration parameters directly in the command line. In the following example we specify two parameters - UseCachedResult and NumberOfUsers.

Code 13: Example of command line with two parameters defined.

jmeter -n -t "Advanced Configuration.jmx" -JUseCachedResult=false -JNumberOfUsers=10 -l "Advanced Configuration log.csv"

user.properties file

When running JMeter from the command line, it is convenient to use the ‘user.properties’ file to provide parameter values in an easy-to-read key and value format within a text file. Building on the previous example, the file would look as follows:

Code 14: Example of user.properties file.

HostName=******.azuredatabricks.net
HttpPath=/sql/1.0/warehouses/******
Catalog=samples
Schema=tpch
UseCachedResult=false
NumberOfThreads=5
LoopCount=100

Now, we can directly specify the properties file in the command line. This simplifies the process of conducting performance tests against multiple configurations by using different properties files.

Code 15: Example of command line with user.properties file.

jmeter -n -t "Advanced Configuration.jmx" -p user.properties

Please note the importance of the case sensitivity; ‘-p’ and ‘-P’ are two distinct command line options.

Conclusion

In conclusion, this blog post has delved into different approaches to implement performance-based test plans leveraging Apache JMeter capabilities when conducting basic and advanced performance test plans against Databricks SQL Warehouse workloads. This integration with commonly used open-source tools demonstrates the flexibility and adaptability of Databricks SQL Warehouse.

Data engineers and testers can meticulously tailor their test plans to replicate real-world scenarios, comprehensively evaluate system performance, and pinpoint potential bottlenecks using parameterization and user-defined variables.

JMeter's flexibility, inherent scalability, and seamless integration with Databricks SQL Warehouse position it as an optimal tool for experimenting with the robustness and efficiency of data-driven applications.

Furthermore, we have demonstrated how to implement test repeatability in Apache JMeter, allowing testers to run the same test scenarios consistently and observe how the system behaves over time.

Its built-in metrics and reporting features provide insightful data on various performance aspects, enabling data engineers to make informed decisions and continuously improve system performance.

In summary, Databricks SQL Warehouse's inherent scalability and seamless integration with widely used performance and testing tools tangibly denotes, through testing and experimenting, its robustness and efficiency for intense data-driven applications.