Authors: Andrey Mirskiy (@AndreyMirskiy) and Marco Scagliola (@MarcoScagliola)
Welcome to the second part of our blog series on “Why Databricks SQL Serverless is the best fit for BI workloads”.
Previous posts in the series:
This blog post touches on best practices for implementing performance test cases on Databricks SQL Warehouse, leveraging Apache JMeter, a widely used open-source testing tool. Our goal is to showcase various techniques for creating modular test plans and metrics that will enhance your performance testing strategy, enabling a data-driven approach to ensure the efficient operation of workloads on your Databricks SQL Warehouse instance.
For performance and testing purposes in the context of BI workloads, it's essential to replicate the actions of various end-users, especially during high-concurrency workloads. This method is commonly referred to as a "Concurrency Test". Within BI workloads, this test verifies that the system can accommodate the anticipated number of concurrent users without affecting performance.
In this blog, we’ll show you how to use Apache JMeter, an open-source load testing and performance measurement, to develop robust concurrency tests for Databricks SQL.
Apache JMeter™, developed by the Apache Software Foundation, is an open-source tool, primarily designed for load testing, performance measurement, and functional testing of various applications.
Personalizing the environment of Apache JMeter for testing BI workloads involves configuring its settings and extensions to cater specifically to the nuances of BI systems. Here are steps and suggestions on how to tailor the JMeter environment for BI testing:
By tailoring JMeter's environment specifically for BI workloads, testers can ensure a more accurate simulation of real-world BI operations, capture relevant performance metrics, and ensure the reliability and efficiency of BI systems under various conditions.
To install Apache JMeter, download the binaries from its official download page and then unzip them on your local machine.
Please note that Apache JMeter requires Java 8 or higher. You can verify the version of Java on your machine by running the following command.
java -version
Additionally, for MacOS users, JMeter can be installed using Homebrew. Obviously this assumes that Homebrew is installed on the MacOS environment.
brew -version
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install jmeter
To connect to Databricks SQL, and execute SQL-queries, the Databricks JDBC-driver must be installed and configured in Apache JMeter.
Image 1: Adding the path to Databricks JDBC-driver
In this section we are delving into the creation of a test plan in Apache JMeter. Essentially, a test plan describes a series of steps JMeter will execute when run. A complete test plan will consist of one or more Thread Groups, configuration elements, listeners, and other elements.
A test plan is the root element in JMeter, under which all other elements are nested. It defines the overall settings for a test, such as user-defined variables, loading and saving configurations, and more.
Following in this blog post, we are going to demonstrate how to create:
In this section, we delve into the process of creating a basic JMeter test plan, outlining each step necessary to set up a simple testing scenario effectively. The basic test plan in JMeter is structured as follows.
Image 2: JMeter Basic Test Plan
Let’s look at the individual components of this test plan.
Image 3: JDBC Connection Configuration
A sample connection string looks as follows.
Code 4: Example of JDBC connection string.
jdbc:databricks://********.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/******;ConnCatalog=samples;ConnSchema=tpch
Image 4: Thread Group configuration
In this example, the Thread Group is configured to simulate 10 users, with each user executing 2 sample queries sequentially. The loop count is set to 3, meaning that each simulated user will execute the sample query 3 times.select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= date '1995-01-01'
and l_shipdate < date '1995-01-01' + interval '1' year
and l_discount between 0.08 - 0.01 and 0.08 + 0.01
and l_quantity < 25;
While the Basic Test Plan enables test practitioners to conduct simple performance tests, it may not offer the extensive flexibility often required for comprehensive performance testing. Therefore, we recommend an Advanced Test Plan, which provides additional flexibility and more configuration options.
In the Databricks SQL Query History we often find that queries hit the Query Result Cache, resulting in extremely fast responses. While this technique benefits end-user experience in production environments, it is typically not ideal for load/performance testing. In such tests we aim to test the actual query performance, not the performance of the cache. To eliminate the impact of Query Result Cache we can disable it. This can be achieved through two approaches listed below.
To disable the Query Result Cache at query level we can use the following command which is to be executed prior to any actual test query.
SET use_cached_result = false;
To ensure this command is executed before actual test queries, we need to add a JDBC PreProcessor task in JMeter. This can be achieved by navigating from 'Thread Group', then selecting 'Add', followed by 'Pre Processors', and finally choosing 'JDBC PreProcessor'. As illustrated in image 5
Image 5: Disable Query Result Cache using JDBC PreProcessor
Although the previously mentioned method is effective as it is self-explanatory and visual, it has a performance impact since every call requires a network hop and few extra milliseconds to execute the statement on the Databricks SQL end.
We can save this time by leveraging a connection string parameter to disable the Query Result Cache (QRC) at connection level.
jdbc:databricks://******.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/******;ConnCatalog=samples;ConnSchema=tpch;use_cached_result=false;
By leveraging this technique, we can now disable or completely remove the “Disable QRC” JDBC PreProcessor step, as QRC is already disabled at the connection level for all queries which use that specific JDBC connection.
In the previous blog post we discussed Disk Cache and its impact on SQL query performance. From a performance testing perspective, Disk Cache is significant because during cold runs (when the cache is not yet populated) we may observe significantly lower performance, which would impact overall results.
Image 6: IO statistics when running a query on a cold cache
Therefore, we may want to warm up the Disk Cache before executing any test queries. This can be achieved by leveraging the so-called setUp Thread Group, which is executed only once before any other Thread Groups are run. The process involves the following steps:
CACHE SELECT * FROM lineitem;
Image 7: Disk Cache warm-up
When running the entire test plan, we can observe the queries hitting the Disk Cache by checking the “Bytes read from cache” metric in the Query History pane of the query.
Image 8: IO statistics when running a query on a warm cache
Now, let’s examine the following query.
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= date '1995-01-01'
and l_shipdate < date '1995-01-01' + interval '1' year
and l_discount between 0.08 - 0.01 and 0.08 + 0.01
and l_quantity < 25;
Note the filter on the ship date column (l_shipdate). Currently, the query uses a fixed internal. But what if we want to randomize the interval each time the query is executed?
To accomplish this, we can include a Random Variable configuration element to the respective JDBC request in our Test Plan.
Image 9: Random Variable parameters configuration
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= date '1995-01-01'
and l_shipdate < date '1995-01-01' + interval '${YEAR}' year
and l_discount between 0.08 - 0.01 and 0.08 + 0.01
and l_quantity < 25;
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= date '${__RandomDate(,1995-01-01,2002-01-01,,sampleDate)}'
and l_shipdate < date '${sampleDate}' + interval '${YEAR}' year
and l_discount between 0.08 - 0.01 and 0.08 + 0.01
and l_quantity < 25;
Image 10: Test Plan with parameterized query
Now when we have implemented the entire workload in our Test Plan, it’s a good practice to parameterize it to enable quick changes of all parameters in a single location.
To do this, let’s introduce several User Defined Variables at the root Test Plan level and update the respective Test Plan items accordingly.
Image 11: Test Plan with User Defined Variables
By applying the parameterization technique, the JDBC connection string should look like the following:
jdbc:databricks://${HostName}:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=${HttpPath};ConnCatalog=${Catalog};ConnSchema=${Schema};use_cached_result=${UseCachedResult};
While the Thread Group settings should look like the following:
Image 12: Parameterized Thread Group
At this point, we can easily find and change all important configuration items at the root of the Test Plan. For example, we can easily:
The approach mentioned above with User Defined Variables works well when using JMeter’s GUI. However, in real testing scenarios, it is recommended to use the command line version of JMeter to run the tests while eliminating the impact of the GUI. Nevertheless, JMeter CLI supports what are known as ‘parameters’ which are not the same as the ‘variables’ we previously configured.
To utilize parameters, we need to change the way we configure User Defined Parameters. It should now follow the format below:
${__P(parameter_name,default_value)}
Here:
Therefore, the User Defined Variables for the Test Plan should appear as depicted in image 12.
Image 13: User Defined Variables using Parameters
Now we can specify configuration parameters directly in the command line. In the following example we specify two parameters - UseCachedResult and NumberOfUsers.
jmeter -n -t "Advanced Configuration.jmx" -JUseCachedResult=false -JNumberOfUsers=10 -l "Advanced Configuration log.csv"
When running JMeter from the command line, it is convenient to use the ‘user.properties’ file to provide parameter values in an easy-to-read key and value format within a text file. Building on the previous example, the file would look as follows:
HostName=******.azuredatabricks.net
HttpPath=/sql/1.0/warehouses/******
Catalog=samples
Schema=tpch
UseCachedResult=false
NumberOfThreads=5
LoopCount=100
Now, we can directly specify the properties file in the command line. This simplifies the process of conducting performance tests against multiple configurations by using different properties files.
jmeter -n -t "Advanced Configuration.jmx" -p user.properties
Please note the importance of the case sensitivity; ‘-p’ and ‘-P’ are two distinct command line options.
In conclusion, this blog post has delved into different approaches to implement performance-based test plans leveraging Apache JMeter capabilities when conducting basic and advanced performance test plans against Databricks SQL Warehouse workloads. This integration with commonly used open-source tools demonstrates the flexibility and adaptability of Databricks SQL Warehouse.
Data engineers and testers can meticulously tailor their test plans to replicate real-world scenarios, comprehensively evaluate system performance, and pinpoint potential bottlenecks using parameterization and user-defined variables.
JMeter's flexibility, inherent scalability, and seamless integration with Databricks SQL Warehouse position it as an optimal tool for experimenting with the robustness and efficiency of data-driven applications.
Furthermore, we have demonstrated how to implement test repeatability in Apache JMeter, allowing testers to run the same test scenarios consistently and observe how the system behaves over time.
Its built-in metrics and reporting features provide insightful data on various performance aspects, enabling data engineers to make informed decisions and continuously improve system performance.
In summary, Databricks SQL Warehouse's inherent scalability and seamless integration with widely used performance and testing tools tangibly denotes, through testing and experimenting, its robustness and efficiency for intense data-driven applications.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.