Databricks Community

MarcoScagliola · ‎04-17-2024

Authors: Andrey Mirskiy (@AndreyMirskiy) and Marco Scagliola (@MarcoScagliola)

Welcome to the fourth part (#4) of our blog series on “Why Databricks SQL Serverless is the best fit for BI workloads”.

In the previous blog posts we have covered the following topics:

In this blog post, we will discuss the Remote Query Result Cache (Remote QRC) feature. It is designed to enhance the performance of Databricks SQL Serverless Warehouses by accelerating the execution of repetitive queries and storing their results on remote storage.

Databricks SQL Serverless is designed to scale based on actual workload, ensuring cost-effectiveness by avoiding over-provisioning resources when they are not needed while maintaining high performance during peak demand by dynamically adding resources. However, when a cluster is resized or a new cluster is added to a warehouse, an empty query result cache leads to lower performance. Remote Query Result Cache aims to address this issue. It ensures that the performance of repetitive queries is not compromised, providing a seamless and efficient user experience.

Remote Query Result Cache Overview

This serverless-only caching technique, known as the Remote Result Cache, significantly improves query performance by storing results in dedicated cloud storage. Unlike in-memory query result caching, which only retains query results as long as the compute resources are active, the Remote Result Cache provides a persistent and shared cache. This cache is accessible across all serverless clusters and SQL Warehouses within a Databricks Workspace, effectively overcoming the common challenge of maintaining query result availability. It ensures that cached results are persistently accessible, independent of the state of the compute resources.

Demo Scenario

We have developed a scenario to showcase the Remote QRC feature. A key prerequisite for this demonstration is the use of a Serverless SQL Warehouse. In image 1, we can see the JMeter UI detailing the test plan steps designated to demonstrate this Serverless-only feature.

Image 1: JMeter configuration pane for the Remote QRC test case. Image 1: JMeter configuration pane for the Remote QRC test case.

In the first step, we configured a SQL Warehouse to Small cluster size as a preparation step.

Next, we executed a sample query on this small SQL Warehouse. Because this is the very first query execution and the Warehouse has just started, the engine is unable to leverage previously cached results. The overall duration of the query was 6 seconds and 179 milliseconds, as illustrated in Image 2. Being this was a 'cold run' with no result in the cache, please note that the IO metrics in Image 3 are not empty, indicating that data was retrieved from storage to fulfill our query.

Image 2: Wall-clock duration metrics with Small Serverless SQL Warehouse, ‘cold run’. Image 2: Wall-clock duration metrics with Small Serverless SQL Warehouse, ‘cold run’.

Image 3: Query execution metrics with Small Serverless SQL Warehouse, ‘cold run’. Image 3: Query execution metrics with Small Serverless SQL Warehouse, ‘cold run’.

In the next step of our testing, we restarted the Databricks SQL Serverless Warehouse to ensure that the local cache was emptied, and all query results were removed. Then, we reran the same query.

The running metrics, illustrated in Image 4, recorded an overall duration of 4 seconds and 371 milliseconds for the query to be executed. Moreover, a significant observation from the IO metrics shown in Image 5 is that they are empty, meaning that the result was fetched from the cache instead of actually executing the query. This outcome highlights the efficiency of leveraging remote query result cache for query results even after a cluster restart.
Image 4: Wall-clock duration metrics with Small Serverless SQL Warehouse after warehouse restart. Image 4: Wall-clock duration metrics with Small Serverless SQL Warehouse after warehouse restart.
Image 5: Query execution metrics with Small Serverless SQL Warehouse after warehouse restart. Image 5: Query execution metrics with Small Serverless SQL Warehouse after warehouse restart.

In the next step of our test, we resized the Serverless SQL Warehouse from Small to Medium. Then, we reran the same query, once again. As shown in Image 6, the overall duration of the query took 4 seconds and 71 milliseconds.

Notably, the IO metrics presented in Image 7 were empty; this indicates that the engine retrieved the result from the remote query result cache without actually executing the query.

As we can observe Databricks SQL Serverless is capable of leveraging cached results even after resizing the SQL Warehouse.

Image 6: Wall-clock duration metrics after resizing Serverless SQL Warehouse to Medium Image 6: Wall-clock duration metrics after resizing Serverless SQL Warehouse to Medium
Image 7: Query execution metrics after resizing Serverless SQL Warehouse to Medium. Image 7: Query execution metrics after resizing Serverless SQL Warehouse to Medium.

Ultimately, the final test clearly demonstrates that Databricks SQL Serverless harnesses the Remote Query Result Cache (QRC) feature, which is communal across clusters and Warehouses. This feature efficiently stores and shares query results, ensuring their availability even when a SQL Warehouse experiences resizing, which is very typical for high-concurrency BI workloads.

This feature not only enhances query results retrieval efficiency but also ensures consistent performance across various operational scenarios, showcasing the robustness of Databricks SQL Serverless Warehouse in managing BI workloads seamlessly.

Conclusion

In conclusion, the Remote Query Result Cache (QRC) feature within Databricks SQL Warehouse Serverless presents a transformative approach to how the cluster enhances data retrieval efficiency and ensures consistent performance for BI workloads.

Through a series of repeatable tests, it has been demonstrated that the Remote QRC not only enhances the efficiency of query execution by leveraging cached results but also ensures that these efficiencies are maintained across different operational scenarios, including cluster resizing and restarts. This feature effectively minimizes reliance on disk I/O and cloud operations, thereby accelerating data retrieval times and ensuring that BI workloads can be processed more rapidly and with greater cost efficiency.

By sharing cached results among clusters and warehouses, Databricks SQL Serverless facilitates a seamless, scalable, and more resilient data architecture. This capability is indispensable for organizations looking to optimize their data analytics pipelines.

The implications of these findings underscore the importance of embracing serverless architectures that incorporate intelligent caching mechanisms to drive analytics innovation and operational excellence.

For more information on Remote Result Caching, refer to the documentation (Azure | AWS | GCP).

Resources

The JMeter artifacts featured in this blog are available in the following GitHub repository.

Databricks Community

Why Serverless Databricks SQL is the best for BI workloads: Part IV - Remote Query Result Cache

Remote Query Result Cache Overview

Demo Scenario

Conclusion

Resources

Metadata-Driven ETL Framework in Databricks (Part-1)

Best practices for safe data experimentation with Databricks

Top 10 query performance tuning tips for Databricks Serverless SQL