Databricks Serverless SQL (DBSQL) is the latest offering from Databricks for building data warehouses on the Lakehouse. It incorporates all the Lakehouse features, such as open formats, unified analytics, and a collaborative platform across the different data personas within an organisation. In the last couple of quarters we have seen tremendous growth and adoption of DBSQL, with many customers migrating from other cloud data warehouses, and one of the keys to their success is the serverless nature of DBSQL, which provides instant and elastic compute, lower total cost of ownership, and reduced management overhead. You can further improve your experience with Databricks SQL Serverless through optimisations and performance tuning applied by the different data personas.
In this blog post, I will share the Top 10 query performance tuning tips that Data Analysts and other SQL users can apply to improve DBSQL Serverless performance.
Tip 1 - Avoid SELECT *

SELECT * FROM my_table retrieves all the columns and rows from the given table, an I/O-intensive operation for a large table. We usually run this simple query just to identify the column names of a table or to profile sample data. This naive approach can create bottlenecks in query execution as a growing number of users simultaneously profile tables in this fashion. It is always recommended to select only the columns required for your analysis instead of fetching all of them. You can also use the Data Explorer option in the Databricks console to inspect the column details and analyse a sample data set, which helps you avoid running SELECT * on tables altogether.
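As a minimal sketch (every column other than TripID and VendorID is assumed for illustration), the projection would look like this:

-- Instead of scanning every column:
-- SELECT * FROM nyc_taxi_trp_detail_integer;

-- Select only the columns needed for the analysis:
SELECT TripID, VendorID, trip_distance
FROM nyc_taxi_trp_detail_integer;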
Tip 2 - Reduce the data returned with LIMIT and filters

LIMIT - The LIMIT clause restricts the number of rows returned by a query, and we mostly use it for data profiling or for retrieving arbitrary subsets of data. The DBSQL Query Editor adds LIMIT 1000 to all queries by default. Limiting the number of rows returned can improve performance: by retrieving only the necessary subset of rows, the database engine avoids unnecessary processing and reduces the amount of data transferred over the network. Large result sets can also consume a significant amount of memory, especially if they are not needed in their entirety, so LIMIT lets you control the memory required to hold the query result (a combined example follows below).
FILTER - The WHERE clause enables you to select specific rows from a table based on certain criteria. By specifying filters, you retrieve only the data that meets the desired conditions, making your query results more relevant and meaningful. You can also leverage Z-Ordering to reduce the data scanned and thus improve query performance: Z-ORDER is particularly beneficial for workloads that involve range-based queries, such as filtering on a specific date range, a numerical range, or other column-based ranges. For more detailed information on Z-ORDER, please visit the documentation.
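As a small sketch (the pickup_datetime column is assumed for illustration), a filter combined with a LIMIT keeps both the scan and the result set small:

-- Filter down to the rows of interest, then cap the result for profiling:
SELECT TripID, VendorID
FROM nyc_taxi_trp_detail_integer
WHERE pickup_datetime >= '2023-01-01'  -- hypothetical date column
LIMIT 100;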
Tip 3 - Prefer integer keys over string keys

When designing a database table, choosing between integer and string data types depends on several factors, including the nature of the data, the usage patterns, and the specific requirements of your application. A common choice made by customers is to opt for HASH-function-based alphanumeric keys because they are deterministic. From a storage perspective, however, those alphanumeric keys are strings, which require far more storage than integers. This can slow down the data import into your BI tools and thus degrade report performance.
Here is an example using the NYC Taxi Trip Data, which has 100+ million rows. I have two tables: nyc_taxi_trp_detail_string with TripID as a string and nyc_taxi_trp_detail_integer with TripID as an integer. The table with the integer key is around 3GB, whereas the one with the string key is an astonishing 7GB.
Ultimately, the choice between integer or string columns depends on the specific requirements and characteristics of your data and application. In some cases, a combination of both types may be appropriate, such as using an integer surrogate key for internal referencing and a string natural key for external interactions. It's important to consider factors like performance, data integrity, integration needs, and the meaningful representation of your data when making this decision.
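As a hedged sketch of the surrogate-key option mentioned above (the table and columns here are illustrative, not the tables used in the comparison), a Delta table can generate an integer key for you with an identity column:

-- An integer surrogate key generated by Delta, instead of a string hash key:
CREATE TABLE trips_demo (
  TripID BIGINT GENERATED ALWAYS AS IDENTITY,  -- internal surrogate key
  TripCode STRING,                             -- natural key for external use
  VendorID INT
);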
Tip 4 - Take advantage of caching

Caching in DBSQL can significantly improve the performance of iterative or repeated computations by reducing the time required for data retrieval and processing. It allows you to avoid redundant computation by reusing previously calculated results, thereby reducing the overall processing time of your queries. However, it's important to use caching judiciously and to consider the memory requirements of your workload, so as to avoid excessive memory usage or potential out-of-memory issues. Databricks offers a wide variety of caching mechanisms, and DBSQL leverages those features.
As a SQL user, you can cache the data of your most frequently used tables or queries in the disk cache with a CACHE SELECT statement:
CACHE SELECT * FROM boxes;
CACHE SELECT width, length FROM boxes WHERE height=3;
You don't need to run this command for the disk cache to work correctly; the data is cached automatically when first accessed. The command can be helpful, however, when you require consistent query performance.
This construct is applicable only to Delta tables and Parquet tables.
Tip 5 - Prefer CTEs over repeated subqueries

CTEs (Common Table Expressions) and subqueries are both powerful techniques in SQL for expressing complex queries and organising data. While they can achieve similar results, subqueries can sometimes be less performant than CTEs, especially in complex queries or over large datasets: the optimiser may choose different execution plans for subqueries than for CTEs, and depending on the specific query it may or may not treat the two equivalently in terms of optimisation and performance. In addition, CTEs can improve the readability and maintainability of complex queries by breaking them down into smaller, more manageable parts, and they provide a way to give meaningful names to intermediate result sets, making the query logic easier to understand. A CTE can be referenced multiple times within the same query, enabling code reuse and avoiding the need to repeat a complex subquery, which leads to more concise and modular queries.
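As an illustrative sketch using the table from the earlier tip, the per-vendor aggregate below is named once as a CTE and referenced twice, where the subquery version would have to repeat it:

-- The aggregate is defined once and reused in two places:
WITH vendor_trips AS (
  SELECT VendorID, count(TripID) AS TotalTrip
  FROM nyc_taxi_trp_detail_integer
  GROUP BY VendorID
)
SELECT VendorID, TotalTrip
FROM vendor_trips
WHERE TotalTrip > (SELECT avg(TotalTrip) FROM vendor_trips);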
Tip 6 - Make the most of Photon

Photon is the vectorised query engine on Databricks, developed in C++ to take advantage of modern hardware, and it is compatible with Apache Spark APIs. DBSQL uses Photon by default, which accelerates query execution for workloads that process a significant amount of data and include aggregations and joins. Most of the DBSQL native functions are supported by Photon; you can read more about its scope in the coverage documentation. User-defined functions (UDFs) are not photonisable yet, so be sure to explore the native, photonisable functions before writing any UDF.
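As a hypothetical illustration (the to_miles function and the trip_distance column are invented for this sketch), a simple calculation written as a Python UDF falls outside Photon, while the equivalent native expression stays photonisable:

-- A Python UDF: not photonisable
-- CREATE FUNCTION to_miles(km DOUBLE) RETURNS DOUBLE
-- LANGUAGE PYTHON AS $$ return km * 0.621371 $$;

-- A native expression: photonisable
SELECT trip_distance * 0.621371 AS trip_miles
FROM nyc_taxi_trp_detail_integer;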
Tip 7 - Use join hints where appropriate

Join hints allow you to explicitly suggest the join strategy that the DBSQL optimiser should use. When different join strategy hints are specified on both sides of a join, Databricks SQL prioritises hints in the following order: BROADCAST over MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL.
When both sides are specified with the BROADCAST hint or the SHUFFLE_HASH hint, Databricks SQL picks the build side based on the join type and the sizes of the relations. Since a given strategy may not support all join types, Databricks SQL is not guaranteed to use the join strategy suggested by the hint.
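As a sketch (the nyc_taxi_vendor dimension table and VendorName column are invented for illustration), a hint is written as a comment immediately after SELECT:

-- Suggest broadcasting the small dimension table to every node:
SELECT /*+ BROADCAST(v) */ t.TripID, v.VendorName
FROM nyc_taxi_trp_detail_integer t
JOIN nyc_taxi_vendor v
  ON t.VendorID = v.VendorID;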
Tip 8 - Run ANALYZE, OPTIMIZE and VACUUM
ANALYZE collects the statistics of a specified table. These statistics are used by the DBSQL query optimiser to generate a better execution plan.
ANALYZE TABLE nyc_taxi_trp_detail_integer COMPUTE STATISTICS;
The OPTIMIZE statement optimises the layout of the Delta Lake data files. Optionally, you can Z-Order the data to colocate related data in the same set of files.
OPTIMIZE nyc_taxi_trp_detail_integer;
OPTIMIZE nyc_taxi_trp_detail_integer ZORDER BY (TripID);
VACUUM removes unused files from the table directory: it deletes all data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold (7 days by default). After you run VACUUM on a Delta table, you lose the ability to time travel back to versions older than the retention threshold, but it helps in reducing the storage cost.
VACUUM nyc_taxi_trp_detail_integer DRY RUN;
VACUUM nyc_taxi_trp_detail_integer;
Tip 9 - Analyse the query plan with EXPLAIN

It is very important to explore and analyse the query plan before executing a query. The plan shows you how the code will actually be executed and is useful for optimising your queries. The DBSQL optimiser automatically generates the most optimised physical plan, which is then executed.
EXPLAIN [EXTENDED | CODEGEN | COST | FORMATTED]
SELECT
  VendorID, count(TripID) AS TotalTrip
FROM
  nyc_taxi_trp_detail_integer
GROUP BY VendorID;
DBSQL also offers Adaptive Query Execution (AQE), which re-optimises the query plan during query execution. This can be very useful when statistics collection is not turned on or when statistics are stale. It also helps where statically derived statistics are inaccurate, such as in the middle of a complicated query or after the occurrence of data skew.
Tip 10 - Format your SQL code

Last but not least, formatting your SQL queries is always a best practice to follow. It won't improve query performance, but well-formatted code can definitely improve your efficiency in understanding someone else's code, debugging, refactoring, or building up new business logic. Common practices include capitalising keywords consistently, placing each major clause on its own line, indenting subqueries and CTEs, and using meaningful aliases, as illustrated below.
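As a small before-and-after sketch of the same query (the WHERE filter is added only for illustration):

-- Unformatted:
-- select vendorid, count(tripid) as totaltrip from nyc_taxi_trp_detail_integer where vendorid in (1, 2) group by vendorid;

-- Formatted:
SELECT
  VendorID,
  count(TripID) AS TotalTrip
FROM
  nyc_taxi_trp_detail_integer
WHERE
  VendorID IN (1, 2)
GROUP BY
  VendorID;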
Implementing these best practices not only enhances query processing speed but also contributes to efficient resource utilisation, scalability, and overall system stability, enabling organisations to handle growing data volumes and user demands.