In the fast-paced world of big data, optimizing performance is critical for maintaining efficiency and reducing costs. Databricks SQL (DBSQL) Warehouse is a robust feature of the Databricks platform that enables data analysts, data engineers, and data scientists to run SQL queries on large datasets efficiently. However, without proper tuning, execution times can lag, leading to wasted resources and slower insights.
To promote performance tuning, reduce execution time, and improve utilization, DBSQL provides the Databricks Query Profile, a tool that offers deep insight into query execution. By leveraging Query Profile, data engineers and analysts can identify bottlenecks, optimize query performance, and reduce overall execution time. This article explores how harnessing the full potential of Databricks Query Profile can transform your workflows, leading to faster data processing and more timely decision-making.
This article focuses on familiarizing users with DBSQL query profiling through examples and on showing how it enables users to better understand their queries.
To effectively leverage the power of Databricks Query Profile, it's essential to understand the context in which it operates: the Databricks SQL Warehouse. Designed for high-performance data analytics, the SQL Warehouse provides a scalable environment that allows users to run complex queries against large datasets seamlessly.
DBSQL, or Databricks SQL, is a SQL-based interface offered by Databricks that enables users to run SQL queries directly against data. It provides a powerful way to analyze large datasets using familiar SQL syntax, making it accessible for data analysts and data scientists.
To read more about this offering, please visit here. To learn more about SQL warehouses, please visit here.
In Databricks SQL (DBSQL), a Query Profile is a detailed report that provides insights into the execution of a specific SQL query. It helps users understand how their queries are performing and identifies potential bottlenecks or inefficiencies.
By analyzing the Query Profile, users can optimize their SQL queries for better performance and efficiency, ultimately leading to faster data processing and more effective data analysis.
You can use a query profile to visualize the details of a query execution and to troubleshoot performance bottlenecks during the query’s execution. When deciding which tool to reach for:
Use DBSQL Query Profile if you’re focused on SQL query performance and need an accessible, user-friendly interface with visual insights.
Use Spark UI for a more comprehensive view of Spark applications, especially when working with complex jobs that involve various Spark operations beyond SQL.
** Additionally, the Spark UI can be accessed, if needed, through the option available at the top of the query. Please note that this option is not available on Serverless.
Navigating to the Query Profile is straightforward: users can access it directly through the Databricks UI, starting from the Query History page.
The query history shows SQL queries performed using SQL warehouses. The query history holds query data for the past 30 days, after which it is automatically deleted.
You can get to query history from the Query History option in the workspace sidebar.
** If you’re a non-admin user without CAN_MANAGE permissions, you can only view your own queries in Query History.
Click on any query in the Query History list.
Let’s look at a simple example of how query profile can help with performance tuning.
For this use case, we will consider the query below, running on a Small SQL warehouse and scanning a Delta table of around 2.07 GB with a filter condition that returns only 3 rows.
SELECT *
FROM test.data.trip
WHERE hvfhs_license_num = 'HV0003'
  AND PULocationID = 199
Here, when we first focus on the scan operation, we see that 70.59M rows are being scanned, which indicates that the data skipping algorithm is not able to optimize the scan. To dive deeper into this scan operation, we click on the Scan node in Graph view.
This gives us a clear picture that the algorithm is not able to prune the dataset properly, resulting in a large scan output. To solve this problem, let's optimize the table by applying Z-ordering on the relevant columns. If you expect a column to be commonly used in query predicates and that column has high cardinality, it is a good candidate for ZORDER BY.
OPTIMIZE test.data.trip
ZORDER BY (hvfhs_license_num, PULocationID);
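If you want to confirm the effect of the OPTIMIZE before re-running the query, the Delta table history records the operation and its metrics. The snippet below is a minimal sketch against the same table used above; the exact metric names in the output may vary by Delta version.

-- Inspect the table history: the most recent entry should show
-- operation = 'OPTIMIZE', with the Z-order columns listed in
-- operationParameters and file counts in operationMetrics.
DESCRIBE HISTORY test.data.trip;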
Now let's execute the same SELECT query and check the profile.
We see a huge reduction in scan time (from 1.81 min to 660 ms) thanks to better pruning, which is clearly visible in the “Bytes pruned” metric, resulting in a much more efficient filter and a shorter overall runtime.
You can look at the operations in either Graph view or Tree view to get a high-level idea of the execution: each view shows the time taken by every operation along with other useful details.
Please note that you can view a much more detailed list of metrics by enabling Verbose mode as shown below.
As discussed in the Query History section, non-admin users need CAN_MANAGE permission on a warehouse to view queries other than their own in its query history. In cases where you need to share a query profile with users without granting them that access, Query Profile provides options to export and import the profile as JSON.
To download the shareable JSON version of your query profile, simply select the query on the Query History page and select Download from the options at the top left, as shown below.
This JSON can then be imported by other users on their Query History page. To import a query profile from JSON, simply select Import query profile (JSON) from the options at the top left of the Query History page as shown below, select the JSON file, and the query profile will be imported.
Databricks offers an analytical store within the system schema that allows for historical observability across your account. For more details on system tables, check out here. Be sure to review the Requirements section to understand how to enable system tables.
Within our scope specifically, we want to look at the system table that tracks records of all queries run on SQL warehouses: system.query.history. It is available in Public Preview and includes account-wide records from all workspaces in the same region from which you access the table. The schema of this table is discussed in more detail here.
Overall, this table provides detailed information about executed queries and can be used to dig deeper: analyzing query history, identifying usage patterns and trends, informing capacity planning and resource allocation, or defining detailed metrics to track performance across a group of queries.
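As a minimal sketch of the kind of analysis this enables, the query below surfaces the slowest statements from the last 7 days. It assumes system tables are enabled in your account; the column names used here (statement_id, executed_by, total_duration_ms, statement_text, start_time) follow the documented schema, but verify them against the schema reference linked above before relying on them.

-- Find the 20 slowest statements run in the last 7 days across the account.
SELECT
  statement_id,
  executed_by,
  total_duration_ms / 1000 AS total_duration_s,
  statement_text
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 7 DAYS
ORDER BY total_duration_ms DESC
LIMIT 20;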
Additionally, Databricks offers this information through the List Queries API. For more insights, please refer to this link.