rakhidarshi, Databricks Employee

Introduction

In the fast-paced world of big data, optimizing performance is critical for maintaining efficiency and reducing costs. Databricks SQL (DBSQL) Warehouse is a robust feature of the Databricks platform that enables data analysts, data engineers, and data scientists to run SQL queries on large datasets efficiently. However, without proper tuning, execution times can lag, leading to wasted resources and slower insights.

To promote performance tuning, reduce execution time, and improve utilization, DBSQL provides the Databricks Query Profile, a tool that offers deep insights into query execution. By leveraging Query Profile, data engineers and analysts can identify bottlenecks, optimize query performance, and reduce overall execution time. This article explores how harnessing the full potential of Databricks Query Profile can transform your workflows, leading to faster data processing and more timely decision-making.

What are we going to discuss?

This article focuses on familiarizing users with DBSQL query profiling through examples and showing how it enables them to better understand their queries. Here we will:

  • Briefly touch on DBSQL and Query Profile
  • Navigate to the query profile for an executed query
  • Discuss a simple performance tuning use case
  • Explore profile metrics

Introducing Databricks SQL

To effectively leverage the power of Databricks Query Profile, it's essential to understand the context in which it operates: the Databricks SQL Warehouse. Designed for high-performance data analytics, the SQL Warehouse provides a scalable environment that allows users to run complex queries against large datasets seamlessly.

DBSQL, or Databricks SQL, is a SQL-based interface offered by Databricks that enables users to run SQL queries directly against data. It provides a powerful way to analyze large datasets using familiar SQL syntax, making it accessible for data analysts and data scientists.
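As a quick illustration, here is a minimal DBSQL query. This sketch assumes the built-in samples catalog that ships with many Databricks workspaces; availability of samples.nyctaxi.trips may vary by workspace.

-- Minimal DBSQL example: top pickup ZIP codes by trip count.
-- Assumes the built-in samples catalog is available in your workspace.
SELECT pickup_zip, COUNT(*) AS trip_count
FROM samples.nyctaxi.trips
GROUP BY pickup_zip
ORDER BY trip_count DESC
LIMIT 10;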

To read more about this offering, see the Databricks SQL documentation. To understand more about SQL warehouses, see the SQL warehouse documentation.

Introducing Query Profile

In Databricks SQL (DBSQL), a Query Profile is a detailed report that provides insights into the execution of a specific SQL query. It helps users understand how their queries are performing and identifies potential bottlenecks or inefficiencies.

[Screenshot 2.png: Query Profile overview]

Key components of a Query Profile include:

  1. Execution: This outlines how the query was executed, including the stages, the sequence of operations performed, and the execution time for each.
  2. Metrics: This includes important metrics such as total execution time, CPU time, I/O time, and shuffle time, allowing users to gauge resource usage.
  3. Caching Information: Details about whether any data was cached during execution, which can explain performance improvements.
  4. Stage Breakdown: A breakdown of the various stages of the query execution, showing how much time was spent in each stage, available in different views.
  5. Process Details: Details about spilling, files and partitions used, pruning, and more.

By analyzing the Query Profile, users can optimize their SQL queries for better performance and efficiency, ultimately leading to faster data processing and more effective data analysis.

How does Query Profile help?

You can use a query profile to visualize the details of a query execution. The query profile helps you troubleshoot performance bottlenecks during the query’s execution. For example:

  1. You can visualize each query task and its related metrics, such as the time spent, the number of rows processed, and memory consumption.
  2. You can identify the slowest part of a query execution at a glance and assess the impact of modifications to the query.
  3. You can discover and fix common mistakes in SQL statements, such as exploding joins or full table scans (see the sketch after this list).
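For instance, an exploding join typically comes from joining on a non-unique key. The sketch below uses hypothetical tables and columns purely for illustration.

-- Hypothetical example: if `city` is not unique on either side,
-- this join multiplies rows (a many-to-many, "exploding" join).
SELECT o.order_id, s.store_name
FROM orders o
JOIN stores s
  ON o.city = s.city;

-- Joining on the actual key keeps the result one row per order.
SELECT o.order_id, s.store_name
FROM orders o
JOIN stores s
  ON o.store_id = s.store_id;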

How is Query Profile different from Spark UI?

Use DBSQL Query Profile if you’re focused on SQL query performance and need an accessible, user-friendly interface with visual insights.

Use Spark UI for a more comprehensive view of Spark applications, especially when working with complex jobs that involve various Spark operations beyond SQL.

**Additionally, the Spark UI can be accessed, if needed, through the option available at the top of the query. Please note that this option is not available on Serverless.

[Screenshot 3.png: Accessing the Spark UI from the top of a query]

Accessing Databricks Query Profile

Navigating to the Query Profile is straightforward; it can be accessed directly through the Databricks UI via:

Query History:

The query history shows SQL queries performed using SQL warehouses. It holds query data for the past 30 days, after which the data is automatically deleted.

[Screenshot 4.png: Query History page]

You can get to query history by:

  • Clicking Query History in the sidebar, where you can filter on attributes such as user, compute, and time period.

[Screenshot 5.png: Query History in the sidebar with filters]

  • Through the SQL warehouse monitoring screen:
    • Click the Running queries or Queued queries count.

[Screenshot 6.png: SQL warehouse monitoring screen]

** If you’re a non-admin user without CAN_MANAGE permissions, you can only view your own queries in Query History.

Query execution in a notebook:

  • Run your query: Begin by executing your SQL query in a notebook.
  • Click "See performance" and then click your query.

[Screenshot 7.png: The "See performance" option in a notebook]

 

View a Query Profile:

[Screenshot 8.png: Query details view]

Click on any query in the Query History list.

  • Brief information about a query’s performance appears, such as time spent in each task, rows returned, and I/O performance.
  • For more detailed information about the query’s performance, including its execution plan, click View Query Profile at the bottom of the page.

Profiling Queries Effectively

Let’s look at a simple example of how the query profile can help with performance tuning.

For this use case, we will consider the query below running on a Small SQL warehouse, scanning a Delta table of around 2.07 GB with a filter condition that returns only 3 rows.

 

SELECT *
FROM test.data.trip
WHERE hvfhs_license_num = 'HV0003'
  AND PULocationID = 199;
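As an aside, one way to confirm a table's size before tuning is Delta's DESCRIBE DETAIL command, which reports fields such as sizeInBytes and numFiles; a minimal sketch against the same table:

-- Reports table-level metadata, including sizeInBytes and numFiles;
-- one way to arrive at the ~2.07 GB figure mentioned above.
DESCRIBE DETAIL test.data.trip;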

 

[Screenshot 9.png: Query profile for the example query]

Here, when we first focus on the scan operation, we see that 70.59M rows are being scanned, which indicates that the data skipping algorithm is not able to optimize the scan. To dig deeper into this scan operation, we click the Scan node in the Graph view.

[Screenshot 10.png: Scan node details in Graph view]

This gives us a clear picture that the algorithm is not able to prune the dataset properly, resulting in a large scan output. To solve this problem, let's optimize the table by applying Z-ordering on the relevant columns. If you expect a column to be commonly used in query predicates and that column has high cardinality, it is a good candidate for Z-ordering.

 

OPTIMIZE test.data.trip
ZORDER BY (hvfhs_license_num, PULocationID);

 

Now let's execute the same SELECT query and check the profile.

[Screenshot 11.png: Query profile after OPTIMIZE]

We see a huge reduction in scan time (from 1.81 min to 660 ms) due to better pruning, which is clearly visible in the "Bytes pruned" metric, resulting in a much more efficient filter and overall runtime.
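To verify that the OPTIMIZE ran and inspect its effect on file layout, Delta's DESCRIBE HISTORY command can help; a minimal sketch against the same table (exact metric names may vary by Delta version):

-- Shows recent table operations; the OPTIMIZE entry records the
-- zOrderBy columns and metrics such as numRemovedFiles and numAddedFiles.
DESCRIBE HISTORY test.data.trip LIMIT 5;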

Exploring the profile metrics

Visual representation:

[Screenshot 12.png: Graph view]
[Screenshot 13.png: Tree view]

You can look at the operations in either the Graph view or the Tree view to get a high-level idea of the execution. Each view indicates the time taken by each operation, along with other useful details.

Wall-clock duration breakdown:

[Screenshot 14.png: Wall-clock duration breakdown]

  • Wall-clock duration - The elapsed wall-clock time between the start of scheduling and the end of query execution.
  • Waiting for Compute - Time spent waiting for compute resources to be provisioned, e.g., due to SQL warehouse creation, resume, or resize.
  • Waiting In Queue - Time spent waiting for available compute capacity; all provisioned clusters were busy and could not take more queries.
  • Optimizing query & pruning files - Time spent improving query performance with data skipping, query compilation, and security checks.
  • Executing - The time spent on execution, including time spent scanning source data.
  • Result fetching by client - The time it took the client that initiated the execution request to retrieve the results.

Top level profile metrics

[Screenshot 15.png: Top-level profile metrics]

  • Tasks total time - This represents the total time spent executing all tasks for the query. It is the sum of the execution times of all individual tasks.
  • Tasks time in Photon - This indicates the percentage of the total task time that was executed using Photon, Databricks' native vectorized query engine designed to accelerate query performance.
  • Rows returned - The total number of rows returned by the query. In this case, only one row was returned.
  • Rows read - The total number of rows read from the data source(s) during the execution of the query.
  • Bytes read - The total amount of data read from the data source(s).
  • Bytes read from cache - The percentage of data read from the cache. A value of 100% indicates that all the data read was fetched from the cache, which can significantly speed up query performance.
  • Bytes written - The total amount of data written to storage during the execution of the query. In this case, no data was written.
  • Files read - The number of files read from the data source(s) during the execution of the query.
  • Partitions read - The number of partitions read from the data source(s). In this case, no partitions were read.
  • Bytes spilled to disk - The amount of data temporarily written to disk during the execution of the query. Spilling occurs when there is not enough memory to hold intermediate data. In this case, no data was spilled to disk.

Operator level profile metrics

[Screenshot 16.png: Operator-level metrics for the Scan node]

  • Time spent - The total time taken to complete the scan operation.
  • Rows - The total number of rows in the table being scanned.
  • Cache misses size - The amount of data that was not found in the cache and had to be read from the underlying storage. Here, 2.07 GB of data was read from storage because it was not in the cache.
  • Cloud storage request duration - The total time spent on making requests to cloud storage.
  • Filesystem read data size - The amount of data read from the filesystem. This is zero, indicating that all data was read from cloud storage rather than a local filesystem.
  • Metadata time - The time spent on reading metadata. In this case, no time was spent on metadata operations.
  • Scan time - The total time taken to perform the scan operation, which matches the overall time spent.
  • Size of files read - The total size of all files read during the scan.

Please note that you can view a much more detailed list of metrics by enabling Verbose mode as shown below.

[Screenshot 17.png: Enabling Verbose mode]

Sharing query profiles:

As discussed in the Query History section, non-admin users are required to have CAN_MANAGE permissions on the warehouse to view its query history. In cases where a query profile needs to be shared with users without granting them that access, Query Profile provides options to export and import the profile as JSON.

To download the shareable JSON version of your query profile, select the query on the query history page and select Download from the options at the top left, as shown below.

[Screenshot 18.png: Downloading a query profile as JSON]

This JSON can be imported by other users on their query history page. To import a JSON query profile, select Import query profile (JSON) from the options at the top left of the query history page, as shown below. Select the JSON file, and the query profile will be imported.

[Screenshot 19.png: Importing a query profile from JSON]

Query history system table:

Databricks offers an analytical store within the system schema that allows for historical observability across your account. For more details, see the system tables documentation, and be sure to review its Requirements section to understand how to enable system tables.

In our scope specifically, we want to discuss one of the system tables that tracks records of all queries run on SQL warehouses: system.query.history. It is available in public preview and includes account-wide records from all workspaces in the same region from which you access the table. The schema of this table is discussed in more detail in the documentation.

Overall, this provides detailed information about executed queries and can be used to dive deep: analyzing query history, identifying usage patterns and trends, informing capacity planning and resource allocation, or defining detailed metrics to track performance across a group of queries.
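As an example, here is a sketch of how you might surface the slowest recent statements from this table. Column names follow the public-preview schema (statement_id, executed_by, statement_text, total_duration_ms, start_time) and may change as the preview evolves.

-- Ten slowest queries over the last 7 days, by total duration.
SELECT
  statement_id,
  executed_by,
  total_duration_ms,
  LEFT(statement_text, 100) AS statement_preview
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 7 DAYS
ORDER BY total_duration_ms DESC
LIMIT 10;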

Additionally, Databricks offers this information through the List Queries API; for more insights, please refer to the Databricks REST API reference.