Optimizing queries in Databricks isn’t just about adding indexes or tweaking SQL syntax — it’s about visibility. You can’t improve what you can’t measure. Fortunately, Databricks provides rich telemetry around query history that you can use to analyze performance, identify bottlenecks, and guide tuning efforts.
In this post, I’ll walk through how you can track query history, gather key statistics, and use them to systematically optimize workloads in Databricks.
Why Track Query History?
Without query history, optimization is reactive: you only troubleshoot when a job is failing or a warehouse is lagging. Query history provides you with:
- Performance baselines: Know the average runtime, rows scanned, and bytes processed.
- Bottleneck detection: Identify queries that consistently take longer or consume disproportionate resources.
- Trend analysis: Spot degradation over time as data grows.
- Cost optimization: Track which queries consume the most DBUs and storage I/O.
Accessing Query History in Databricks
Databricks exposes query history through:
1. System Tables (Recommended)
Databricks System Tables include system.query.history and related views that capture query metadata, performance statistics, and errors.
%sql
SELECT
  statement_text,
  executed_as,
  start_time,
  total_duration_ms,
  produced_rows
FROM system.query.history
WHERE update_time >= CURRENT_DATE() - INTERVAL 7 DAYS
ORDER BY total_duration_ms DESC
LIMIT 10;
2. Query History API
For programmatic access or integration with monitoring tools, Databricks provides REST APIs to fetch query history. These are useful if you want to stream metrics into external systems such as CloudWatch or Datadog.
import requests
import json

workspace_url = ""  # your workspace hostname
pat = ""            # personal access token

uri = f"https://{workspace_url}/api/2.0/sql/history/queries"
headers = {"Authorization": f"Bearer {pat}"}

# Example filter for queries finished within a specific time range
request_body = {
    "filter_by": {
        # "query_start_time_range": {
        #     "start_time_ms": 11755751878,  # example timestamp (milliseconds)
        #     "end_time_ms": 11755834678
        # },
        "statuses": ["FINISHED", "FAILED"]
    },
    "max_results": 100
}

response = requests.get(uri, headers=headers, json=request_body)
if response.status_code == 200:
    query_history = response.json()
    # Process the query_history data
    print(json.dumps(query_history, indent=4))
else:
    print(f"Error: {response.status_code} - {response.text}")
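Once you have the JSON payload, you can post-process it locally before shipping it anywhere. Below is a minimal sketch, assuming the response carries a `res` list whose entries include `query_text`, `status`, and a `duration` in milliseconds (field names per the 2.0 API; the sample payload here is fabricated for illustration):

```python
def slowest_queries(history, n=5):
    """Return the n longest-running finished queries from a
    query-history API payload (a dict with a 'res' list)."""
    finished = [q for q in history.get("res", [])
                if q.get("status") == "FINISHED"]
    # 'duration' is reported in milliseconds
    return sorted(finished, key=lambda q: q.get("duration", 0),
                  reverse=True)[:n]

# Fabricated payload shaped like the API response
sample = {"res": [
    {"query_text": "SELECT 1", "status": "FINISHED", "duration": 120},
    {"query_text": "SELECT *", "status": "FINISHED", "duration": 98000},
    {"query_text": "BAD SQL",  "status": "FAILED",   "duration": 5},
]}
top = slowest_queries(sample, n=1)
```

From here it is a short step to pushing `top` into a metrics sink on a schedule.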
Programmatic Analysis with System Tables
For scalable insights, query the system.query.history table in Unity Catalog, which logs metrics for SQL warehouse and serverless queries.
%sql
-- Top 10 slowest finished queries
SELECT statement_id, statement_text, total_duration_ms
FROM system.query.history
WHERE execution_status = 'FINISHED'
ORDER BY total_duration_ms DESC
LIMIT 10;
-- Average query duration by user
SELECT executed_by, AVG(total_duration_ms) AS avg_duration_ms
FROM system.query.history
GROUP BY executed_by
ORDER BY avg_duration_ms DESC;
-- Queries with high disk spill
SELECT statement_id, statement_text, spilled_local_bytes
FROM system.query.history
WHERE spilled_local_bytes > 1000000000 -- Over 1GB
ORDER BY spilled_local_bytes DESC
LIMIT 5;
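These snippets also feed the trend analysis mentioned earlier. As a plain-Python sketch (helper name and inputs are hypothetical), you can compare per-query average durations across two periods and flag regressions beyond a threshold:

```python
def regressions(prev_avg_ms, curr_avg_ms, threshold=0.25):
    """Flag queries whose average duration grew by more than
    `threshold` (fractional) between two aggregation periods."""
    flagged = {}
    for stmt, curr in curr_avg_ms.items():
        prev = prev_avg_ms.get(stmt)
        if prev and prev > 0:
            growth = (curr - prev) / prev
            if growth > threshold:
                flagged[stmt] = round(growth, 2)
    return flagged

# Made-up averages, as you might aggregate from system.query.history
last_week = {"q1": 1000, "q2": 4000}
this_week = {"q1": 1100, "q2": 9000}
flagged = regressions(last_week, this_week)
```

Here `q2` more than doubled in runtime and gets flagged, while `q1`'s 10% drift stays under the threshold.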
Optimizing Queries with Table Statistics
Databricks’ cost-based optimizer (CBO) uses table statistics to select efficient execution plans, especially for joins. Collect stats with:
%sql
ANALYZE TABLE your_table COMPUTE STATISTICS FOR ALL COLUMNS;
For Unity Catalog Delta tables, enable predictive optimization to automate stats collection. Verify with:
%sql
EXPLAIN SELECT * FROM your_table;
Improving Data Layout with OPTIMIZE and Liquid Clustering
Delta Lake’s OPTIMIZE command compacts small files and enhances data skipping for faster queries:
%sql
OPTIMIZE your_delta_table
WHERE date_partition = '2025-08-20'
ZORDER BY (column1, column2);
This clusters data by column1 and column2, improving filters and joins on those columns. For tables that use liquid clustering (covered below), OPTIMIZE ... FULL reclusters all records, including data that was already clustered (Databricks Runtime 16.0+):
%sql
OPTIMIZE your_delta_table FULL;
Set max file size:
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 104857600) # 100MB
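The magic number 104857600 is just 100 * 1024 * 1024. A tiny helper keeps such size thresholds readable when you tune them (the helper name is my own):

```python
def mib(n):
    """Convert mebibytes to bytes for size-based Spark/Delta configs."""
    return n * 1024 * 1024

max_file_size = mib(100)
# spark.conf.set("spark.databricks.delta.optimize.maxFileSize", max_file_size)
```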
Introducing Liquid Clustering
Liquid clustering, a powerful Delta Lake feature, offers a flexible alternative to traditional partitioning and Z-ordering. Unlike static partitions, liquid clustering dynamically organizes data based on clustering columns, adapting to changing query patterns without requiring rewrites. It’s ideal for high-cardinality columns or evolving workloads.
To enable liquid clustering when creating a table:
%sql
CREATE TABLE your_delta_table (
id BIGINT,
event_type STRING,
event_date DATE
)
USING DELTA
CLUSTER BY (event_type, event_date);
For existing tables, enable liquid clustering (note: the table must be an unpartitioned Delta table without ZORDER, and existing data is rewritten only when OPTIMIZE next runs):
%sql
ALTER TABLE your_delta_table CLUSTER BY (event_type, event_date);
Run OPTIMIZE to apply clustering:
%sql
OPTIMIZE your_delta_table;
Benefits of Liquid Clustering:
- Dynamic Adaptation: Adjusts data layout as query patterns change, unlike Z-ordering’s static approach.
- Improved Data Skipping: Enhances pruning for queries filtering on clustering columns.
- Simplified Management: Eliminates partition maintenance for high-cardinality columns.
Use Case: If queries frequently filter on event_type and event_date, liquid clustering ensures efficient data access without manual partition tuning. Check query profiles to confirm reduced rows scanned after enabling.
Best Practices:
- Choose up to four clustering columns based on common filters or joins.
- Run OPTIMIZE periodically or enable predictive optimization for auto-maintenance.
- Use DESCRIBE DETAIL your_delta_table to verify clustering columns.
Liquid clustering shines in scenarios with dynamic workloads, complementing OPTIMIZE for maximum performance.
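The "up to four clustering columns" guideline can even be operationalized: rank candidate columns by how often they appear in filters (counts you might mine from statement_text in query history) and cap the list at four. A hypothetical sketch with made-up counts:

```python
MAX_CLUSTERING_COLS = 4  # Databricks caps clustering keys at four columns

def pick_clustering_columns(filter_counts):
    """Rank candidate columns by filter frequency, keep at most four."""
    ranked = sorted(filter_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [col for col, _ in ranked[:MAX_CLUSTERING_COLS]]

counts = {"event_type": 120, "event_date": 95, "user_id": 40,
          "region": 12, "device": 3}
cols = pick_clustering_columns(counts)
# The result feeds into: ALTER TABLE your_delta_table CLUSTER BY (...)
```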
Building Dashboards for Query Monitoring
Instead of running ad-hoc checks, consider creating a Lakehouse monitoring dashboard:
- Top N slowest queries by runtime.
- Top N most expensive queries by DBU cost or bytes scanned.
- Query trends (average runtime per user/team).
- Failure heatmap (when and how often queries fail).
You can build this dashboard in Databricks SQL (Lakeview) or forward metrics to BI tools like Tableau, Power BI, or monitoring systems like Datadog.
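As a sketch of the failure-heatmap idea, the snippet below buckets failed queries by weekday and hour from epoch-millisecond start times such as you might export from system.query.history (record shape and values are fabricated for illustration):

```python
from collections import Counter
from datetime import datetime, timezone

def failure_heatmap(records):
    """Count FAILED queries per (weekday, hour) bucket.
    Each record: {'status': ..., 'start_time_ms': epoch millis}."""
    buckets = Counter()
    for r in records:
        if r.get("status") != "FAILED":
            continue
        ts = datetime.fromtimestamp(r["start_time_ms"] / 1000, tz=timezone.utc)
        buckets[(ts.strftime("%A"), ts.hour)] += 1
    return buckets

# Fabricated records for illustration
records = [
    {"status": "FAILED",   "start_time_ms": 1755651600000},
    {"status": "FINISHED", "start_time_ms": 1755651700000},
    {"status": "FAILED",   "start_time_ms": 1755655200000},
]
heat = failure_heatmap(records)
```

The resulting counter maps directly onto a heatmap visualization in whichever dashboarding tool you choose.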