Databricks Community

Abiola-David · yesterday

As data engineering teams scale out lakehouses and cloud data warehouses, a silent platform killer inevitably creeps in: runaway query costs.

In a distributed environment like Databricks, a single unoptimized query whether it’s an accidental Cartesian product, a missing filter condition, or a massive table scan on an un-indexed dataset can run for hours, quietly burning through compute resources and spiking your cloud bill.

To build a high-performance, cost-effective data platform, proactive governance isn't just a "nice-to-have"; it is a core responsibility. Fortunately, if you are operating within the Databricks ecosystem, the system itself provides the exact tools you need to hunt down these inefficient "giants."

The 5-Minute Warning: Identifying Long-Running Queries

Instead of waiting for the monthly billing alert to realize something is wrong, you can proactively audit your cluster usage. The following SQL query queries Databricks The 5-Minute Warning: Identifying Long-Running Queries
Instead of waiting for the monthly billing alert to realize something is wrong, you can proactively audit your cluster usage. The following SQL query queries Databricks system.query.history to immediately isolate any query that has been executing for longer than 5 minutes (300,000 milliseconds), sorted by the heaviest offenders: to immediately isolate any query that has been executing for longer than 5 minutes (300,000 milliseconds), sorted by the heaviest offenders:

SELECT
    statement_id,
    executed_by,
    total_duration_ms/1000 AS DurationSeconds,
    statement_text
FROM system.query.history
WHERE total_duration_ms > 300000
ORDER BY total_duration_ms DESC;

Why This Audit Matters for Data Engineering Teams

1. Financial Governance & Cost Optimization

Cloud compute is elastic, which is both a blessing and a curse. If a bad query runs continuously, the system will happily keep charging you for it. By isolating queries that exceed a 5-minute threshold, you can identify which workloads are draining your budget and address them before they compound over days or weeks.

2. Pinpoint Accountability (Who vs. What)

The executed_by field is incredibly powerful. It allows you to differentiate between:

Ad-hoc user queries: A data scientist or analyst running an intensive exploratory query without proper partitioning limits.
Automated pipelines: A scheduled dbt or Delta Live Tables job that has degraded in performance due to data volume growth. Knowing who or what triggered the query allows you to provide targeted feedback or fix the underlying pipeline logic directly.

3. Precision Performance Tuning

Once you grab the statement_text of a bottleneck query, you can look at its Spark UI query plan to apply specific optimization strategies:

Z-Ordering / Liquid Clustering: If the query is doing massive scans, ensuring the data is co-located by high-frequency filter columns will drastically reduce I/O.
Join Optimization: Checking if a shuffle-hash join can be optimized into a broadcast join to mitigate data skew.
Incremental Processing: Evaluating if the logic can be converted to Structured Streaming or incremental loads rather than re-processing full tables.

Building a Culture of Observability

Running this query ad-hoc is a great first step, but the ultimate goal for any DataOps or Platform Engineering team should be automation. Consider building a simple dashboard on top of this system table or setting up an automated alert that pings your team's Slack or Teams channel whenever an ad-hoc query crosses a specific duration threshold.

Keeping your data platform lean, fast, and cost-effective doesn't require magic—it just requires looking at the history your system is already writing for you.

#DataEngineering #Databricks #SQL #DataPlatform #CloudOptimization #BigData #DataOps #ApacheSpark