Understanding what impacts "Optimizing query & pruning files" time

Rennzie — Thu, 04 Dec 2025 18:29:30 GMT

Hi everyone.

I'm keen to understand what impacts the "Optimizing query & pruning files" times seen in the Query Profile History for a sQL Warehouse query. We're querying against delta tables and cannot find any information about what or when a query will have a long complication and optimisation time. From observation it appears to be random and we'd like a way to mitigate it's effects.

Any help is much appreciated.

Re: Understanding what impacts "Optimizing query & pruning files" time

Louis_Frolio — Thu, 04 Dec 2025 20:38:22 GMT

Hello @Rennzie , think of the “Optimizing query & pruning files” step as the warm-up routine before the warehouse starts lifting any real weights. In this window, the engine is lining up the play: skipping irrelevant files, compiling the plan, and conducting the necessary security checks before we ever touch the underlying data.

If you check the Query Profile, Databricks separates this phase cleanly from scheduling and execution, so when you see spikes here, you’re looking at pre-scan work — not the scan itself.

Why the duration jumps around

A few patterns consistently show up in the wild:

Cold vs. warm cache

After a warehouse wakes up from suspension, those first couple of queries stretch their legs. Metadata is cold, the cache is empty, and the warehouse needs a moment before it behaves like its usual self. After a few runs, everything tightens up.
Data skipping job scheduling

If the system decides it needs to run a PrepareDeltaScan job to trim the file list, the scheduling delay alone can dominate this phase — especially when the warehouse is already juggling other work. This is why the timing can feel random across identical runs.
Delta Lake metadata size and file fragmentation

Huge logs and a jackpot of tiny files slow down metadata reads and skew pruning times. Teams routinely see this improve immediately after a round of compaction and checkpointing.
Layout alignment

If your data isn’t physically laid out to support your predicates, pruning becomes less effective — and the optimization phase spends more time figuring out what it can safely ignore.

What you can do about it

Here are the levers that consistently move the needle:

Warm the warehouse

Run a lightweight filter on your busiest tables after a restart. This primes caches, metadata, and remote result paths. Remember: the remote result cache sticks around for 24 hours, but only for unchanged tables and exact query matches.
Reduce file and metadata bloat

OPTIMIZE + VACUUM is still the gold standard. If you lean heavily on certain filters, Z-ORDER or Liquid Clustering will improve locality and make pruning much more decisive.
Manage concurrency

If the warehouse is swamped, the pre-scan phase gets stuck in traffic. Consider bumping warehouse size or adjusting autoscaling if your workload bursts unpredictably.
Align filters with layout

Target partition columns or clustering keys. If your WHERE clause isn’t aligned with how the data is arranged, the engine has no choice but to work harder up front.
Let the Query Profile guide you

If “PrepareDeltaScan” dominates the Top rules section, that’s metadata overhead waving its hands. Focus your fixes there.

A quick diagnostic pass

If you want to sanity-check your environment fast:

Compare a “slow” vs “fast” Query Profile. Look at files pruned, bytes read, and which rules are taking the most time.
Check whether the warehouse had just restarted — cold starts change everything.
Observe concurrency: autoscaling firing? Several queries landing at once?
Inspect table health: file counts, partition spread, last OPTIMIZE/VACUUM run.
Validate predicate alignment with your partitioning or clustering strategy.

Hope this sheds some light on the matter.

Cheers, Louis.

Re: Understanding what impacts "Optimizing query & pruning files" time

Rennzie — Fri, 05 Dec 2025 16:03:33 GMT

Hey @Louis_Frolio, thank you for the detailed response - really useful.

We'll work through what looks most relevant to us.

Question 1: Can you confirm of the `compilation_time_ms` in the metrics from the list query endpoint corresponds to the “Optimizing query & pruning files” time on the query history page.

Question 2: Is there a way to see when autoscaling is firing? Is that simply looking at the "Running cluster" on the warehouse monitoring page?

PS: It would be great to see your explanation in the docs somewhere.

topic Re: Understanding what impacts "Optimizing query & pruning files" time in Warehousing & Analytics

Understanding what impacts "Optimizing query & pruning files" time

Re: Understanding what impacts "Optimizing query & pruning files" time

Why the duration jumps around

What you can do about it

A quick diagnostic pass

Re: Understanding what impacts "Optimizing query & pruning files" time