Hello @Rennzie, think of the "Optimizing query & pruning files" step as the warm-up routine before the warehouse starts lifting any real weights. In this window, the engine is lining up the play: skipping irrelevant files, compiling the plan, and running the necessary security checks before we ever touch the underlying data.
If you check the Query Profile, Databricks separates this phase cleanly from scheduling and execution, so when you see spikes here, you're looking at pre-scan work, not the scan itself.
Why the duration jumps around
A few patterns consistently show up in the wild:
- Cold vs. warm cache: After a warehouse wakes up from suspension, those first couple of queries stretch their legs. Metadata is cold, the cache is empty, and the warehouse needs a moment before it behaves like its usual self. After a few runs, everything tightens up.
- Data skipping job scheduling: If the system decides it needs to run a PrepareDeltaScan job to trim the file list, the scheduling delay alone can dominate this phase, especially when the warehouse is already juggling other work. This is why the timing can feel random across identical runs.
- Delta Lake metadata size and file fragmentation: Large transaction logs and piles of tiny files slow down metadata reads and skew pruning times. Teams routinely see this improve immediately after a round of compaction and checkpointing.
- Layout alignment: If your data isn't physically laid out to support your predicates, pruning becomes less effective, and the optimization phase spends more time figuring out what it can safely ignore (there's a small sketch of this right after this list).
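To make the layout point concrete, here's a minimal sketch. It assumes a hypothetical sales table partitioned or clustered on event_date; the table and column names are placeholders, not anything from your environment:

```sql
-- Predicate on the partition/clustering key: the planner can prune whole
-- files during "Optimizing query & pruning files" and scans far less data.
SELECT count(*)
FROM sales
WHERE event_date >= '2024-01-01';

-- Predicate on an unrelated column: data skipping has little to work with,
-- so pruning takes longer and more files survive into the scan.
SELECT count(*)
FROM sales
WHERE customer_id = 42;
```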
What you can do about it
Here are the levers that consistently move the needle:
- Warm the warehouse: Run a lightweight filter on your busiest tables after a restart. This primes caches, metadata, and remote result paths. Remember: the remote result cache sticks around for 24 hours, but only for unchanged tables and exact query matches. (A warm-up sketch follows this list.)
- Reduce file and metadata bloat: OPTIMIZE + VACUUM is still the gold standard. If you lean heavily on certain filters, Z-ORDER or Liquid Clustering will improve locality and make pruning much more decisive. (A maintenance sketch also follows this list.)
- Manage concurrency: If the warehouse is swamped, the pre-scan phase gets stuck in traffic. Consider bumping the warehouse size or adjusting autoscaling if your workload bursts unpredictably.
- Align filters with layout: Target partition columns or clustering keys. If your WHERE clause isn't aligned with how the data is arranged, the engine has no choice but to work harder up front.
- Let the Query Profile guide you: If "PrepareDeltaScan" dominates the Top rules section, that's metadata overhead waving its hands. Focus your fixes there.
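For the warm-up, a cheap, selective query per hot table right after the warehouse resumes is usually enough. This is only a sketch; sales and orders are placeholder tables, and event_date is an assumed partition/clustering column:

```sql
-- Run once after the warehouse resumes, against your busiest tables.
-- A narrow filter on the layout column keeps the warm-up cheap while still
-- touching Delta metadata and populating the caches.
SELECT count(*) FROM sales  WHERE event_date >= date_sub(current_date(), 1);
SELECT count(*) FROM orders WHERE event_date >= date_sub(current_date(), 1);
```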
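And for the maintenance side, a sketch of the compaction and clustering commands mentioned above, again with sales as a placeholder table and event_date / customer_id as assumed filter columns:

```sql
-- Compact small files and co-locate data on the columns you filter by most.
OPTIMIZE sales ZORDER BY (event_date, customer_id);

-- Drop files no longer referenced by the transaction log
-- (default retention is 7 days; shorten it only if you understand the trade-offs).
VACUUM sales;

-- Alternatively, on runtimes that support it, switch the table to Liquid
-- Clustering and let OPTIMIZE cluster incrementally (don't combine with Z-ORDER).
ALTER TABLE sales CLUSTER BY (event_date);
OPTIMIZE sales;
```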
A quick diagnostic pass
If you want to sanity-check your environment fast:
- Compare a "slow" vs. "fast" Query Profile. Look at files pruned, bytes read, and which rules are taking the most time.
- Check whether the warehouse had just restarted; cold starts change everything.
- Observe concurrency: is autoscaling firing? Are several queries landing at once?
- Inspect table health: file counts, partition spread, last OPTIMIZE/VACUUM run (see the sketch after this list).
- Validate predicate alignment with your partitioning or clustering strategy.
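If you want SQL for the table-health check, this sketch pulls the basics; sales is again a placeholder table name:

```sql
-- File count, total size, and partition columns for the table.
DESCRIBE DETAIL sales;

-- Recent operations, including the last OPTIMIZE / VACUUM and the file counts
-- they touched (look at the operation and operationMetrics columns).
DESCRIBE HISTORY sales LIMIT 20;
```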
Hope this sheds some light on the matter.
Cheers, Louis.