Warehousing & Analytics

Understanding what impacts "Optimizing query & pruning files" time

Rennzie
Visitor

Hi everyone. 

I'm keen to understand what impacts the "Optimizing query & pruning files" times seen in the Query Profile history for a SQL Warehouse query. We're querying Delta tables and can't find any information about why or when a query will have a long compilation and optimisation time. From observation it appears to be random, and we'd like a way to mitigate its effects.

 

Any help is much appreciated. 

1 REPLY

Louis_Frolio
Databricks Employee

Hello @Rennzie, think of the "Optimizing query & pruning files" step as the warm-up routine before the warehouse starts lifting any real weights. In this window, the engine is lining up the play: skipping irrelevant files, compiling the plan, and conducting the necessary security checks before we ever touch the underlying data.

If you check the Query Profile, Databricks separates this phase cleanly from scheduling and execution, so when you see spikes here, you're looking at pre-scan work, not the scan itself.

 

Why the duration jumps around

A few patterns consistently show up in the wild:

  • Cold vs. warm cache

    After a warehouse wakes up from suspension, those first couple of queries stretch their legs. Metadata is cold, the cache is empty, and the warehouse needs a moment before it behaves like its usual self. After a few runs, everything tightens up.

  • Data skipping job scheduling

    If the system decides it needs to run a PrepareDeltaScan job to trim the file list, the scheduling delay alone can dominate this phase, especially when the warehouse is already juggling other work. This is why the timing can feel random across identical runs.

  • Delta Lake metadata size and file fragmentation

    Huge Delta logs and a glut of tiny files slow down metadata reads and inflate pruning times. Teams routinely see this improve immediately after a round of compaction and checkpointing.

  • Layout alignment

    If your data isn't physically laid out to support your predicates, pruning becomes less effective, and the optimization phase spends more time figuring out what it can safely ignore. The sketch after this list makes this concrete.
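To make the layout point concrete, here's a minimal sketch in Databricks SQL. The table and column names (sales.events, event_date, payload) are placeholders I've invented for illustration, not anything from this thread:

    -- Hypothetical Delta table partitioned by event_date.
    CREATE TABLE IF NOT EXISTS sales.events (
      event_id   BIGINT,
      event_date DATE,
      payload    STRING
    )
    USING DELTA
    PARTITIONED BY (event_date);

    -- Aligned predicate: file-level statistics let the engine discard whole
    -- partitions during the "Optimizing query & pruning files" phase.
    SELECT count(*) FROM sales.events WHERE event_date = '2024-06-01';

    -- Misaligned predicate: a filter on a column the layout knows nothing
    -- about forces the optimizer to consider far more files up front.
    SELECT count(*) FROM sales.events WHERE payload LIKE '%checkout%';

Running both and comparing their Query Profiles (files pruned, bytes read) is a quick way to see how much layout alignment buys you.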

 

What you can do about it

Here are the levers that consistently move the needle:

  • Warm the warehouse

    Run a lightweight filter on your busiest tables after a restart; a sketch follows this list. This primes caches, metadata, and remote result paths. Remember: the remote result cache sticks around for 24 hours, but only for unchanged tables and exact query matches.

  • Reduce file and metadata bloat

    OPTIMIZE + VACUUM is still the gold standard. If you lean heavily on certain filters, Z-ORDER or Liquid Clustering will improve locality and make pruning much more decisive.

  • Manage concurrency

    If the warehouse is swamped, the pre-scan phase gets stuck in traffic. Consider bumping warehouse size or adjusting autoscaling if your workload bursts unpredictably.

  • Align filters with layout

    Target partition columns or clustering keys. If your WHERE clause isn't aligned with how the data is arranged, the engine has no choice but to work harder up front.

  • Let the Query Profile guide you

    If "PrepareDeltaScan" dominates the Top rules section, that's metadata overhead waving its hands. Focus your fixes there.
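To ground the warm-up and maintenance levers, here's a hedged sketch. Again, sales.events, event_date, and event_id are placeholder names; substitute your own hot tables and most common filter columns:

    -- Warm-up after a restart: a cheap, selective query primes metadata
    -- and caches without scanning much data.
    SELECT max(event_date) FROM sales.events;

    -- Compact small files and co-locate rows by a frequently filtered column
    -- (you cannot Z-ORDER by a partition column, so pick a non-partition key).
    OPTIMIZE sales.events ZORDER BY (event_id);

    -- Remove files no longer referenced by the Delta log
    -- (default retention is 7 days).
    VACUUM sales.events;

On unpartitioned tables with a recent runtime, Liquid Clustering (CLUSTER BY) is the newer alternative to Z-ORDER for the same locality goal.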

 

A quick diagnostic pass

If you want to sanity-check your environment fast:

  1. Compare a "slow" vs. "fast" Query Profile. Look at files pruned, bytes read, and which rules take the most time.

  2. Check whether the warehouse had just restarted; cold starts change everything.

  3. Observe concurrency: autoscaling firing? Several queries landing at once?

  4. Inspect table health: file counts, partition spread, last OPTIMIZE/VACUUM run (two handy commands follow this list).

  5. Validate predicate alignment with your partitioning or clustering strategy.
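For step 4, two built-in commands cover most of the table-health check; sales.events is again a placeholder:

    -- numFiles and sizeInBytes show how fragmented the table is.
    DESCRIBE DETAIL sales.events;

    -- Scan recent operations for the last OPTIMIZE / VACUUM entries.
    DESCRIBE HISTORY sales.events LIMIT 20;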

 

Hope this sheds some light on the matter.

Cheers, Louis.
