Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Question on cluster sizing as per SLA - No resources in DE certification

praveenm00
New Contributor
How do we optimally size clusters and set configs for a given SLA in production workloads? It would have been great to have a real-life project or implementation to understand this in detail. I wish the Databricks certification preparation materials included a good resource that answered this.
2 REPLIES

caua-ferreiraa
New Contributor II

You’re right, the DE cert doesn’t really go deep into how to size clusters for a specific SLA. In real projects, we usually work backwards:

SLA → workload characteristics → cluster config → measure → adjust.

Here’s a practical way to think about it.


1. Start from the SLA and workload type

Questions I usually ask first:

  • How fast does this need to finish? (e.g. 15 minutes, 1 hour, 4 hours)
  • How much data per run? (GB/TB)
  • How often does it run? (every 5 min, hourly, daily)
  • How many jobs run in parallel?
  • Is it interactive / ad‑hoc work, or a scheduled production job?
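The intake questions above can be written down as a small record before any sizing happens. This is just a hypothetical sketch (the class and field names are mine, not a Databricks API):

```python
from dataclasses import dataclass

# Hypothetical intake record for the SLA/workload questions above.
# Forcing these numbers to be written down is half the sizing work.
@dataclass
class WorkloadProfile:
    sla_minutes: float       # how fast it needs to finish
    data_gb_per_run: float   # volume per run
    runs_per_day: int        # frequency
    parallel_jobs: int       # concurrency with other jobs
    interactive: bool        # ad-hoc exploration vs. scheduled production

# Example: an hourly-ish batch job moving ~800 GB with a 60-minute SLA
profile = WorkloadProfile(
    sla_minutes=60, data_gb_per_run=800,
    runs_per_day=24, parallel_jobs=3, interactive=False,
)
print(profile.interactive)  # False -> points toward a job cluster
```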

From that you decide:

  • Job / automated clusters for production pipelines
  • All‑purpose clusters for dev / exploration
  • Serverless (where available) when the workload is spiky or unpredictable
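For a scheduled production pipeline, that decision ends up as a job-cluster spec passed to the Jobs API under `new_cluster`. A rough sketch below, with placeholder values: the runtime version and node type are examples only, and the sanity-check helper is hypothetical:

```python
# Sketch of a job (automated) cluster spec as you'd pass it to the
# Databricks Jobs API under "new_cluster". Runtime version and node
# type are placeholders -- pick them from your workspace's catalog.
job_cluster = {
    "spark_version": "14.3.x-scala2.12",       # example LTS runtime
    "node_type_id": "i3.xlarge",               # example AWS node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "spark_conf": {
        # AQE is on by default in recent runtimes; shown here explicitly
        "spark.sql.adaptive.enabled": "true",
    },
}

def is_sane_autoscale(spec: dict) -> bool:
    """Crude check: autoscale bounds present and ordered correctly."""
    scale = spec.get("autoscale", {})
    return 0 < scale.get("min_workers", 0) <= scale.get("max_workers", 0)

print(is_sane_autoscale(job_cluster))  # True
```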

2. Understand what the job actually does

For a typical ETL job on Delta, look at:

  • Heavy joins, window functions, ML, or just simple transforms?
  • Is it more limited by CPU, memory, or I/O / shuffle?
  • Do you have lots of small files or poorly partitioned tables?

Rough rule of thumb:

  • Mostly CPU work → more cores / compute‑optimized nodes
  • Big joins / shuffles → more RAM / memory‑optimized nodes and better partitioning
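One concrete number that falls out of the shuffle discussion: how many partitions a big shuffle stage should have. A common Spark heuristic targets roughly 128 MB per partition; the helper below is just that arithmetic, not an official formula:

```python
# Back-of-the-envelope shuffle partition estimate, assuming a
# ~128 MB target partition size (a common Spark heuristic).
def estimate_shuffle_partitions(shuffle_bytes: int, target_mb: int = 128) -> int:
    partitions = shuffle_bytes // (target_mb * 1024 * 1024)
    return max(partitions, 1)

# e.g. a 200 GB shuffle stage:
print(estimate_shuffle_partitions(200 * 1024**3))  # 1600
```

With AQE enabled, Spark coalesces shuffle partitions for you, but the estimate is still useful for judging whether a stage's shuffle is simply too big for the cluster's memory.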

3. Initial sizing (just to get started)

This is not a formula, just a starting point so you can test and adjust.

  1. Pick an instance family

    • Default → general‑purpose nodes
    • Heavy transforms / complex logic → compute‑optimized
    • Huge joins / aggregations → memory‑optimized
  2. Rough idea for node count

    • Up to ~500 GB per run → 4–8 medium nodes
    • ~0.5–2 TB → 8–16 medium nodes
    • More than that → 16+ nodes or larger instances
  3. Use autoscaling

    • Set a low min_workers (2–3) and a max_workers that fits your SLA and budget.
    • Turn on auto‑termination so you don’t pay for idle time.

Run the job, see how long it takes, and then scale up/down from there.
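The rough node-count table above could be encoded as a throwaway helper. To be clear, this is purely a first guess to iterate from, not a sizing formula; the 128 GB-per-worker fallback for very large runs is my own crude default:

```python
# Hypothetical starting point for worker count, encoding the rough
# ranges above ("medium" nodes assumed). Always measure and adjust.
def initial_worker_count(data_gb: float) -> int:
    if data_gb <= 500:
        return 8                      # up to ~500 GB: 4-8 nodes; take the high end
    if data_gb <= 2000:
        return 16                     # ~0.5-2 TB: 8-16 nodes
    # beyond ~2 TB: scale linearly at ~128 GB per worker (crude assumption)
    return max(16, int(data_gb // 128))

print(initial_worker_count(300))    # 8
print(initial_worker_count(1500))   # 16
print(initial_worker_count(4000))   # 31
```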


4. Tune based on metrics

After a few runs, check:

  • CPU ~100% and job is slow → add more nodes or move to a bigger instance type.
  • Lots of shuffle / spill to disk → more memory, better partitioning, features like AQE, Delta OPTIMIZE/ZORDER.
  • SLA is easy to meet but cost is high → shrink the cluster or lower the max in autoscaling.

Real‑world sizing is always iterative:
estimate → run → measure → adjust.
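The tuning checklist above can be sketched as a toy triage function. The metric names here are illustrative, not actual Spark UI fields, and the thresholds are arbitrary examples:

```python
# Toy triage of a run's metrics following the checklist above.
# Metric keys and thresholds are illustrative only.
def triage(metrics: dict, sla_minutes: float) -> str:
    if metrics["runtime_min"] > sla_minutes and metrics["cpu_util"] > 0.9:
        return "scale out: add workers or move to bigger instances"
    if metrics["spill_gb"] > 0:
        return "memory pressure: repartition, lean on AQE, OPTIMIZE/ZORDER"
    if metrics["runtime_min"] < 0.5 * sla_minutes:
        return "over-provisioned: shrink cluster or lower autoscale max"
    return "within SLA: keep current config"

# A run that blew a 15-minute SLA with CPU pegged:
print(triage({"runtime_min": 20, "cpu_util": 0.95, "spill_gb": 0}, sla_minutes=15))
```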


5. How this ties back to the DE certification

The exam doesn’t expect you to memorize exact sizes. It cares more about concepts like:

  • When to use job clusters vs. all‑purpose
  • When autoscaling makes sense
  • Picking the most cost‑effective option that still meets the SLA
  • Using Delta + optimizations so you don’t have to oversize the cluster

It would definitely be nice if Databricks had an official end‑to‑end example just on this topic. Until then, the best “training” is to spin up a small test workspace, start with a modest cluster, and keep iterating until you hit your SLA comfortably.

Louis_Frolio
Databricks Employee

Greetings @praveenm00,

Good question, and honestly a fair callout on the cert — it covers cluster config conceptually but never puts you in front of a real sizing problem. There's a good reason for that: it's hard, and it depends on many factors.

Here's how most practitioners actually approach it.

The hard truth first: there's no formula. Sizing for SLA is workload-dependent, so the right move is to profile first, then size — not the other way around.

Before touching any config, get clear on four things: data volume and growth rate, transform complexity (simple filters vs. heavy joins and aggregations), concurrency, and the actual SLA (deadline, max latency, or throughput target). A batch ETL job with a 6am completion window is a fundamentally different problem than a streaming pipeline with a sub-second latency target. The cluster gets tuned to meet the SLA at reasonable cost — not the other way around.

If serverless is available in your org, that's genuinely where I'd start. For most production workloads on Databricks today it's the recommended default — instance selection, autoscaling, and Photon are handled for you.

If you're on classic compute, think in terms of total cores, total memory, and local storage — not just worker count. Match the instance type to the workload: memory-optimized for shuffle-heavy or aggregation-heavy jobs, compute-optimized for CPU-bound work, storage-optimized if you're spilling to disk and partitioning won't fix it.

On autoscaling: use it, but configure it properly. Set min_workers to what the job actually needs at minimum — not 1. Pair it with instance pools to keep warm nodes available so autoscaling ramp-up doesn't eat into your SLA window.
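In spec terms, pairing autoscaling with a pool looks roughly like the snippet below. The pool ID is a placeholder, and the check helper is mine; field names follow the Clusters API convention:

```python
# Sketch of autoscaling backed by an instance pool so warm nodes are
# available during ramp-up. "instance_pool_id" value is a placeholder.
cluster_spec = {
    "autoscale": {"min_workers": 3, "max_workers": 12},  # min sized to real need, not 1
    "instance_pool_id": "pool-0123456789-placeholder",
}

def pool_backed(spec: dict) -> bool:
    """Hypothetical check: autoscaling config that won't cold-start from 1 worker."""
    return (
        "instance_pool_id" in spec
        and spec.get("autoscale", {}).get("min_workers", 0) >= 2
    )

print(pool_backed(cluster_spec))  # True
```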

One thing worth flagging on streaming: classic node-count autoscaling isn't a great fit for latency-sensitive pipelines. If you're on Lakeflow Declarative Pipelines, enhanced autoscaling uses task-slot utilization and queue depth instead — meaningfully different behavior, and worth reading the DLT-specific docs before assuming they work the same way.

And on validation — the part the cert skips entirely — there's no shortcut here. What Databricks' own performance docs describe is just an empirical loop: test on production-representative data, run a few candidate configs, compare runtime against your SLA in the Spark UI (spill, shuffle size, skew), and adjust. Missing the SLA means scaling out or fixing the code. Easily beating it with idle capacity means you're over-provisioned.
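That empirical loop is simple enough to sketch in plain Python. `run_job` below is a stand-in for however you actually trigger the job (Jobs API, CLI, dbutils); everything here is illustrative:

```python
import time

# Empirical SLA check: run a candidate config a few times and require
# every run to land inside the SLA. run_job is a hypothetical callable
# that triggers the job and blocks until it finishes.
def meets_sla(run_job, sla_seconds: float, runs: int = 3) -> bool:
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        run_job()
        durations.append(time.monotonic() - start)
    # Worst case must fit, not the average -- SLAs don't care about means
    return max(durations) <= sla_seconds

# Stand-in "job" that takes ~10 ms against a 1-second SLA:
print(meets_sla(lambda: time.sleep(0.01), sla_seconds=1.0))  # True
```

In practice you would compare candidate configs on production-representative data and keep the cheapest one that still passes.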

Less satisfying than a formula, but that's genuinely how it gets done.

Hope this helps, Louis.