Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best cluster configuration to process 100GB of data?

Sunny_singh
New Contributor

Hi Everyone,

I’m new to Data Engineering and often get this interview question:

“What’s the best cluster configuration to process 100GB of data in Databricks?”

How should we answer this from a Databricks perspective in two cases:

  1. Complex transformations (heavy shuffle like joins/groupBy)
  2. Simple transformations (minimal shuffle)

What would be the best numbers for executors, cores, memory, and partitions for the above two cases? And what worker type should be used?

Would appreciate your guidance. Thanks!

3 REPLIES

saurabh18cs
Honored Contributor III

Hi, I don't think this is the right place to discuss interview questions, because there are many aspects to this that you have to dig into using the comprehensive Databricks documentation.

There is no exact number because autoscaling plays a crucial role. I would say for simple transformations, go with compute-optimized instances with fewer workers and more cores. For complex ones, I prefer a memory-optimized compute type with a few more workers and fewer cores each, so we can avoid spill.

For 100 GB of data with 128 MB initial partitions per task, that works out to approximately 800 partitions. So for a simple job, vertical scaling will help complete those tasks faster, while for a complex job, scaling out will help complete them faster with less spill.

 

Also try using:

spark.sql.shuffle.partitions=auto
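The 800-partition figure mentioned above is just arithmetic; a quick sketch of where it comes from (values are illustrative, not a recommendation):

```python
# Rough estimate of initial partition count for 100 GB of input at the
# 128 MB default partition size used by Databricks.
data_size_mb = 100 * 1024        # 100 GB expressed in MB
partition_size_mb = 128          # default max partition size

initial_partitions = data_size_mb // partition_size_mb
print(initial_partitions)        # 800
```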

Coffee77
Honored Contributor II

This is a fairly generic question because the answer depends not only on the worker node configuration, but also on the Spark configuration—particularly how partitions are handled before and after shuffles. Other important considerations include whether Adaptive Query Execution (AQE) is enabled and how many files (and of what size) you plan to process concurrently (just one, or multiple?).

When sizing a cluster, you should select a set of workers capable of processing the target number of partitions in parallel. Key factors to consider:

  • Maximum partition size – In Databricks, the default is 128 MB.

  • Number of partitions after shuffle – Default is 200.

  • AQE enabled – If enabled, Spark can dynamically optimize shuffle partitions. However, it’s best practice to monitor this behavior rather than rely entirely on it.

For example, if you process a single 100 GB file with the default 128 MB partition size, this would generate approximately 800 initial partitions. I think this number is very high.

Since:

  • One partition is processed by one task

  • One task requires one core

The total number of cores available directly determines the level of parallelism and the number of execution waves (iterations). More cores increase parallelism—but also cost.

As an architect, your role is to balance performance, cost, and execution time.

Example Sizing Approach

A reasonable starting point could be:

  • 2 × Azure D8as v6

    • 8 cores each

    • 32 GB RAM each

If you increase the maximum partition size to 512 MB, a 100 GB file would produce around 200 initial partitions instead of 800. BUT THIS CONFIGURATION DEPENDS ON YOUR REQUIREMENTS. You could even make larger partitions depending on expected concurrency and requirements.

With 16 total cores:

  • ~12–13 execution waves would be required

  • Approximately 8 GB of data would be processed concurrently

This configuration should leave enough available memory to handle additional DataFrames concurrently, though this must be validated based on workload characteristics.
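The wave and concurrency numbers in the example above can be checked with quick back-of-envelope arithmetic (all figures illustrative, taken from the 2 × 8-core / 512 MB sizing sketched here):

```python
import math

# 2 workers x 8 cores, 512 MB partitions, one 100 GB file.
total_cores = 2 * 8                          # 16 cores in the cluster
partitions = (100 * 1024) // 512             # -> 200 initial partitions

waves = math.ceil(partitions / total_cores)  # execution waves needed
concurrent_gb = total_cores * 512 / 1024     # data in flight per wave

print(waves)          # 13
print(concurrent_gb)  # 8.0
```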

Enabling AQE is recommended so Spark can optimize shuffle partitions dynamically. However, you should monitor the resulting shuffle partition counts and manually tune them if they are not optimal for the post-shuffle DataFrame size.

If budget allows, consider using instances with local SSD storage to improve shuffle operations, such as:

  • Azure D8ads v6

Local SSDs can significantly improve I/O performance during shuffle and spill operations.

THIS IS ONLY A GENERIC ANSWER.

KR.


Lifelong Solution Architect Learner | Coffee & Data

SteveOstrowski
Databricks Employee

Hi @Sunny_singh,

Welcome to Data Engineering! This is a great question and one that comes up frequently. There is no single "correct" answer because real-world sizing depends on data format, cloud provider, data skew, and more, but here is a solid framework you can use to reason through the two cases you described.


FOUNDATIONAL CONCEPTS

Before jumping into configurations, here are the key principles:

- 100GB of compressed Parquet/Delta on disk can expand 2-5x in memory, so plan for 200-500GB of effective in-memory data.
- The default partition size in Databricks is 128MB, so 100GB produces roughly 800 initial partitions.
- One partition is processed by one task, and one task requires one core. So total cores determine how many partitions run in parallel.
- Adaptive Query Execution (AQE) is enabled by default on Databricks Runtime and dynamically optimizes shuffle partitions, coalesces small partitions, and handles skew. This is a major advantage on the platform.

Reference: https://docs.databricks.com/optimizations/aqe.html


CASE 1: SIMPLE TRANSFORMATIONS (MINIMAL SHUFFLE)

Examples: filters, column projections, map-only operations, simple aggregations on pre-partitioned data.

What matters most: raw CPU throughput and read speed. Memory pressure is low since there is little data shuffled between nodes.

Recommended approach -- scale UP (vertical scaling):
- Instance family: Compute-optimized (AWS: c5.4xlarge / Azure: F16s_v2 / GCP: c2-standard-16)
- These give you more cores per dollar since you are CPU-bound, not memory-bound
- Workers: 2-4 workers (each with 16 cores gives you 32-64 total cores)
- Driver: Same type or one size larger
- Autoscaling: Min 2, Max 4

With 64 cores and 800 partitions, you would process in about 12-13 waves. Each wave handles a lightweight task, so this completes quickly.
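The 12-13 wave estimate follows from ceiling division of partitions by cores; a minimal sketch using the numbers above:

```python
import math

# Case 1: 4 workers x 16 cores working through ~800 default-size partitions.
total_cores = 4 * 16                         # 64 cores
partitions = (100 * 1024) // 128             # 100 GB at 128 MB -> 800

waves = math.ceil(partitions / total_cores)  # full passes over the cluster
print(waves)  # 13
```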

Partition tuning: The default 200 shuffle partitions is usually fine since there is minimal shuffle. AQE will coalesce small partitions automatically. You can also set:

spark.conf.set("spark.sql.shuffle.partitions", "auto")

This lets Databricks determine the optimal count based on the query plan and data size.


CASE 2: COMPLEX TRANSFORMATIONS (HEAVY SHUFFLE -- JOINS, GROUPBY, WINDOW FUNCTIONS)

Examples: large joins across two 100GB datasets, groupBy with many keys, window functions, multi-stage aggregations.

What matters most: memory (to avoid disk spill during shuffle) and local SSD storage (for fast spill when it does happen). Shuffle operations move data between executors across the network, so you want enough memory to hold intermediate results and fast local disks as a safety net.

Recommended approach -- scale OUT (horizontal scaling):
- Instance family: Memory-optimized with local SSD (AWS: i3.2xlarge or r5d.2xlarge / Azure: E8ds_v5 or L8s_v3 / GCP: n2-highmem-8)
- Local SSDs are important here because shuffle spill writes to local disk, and SSDs make that dramatically faster
- Workers: 4-8 workers (each with 8 cores and 32-64GB RAM)
- Driver: One size larger to handle the query plan and result collection
- Autoscaling: Min 4, Max 8-10

With 8 workers at 8 cores each (64 cores) and 32GB RAM each (256GB total), you have enough memory to handle the in-memory expansion of 100GB and enough parallelism for shuffle stages.
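A quick sanity check of that memory claim against the 2-5x expansion factor mentioned earlier (illustrative arithmetic only):

```python
# Does 8 workers x 32 GB cover the in-memory expansion of 100 GB of
# compressed data at the assumed 2-5x expansion factor?
cluster_ram_gb = 8 * 32                  # 256 GB across the cluster
low, high = 100 * 2, 100 * 5             # 200-500 GB expanded size

print(cluster_ram_gb >= low)   # True  -> low end fits in memory
print(cluster_ram_gb >= high)  # False -> high end would rely on spill
```

This is why local SSDs matter for this case: at the upper end of the expansion range, some spill to disk is likely.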

Partition tuning: For heavy shuffles, you may want to increase shuffle partitions:

spark.conf.set("spark.sql.shuffle.partitions", "400")

This creates smaller partitions that are less likely to cause memory pressure during shuffle. However, with AQE enabled (the default), the system will automatically coalesce partitions that are too small, so starting with "auto" is also a good approach:

spark.conf.set("spark.sql.shuffle.partitions", "auto")
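To see why more shuffle partitions reduce memory pressure, here is the per-partition size for a few counts, assuming roughly 100 GB of shuffle data (illustrative only):

```python
# Per-partition shuffle size as spark.sql.shuffle.partitions increases.
shuffle_mb = 100 * 1024
for n in (200, 400, 800):
    print(n, shuffle_mb / n)  # 200 -> 512.0 MB, 400 -> 256.0 MB, 800 -> 128.0 MB
```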


WHY THE DIFFERENCE?

The key insight is:

- Simple workloads are CPU-bound: fewer, more powerful nodes work well because there is little data exchange between nodes. This is vertical scaling.

- Complex workloads are memory-and-IO-bound: more nodes with generous memory spread the shuffle data across more executors, reducing per-executor memory pressure. More local SSDs across the cluster also means faster spill recovery. This is horizontal scaling.


ADDITIONAL OPTIMIZATIONS FOR BOTH CASES

1. Enable Photon: Databricks' native vectorized engine accelerates SQL and DataFrame operations (especially joins and aggregations) often by 2-5x. Enable it by checking "Use Photon Acceleration" in the compute configuration.
Reference: https://docs.databricks.com/runtime/photon

2. Use Delta format: If your 100GB is stored as Delta, run OPTIMIZE to compact small files and use liquid clustering on frequently-filtered columns. This reduces read overhead significantly.

3. Consider Serverless Compute: If you want to skip the sizing exercise entirely, Databricks serverless compute handles all of this automatically. It scales on demand, requires no configuration, and is optimized out of the box. For interview discussions this may not be the expected answer, but in practice it is often the simplest path.
Reference: https://docs.databricks.com/compute/cluster-config-best-practices.html

4. Use the latest Databricks Runtime LTS: This ensures you get AQE, dynamic file pruning, and all the latest optimizations enabled by default.
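The Delta maintenance steps from point 2 can be sketched in Databricks SQL; the table name and clustering column below are placeholders, not from this thread:

```sql
-- Hypothetical example: `events` and `event_date` are illustrative names.
OPTIMIZE events;                            -- compact small files
ALTER TABLE events CLUSTER BY (event_date); -- enable liquid clustering
```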


QUICK REFERENCE TABLE

Here is a side-by-side comparison:

Simple Transformations:
Instance type: Compute-optimized (c5, F-series)
Workers: 2-4
Cores per worker: 16
RAM per worker: 32GB
Scaling approach: Vertical (fewer, bigger nodes)
Shuffle partitions: auto or 200
Key bottleneck: CPU throughput

Complex Transformations:
Instance type: Memory-optimized with SSD (i3, r5d, E-series, L-series)
Workers: 4-8
Cores per worker: 8
RAM per worker: 32-64GB
Scaling approach: Horizontal (more nodes)
Shuffle partitions: auto or 400
Key bottleneck: Memory and shuffle I/O

Remember, these are starting points. In practice, you would monitor the Spark UI to look for spill, long-running tasks, or underutilized cores, and then adjust accordingly. The Databricks documentation on cluster configuration best practices is an excellent resource for going deeper:
https://docs.databricks.com/compute/cluster-config-best-practices.html

Hope this helps with both interviews and real-world sizing!

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.