<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Smaller dataset causing OOM on large cluster in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/117802#M9968</link>
    <description>&lt;P&gt;I have a PySpark job reading ~50-55GB of Parquet data from a Delta table on Databricks. The job uses n2-highmem-4 GCP VMs with 1-15 workers and autoscaling. Each worker VM of type n2-highmem-4 has 32GB memory and 4 cores, and each VM runs one executor. 22GB is allocated per executor, i.e. 22*15=330GB overall executor memory, which seems large enough for ~55GB of input data. Shuffle partitions are set to 200, but I'm getting an OOM error.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Input data volume: 55GB&lt;/LI&gt;&lt;LI&gt;Number of workers: 1-15 n2-highmem-4 GCP VMs with autoscaling&lt;/LI&gt;&lt;LI&gt;Executors per worker: 1&lt;/LI&gt;&lt;LI&gt;Cores per executor (or worker): 4, i.e. only 4 tasks can run in parallel&lt;/LI&gt;&lt;LI&gt;Shuffle partitions: 200&lt;/LI&gt;&lt;LI&gt;Partitions per worker: 200/15 = ~13&lt;/LI&gt;&lt;LI&gt;Data per partition: 55GB/200 = ~275MB (this is just an average; with skew, some partitions will have much more data. Is there a way to figure this out from the Spark UI?)&lt;/LI&gt;&lt;LI&gt;Overall executor memory: 22*15=330GB&amp;nbsp;&lt;UL&gt;&lt;LI&gt;Spark memory&amp;nbsp;(storage+execution) per worker =&amp;nbsp;0.6*(22000MB-300MB) = ~13GB&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Could you please help me understand why this is not sufficient and leads to OOM? Also, is it necessary for all ~13 partitions assigned to an executor to fit in memory at once, or, since only 4 tasks run in parallel per executor, is it sufficient for memory to accommodate just 4 partitions at a time?&lt;/P&gt;</description>
    <pubDate>Tue, 06 May 2025 05:52:56 GMT</pubDate>
    <dc:creator>Klusener</dc:creator>
    <dc:date>2025-05-06T05:52:56Z</dc:date>
    <item>
      <title>Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/117802#M9968</link>
      <description>&lt;P&gt;I have a PySpark job reading ~50-55GB of Parquet data from a Delta table on Databricks. The job uses n2-highmem-4 GCP VMs with 1-15 workers and autoscaling. Each worker VM of type n2-highmem-4 has 32GB memory and 4 cores, and each VM runs one executor. 22GB is allocated per executor, i.e. 22*15=330GB overall executor memory, which seems large enough for ~55GB of input data. Shuffle partitions are set to 200, but I'm getting an OOM error.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Input data volume: 55GB&lt;/LI&gt;&lt;LI&gt;Number of workers: 1-15 n2-highmem-4 GCP VMs with autoscaling&lt;/LI&gt;&lt;LI&gt;Executors per worker: 1&lt;/LI&gt;&lt;LI&gt;Cores per executor (or worker): 4, i.e. only 4 tasks can run in parallel&lt;/LI&gt;&lt;LI&gt;Shuffle partitions: 200&lt;/LI&gt;&lt;LI&gt;Partitions per worker: 200/15 = ~13&lt;/LI&gt;&lt;LI&gt;Data per partition: 55GB/200 = ~275MB (this is just an average; with skew, some partitions will have much more data. Is there a way to figure this out from the Spark UI?)&lt;/LI&gt;&lt;LI&gt;Overall executor memory: 22*15=330GB&amp;nbsp;&lt;UL&gt;&lt;LI&gt;Spark memory&amp;nbsp;(storage+execution) per worker =&amp;nbsp;0.6*(22000MB-300MB) = ~13GB&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Could you please help me understand why this is not sufficient and leads to OOM? Also, is it necessary for all ~13 partitions assigned to an executor to fit in memory at once, or, since only 4 tasks run in parallel per executor, is it sufficient for memory to accommodate just 4 partitions at a time?&lt;/P&gt;</description>
      <pubDate>Tue, 06 May 2025 05:52:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/117802#M9968</guid>
      <dc:creator>Klusener</dc:creator>
      <dc:date>2025-05-06T05:52:56Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/117991#M9969</link>
      <description>&lt;DIV class="paragraph"&gt;The OutOfMemory (OOM) issue you're experiencing in your PySpark job could stem from several factors. Here's a breakdown of potential causes and mitigation strategies:&lt;/DIV&gt;
&lt;OL start="1"&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Skew in Data Partitions&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Based on your calculation, the data size per partition is approximately 275 MB. However, due to possible data skew, some partitions could be significantly larger and overwhelm the executor memory. To investigate skew, you can check the Spark UI:
&lt;UL&gt;
&lt;LI&gt;Navigate to the "Stages" tab of the Spark UI.&lt;/LI&gt;
&lt;LI&gt;For failed stages, examine partition sizes in the stage detail summary.&lt;/LI&gt;
&lt;LI&gt;If some partitions are exceptionally large compared to others, this indicates skew.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;To address skew:
&lt;UL&gt;
&lt;LI&gt;Increase the number of shuffle partitions beyond 200 to distribute data more evenly.&lt;/LI&gt;
&lt;LI&gt;Use Adaptive Query Execution (AQE), which dynamically coalesces skewed partitions at runtime. Enable this with: &lt;CODE&gt;spark.conf.set("spark.sql.adaptive.enabled", "true")&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;Consider using Spark’s &lt;CODE&gt;skew&lt;/CODE&gt; hints to handle skewed joins or aggregations.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
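Beyond the basic flag, here is a hedged sketch of the AQE knobs that specifically target skew (property names are stock Apache Spark settings; the values are illustrative starting points, not recommendations tuned to this workload, and `spark` is the Databricks-provided SparkSession):

```python
# Sketch: AQE settings aimed at skewed shuffle partitions.
# Assumes an existing SparkSession named `spark`; values are illustrative.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Split partitions that AQE detects as skewed during sort-merge joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Coalesce many small post-shuffle partitions into fewer, evenly sized ones
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Target size AQE aims for when coalescing or splitting partitions
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
```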
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Execution Memory&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Executors have a unified memory allocation, calculated as &lt;CODE&gt;0.6 * (availableMemory - reservedMemory)&lt;/CODE&gt;, which leaves approximately 13 GB per executor for execution and storage. When tasks need more than this, Spark spills to disk where it can; OOM occurs when a task's working set (for example, the hash table of a join or aggregation, or a single oversized partition) cannot be spilled.&lt;/LI&gt;
&lt;LI&gt;Because only four tasks run in parallel on each executor (four cores per executor), memory may only need to accommodate these four concurrent tasks. However, if any single task exceeds its share of memory, you'll encounter OOM. Ensure partitions are small enough for this allocation.&lt;/LI&gt;
&lt;/UL&gt;
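The ~13 GB figure can be checked with plain arithmetic, which also answers the "4 vs. 13 partitions" question from the original post: only the concurrently running tasks need to fit. A sketch using the numbers from this thread and Spark's defaults (`spark.memory.fraction` = 0.6, 300MB reserved):

```python
# Reproduce the per-executor unified-memory estimate from this thread.
executor_heap_mb = 22_000   # 22GB allocated per executor
reserved_mb = 300           # Spark's reserved memory
memory_fraction = 0.6       # spark.memory.fraction default

unified_mb = memory_fraction * (executor_heap_mb - reserved_mb)
per_task_mb = unified_mb / 4  # 4 cores, so at most 4 concurrent tasks

print(round(unified_mb))   # 13020 MB, i.e. ~13GB of execution+storage memory
print(round(per_task_mb))  # 3255 MB available per concurrently running task
```

So a single task whose working set exceeds roughly 3.3GB (for example, one badly skewed partition) can OOM the executor even though the cluster total of 330GB looks ample.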
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Storage vs. Execution Memory&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Managing memory pressure from intermediate data (shuffle, join, or aggregation output) can help reduce OOM issues. You can adjust the memory configuration to allocate more of the heap to the unified (execution + storage) region: &lt;CODE&gt;spark.conf.set("spark.memory.fraction", "0.8")  # default is 0.6&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;Alternatively, forcing intermediate spills to disk earlier (instead of keeping them in-memory) could mitigate constraints.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Cluster Configuration&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Evaluate the vertical and horizontal scaling of your cluster. If OOM persists despite partition adjustments, consider increasing the memory for each executor or the number of workers to spread the load more evenly.&lt;/LI&gt;
&lt;LI&gt;For instance:
&lt;UL&gt;
&lt;LI&gt;If upgrading workers, opt for instance types optimized for memory.&lt;/LI&gt;
&lt;LI&gt;If increasing the number of workers, repartition the data to maximize parallelism.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
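The repartitioning advice above can be turned into a rough sizing rule (a sketch only: the 128MB target and the task-slot floor are common rules of thumb, not official formulas, and `suggested_partitions` is a hypothetical helper):

```python
import math

def suggested_partitions(input_size_gb, target_partition_mb=128, min_parallelism=60):
    """Suggest a partition count: enough partitions to keep each under the
    target size, but never fewer than the cluster's task slots."""
    by_size = math.ceil(input_size_gb * 1024 / target_partition_mb)
    return max(by_size, min_parallelism)

# 55GB input on 15 workers * 4 cores = 60 task slots
print(suggested_partitions(55))  # → 440, well above the 200 used in this job
```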
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Additional Debugging Tips&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Enable more detailed logging and diagnostic tools to pinpoint challenges in specific stages or tasks.&lt;/LI&gt;
&lt;LI&gt;Use the Spark SQL and Catalyst optimizations (&lt;CODE&gt;explain()&lt;/CODE&gt; function) to understand how transformations and actions are executed. An optimal logical and physical plan helps avoid performance bottlenecks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;DIV class="paragraph"&gt;These steps should help you identify and mitigate the OOM issue affecting your job. As always, iterative tuning and profiling based on specific details of your workload is key to achieving optimal performance.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Hope this helps, Big Roux.&lt;/DIV&gt;</description>
      <pubDate>Tue, 06 May 2025 20:04:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/117991#M9969</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-05-06T20:04:50Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118045#M9970</link>
      <description>&lt;P&gt;Thank you so much for the detailed response, much appreciated. Two follow-up questions:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;How do we check the partition size (or skew) for failed tasks from the UI? For example, if I go to the Spark UI for the failed stage, it gives the summary below. It shows 4 tasks as failed, but does not indicate the partition size that caused the OOM.&lt;/LI&gt;&lt;LI&gt;'Summary Metrics' indicates a Max Shuffle Read Size of 838.1MB. Just curious, isn't that too small a size to cause OOM?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Klusener_0-1746599429539.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16557iFE721C63F6827000/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Klusener_0-1746599429539.png" alt="Klusener_0-1746599429539.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 06:33:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118045#M9970</guid>
      <dc:creator>Klusener</dc:creator>
      <dc:date>2025-05-07T06:33:18Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118130#M9971</link>
      <description>&lt;P&gt;We will get back to you shortly.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 11:18:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118130#M9971</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-05-07T11:18:26Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118177#M9972</link>
      <description>&lt;P&gt;You need to enable more metrics. Click the link below and turn on all metrics.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="mark_ott_0-1746623229451.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16595i342B6DB81B0D56BF/image-size/medium?v=v2&amp;amp;px=400" role="button" title="mark_ott_0-1746623229451.png" alt="mark_ott_0-1746623229451.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 13:07:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118177#M9972</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-05-07T13:07:43Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118183#M9973</link>
      <description>&lt;P&gt;Oops, that's the wrong pic. Here's the correct one.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="metrics.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16599i0A07DF6DFABE0BA9/image-size/medium?v=v2&amp;amp;px=400" role="button" title="metrics.png" alt="metrics.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt; &lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 13:09:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118183#M9973</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-05-07T13:09:46Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118198#M9974</link>
      <description>&lt;P&gt;I'm guessing you are running one or more wide transformations in your query, and that is causing skewed shuffle partitions. Go back to the Stages tab and check the 'Shuffle Write Size/Records' row.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="m2.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16607iB769923247FBF75D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="m2.png" alt="m2.png" /&gt;&lt;/span&gt;&lt;/P&gt;
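The check being suggested here, eyeballing 'Shuffle Write Size/Records' for outliers, can be sketched in plain Python. The factor-of-5 cutoff mirrors AQE's default skewed-partition factor and is used purely for illustration; `skewed_partitions` and the sizes are hypothetical:

```python
from statistics import median

def skewed_partitions(sizes_mb, factor=5.0):
    """Flag partitions whose size exceeds `factor` times the median size."""
    med = median(sizes_mb)
    return [s for s in sizes_mb if s > factor * med]

# Per-partition shuffle-write sizes as read off the Stages tab (made-up numbers)
sizes = [250, 270, 260, 255, 1900, 265]
print(skewed_partitions(sizes))  # → [1900]
```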
&lt;P&gt; &lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 13:35:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118198#M9974</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-05-07T13:35:31Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118203#M9975</link>
      <description>&lt;P&gt;I'm guessing you have shuffle write sizes that are &amp;gt; 1GB. That's when things start going down the rathole with spill and OOM. Here are a few questions for you: Is Adaptive Query Execution enabled? Also, I saw some nasty Java garbage collection in your earlier screenshot. Is your cluster Photon-enabled? That can reduce the GC pressure.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 13:40:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118203#M9975</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-05-07T13:40:06Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118205#M9976</link>
      <description>&lt;P&gt;Other things to consider: By any chance do you have Spot instances turned on for workers (an edge case)? I've seen this handcuff AQE. If you have a join, is the smaller table the first table in the JOIN? Are you running ANALYZE TABLE, which can change the join strategy to one that won't go OOM? These are some things to consider.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 13:43:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118205#M9976</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-05-07T13:43:48Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118236#M9977</link>
      <description>&lt;P&gt;Much appreciated, &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/82205"&gt;@mark_ott&lt;/a&gt; and &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34815"&gt;@Louis_Frolio&lt;/a&gt;, for the prompt responses.&lt;/P&gt;&lt;P&gt;The job uses the cluster/settings below.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Cluster/Spark version -&amp;nbsp;&lt;SPAN&gt;Driver: n2-highmem-4 · Workers: n2-highmem-4 · 5-15 workers · DBR: 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12) on GCP&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Photon is not enabled&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Spot/preemptible instances are enabled&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;The rest are default Databricks settings; no configs are set explicitly&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN&gt;I just enabled 'Show Additional Metrics' on the stage and am attaching the job/stage/task details from the Spark UI. Only a single job and stage failed. There is no shuffle write.
Isn't AQE enabled by default from Spark 3 onwards?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Klusener_0-1746629479189.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16623iCEE4A0A636ED8C5C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Klusener_0-1746629479189.png" alt="Klusener_0-1746629479189.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Klusener_1-1746629503205.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16624i3698A7BD80CBBF53/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Klusener_1-1746629503205.png" alt="Klusener_1-1746629503205.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Klusener_2-1746629611871.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16625iB67F9C5763461D9D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Klusener_2-1746629611871.png" alt="Klusener_2-1746629611871.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 15:06:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118236#M9977</guid>
      <dc:creator>Klusener</dc:creator>
      <dc:date>2025-05-07T15:06:57Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118473#M9978</link>
      <description>&lt;P&gt;OK, without having your code or DAG, it's a little difficult to figure this out. But here's something that should work. First, figure out how many memory partitions you have. Apparently, your memory partitions are too big for the cluster, hence the OOM. Use this generic code as a template.&lt;/P&gt;
&lt;P&gt;&lt;CODE&gt;num_partitions = df.rdd.getNumPartitions()&lt;/CODE&gt;&lt;/P&gt;
&lt;P&gt;&lt;CODE&gt;print(num_partitions)&lt;/CODE&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 08 May 2025 13:04:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118473#M9978</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-05-08T13:04:41Z</dc:date>
    </item>
    <item>
      <title>Re: Smaller dataset causing OOM on large cluster</title>
      <link>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118477#M9979</link>
      <description>&lt;P&gt;Next, use &lt;STRONG&gt;repartition(n)&lt;/STRONG&gt; to increase your dataframe to twice the number you got earlier. For example, if num_partitions was 30, then call &lt;STRONG&gt;repartition(60)&lt;/STRONG&gt; prior to running your query. With half the data in each memory partition, I'm guessing you won't OOM. If you still do, double the number again until the OOM disappears.&lt;/P&gt;</description>
      <pubDate>Thu, 08 May 2025 13:08:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/smaller-dataset-causing-oom-on-large-cluster/m-p/118477#M9979</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-05-08T13:08:07Z</dc:date>
    </item>
  </channel>
</rss>

