Executor OOM Error with AQE enabled
05-09-2025 04:22 AM
We have a Databricks Spark job. After migrating from Databricks Runtime 10.4 to 15.4, one of our Spark jobs that uses a broadcast hint started to fail with this error:
```
ERROR Executor: Exception in task 2.0 in stage 371.0 (TID 16912)
org.apache.spark.memory.SparkOutOfMemoryError: [EXECUTOR_BROADCAST_JOIN_OOM] There is not enough memory to build the broadcast relation LongToUnsafeRowMap. Relation Size = 1462.4 MiB. Total memory used by this task = 1526.4 MiB. Executor Memory Manager Metrics: onHeapExecutionMemoryUsed = 2.4 GiB, offHeapExecutionMemoryUsed = 0.0 B, onHeapStorageMemoryUsed = 472.5 MiB, offHeapStorageMemoryUsed = 0.0 B. [sparkPlanId: Some(44226)] SQLSTATE: 53200
```
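The message mixes MiB, GiB and B, which makes the figures awkward to compare by eye. A minimal sketch that pulls the `label = size` pairs out of such a message and converts them to bytes; `parse_memory_metrics` is a hypothetical helper written for this thread, not part of Spark or Databricks:

```python
import re

# Multipliers for the binary units Spark prints in its memory messages.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Matches "<label> = <number> <unit>", e.g. "Relation Size = 1462.4 MiB".
METRIC = re.compile(r"([A-Za-z][A-Za-z ]*?) = ([\d.]+) (B|KiB|MiB|GiB|TiB)\b")


def parse_memory_metrics(message: str) -> dict:
    """Return {label: size_in_bytes} for every 'label = size unit' pair found."""
    return {
        label.strip(): float(value) * UNITS[unit]
        for label, value, unit in METRIC.findall(message)
    }


if __name__ == "__main__":
    # Excerpt of the error text quoted above.
    msg = (
        "Relation Size = 1462.4 MiB. Total memory used by this task = 1526.4 MiB. "
        "Executor Memory Manager Metrics: onHeapExecutionMemoryUsed = 2.4 GiB, "
        "offHeapExecutionMemoryUsed = 0.0 B, onHeapStorageMemoryUsed = 472.5 MiB, "
        "offHeapStorageMemoryUsed = 0.0 B."
    )
    for label, size in parse_memory_metrics(msg).items():
        print(f"{label}: {size / 1024**2:.1f} MiB")
```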
The job fails regardless of the resources we use; it fails even with Standard_D8s_v3 worker nodes, which have 32 GB of RAM.
Also, before the error there is a log message showing that there is enough memory:
```
INFO MemoryStore: Block broadcast_188 stored as values in memory (estimated size 359.3 KiB, free 24.0 GiB)
```
This looks like an Adaptive Query Execution (AQE) issue, since disabling AQE solves the problem.
Could anybody advise how to overcome this issue without disabling AQE?
05-09-2025 04:54 AM
Have you tried removing the broadcast hint?
Recent Databricks Runtime versions have added a lot of optimizations.
Also, do you use the Photon engine?
05-09-2025 05:06 AM
No, we don't want to remove the broadcast hint: it works without problems in DBR 10.4, and there is a lot of memory available for it.
05-09-2025 05:12 AM
OK, that is up to you.
An executor will not be able to use all of the node's RAM.
You can try adjusting the spark.executor.* parameters.
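To illustrate why an executor (and each task in it) sees much less than the node's RAM, here is a rough sketch of the arithmetic used by open-source Spark's unified memory manager: 300 MiB of the heap is reserved, `spark.memory.fraction` (default 0.6) of the rest is shared by execution and storage, and a single task can hold at most 1/N of the execution pool when N tasks are active. The heap size and task count below are made-up illustration values, and Databricks may configure these settings differently, so treat the result as an estimate only:

```python
RESERVED_BYTES = 300 * 1024**2  # memory Spark sets aside before splitting the heap


def unified_memory(heap_bytes, memory_fraction=0.6):
    """Bytes shared by execution and storage (spark.memory.fraction of usable heap)."""
    return (heap_bytes - RESERVED_BYTES) * memory_fraction


def per_task_execution_cap(heap_bytes, active_tasks, memory_fraction=0.6):
    """Upper bound on execution memory one task can hold while N tasks are active.

    Assumes storage holds nothing; cached blocks and broadcasts lower the real cap.
    """
    return unified_memory(heap_bytes, memory_fraction) / active_tasks


if __name__ == "__main__":
    heap = 20 * 1024**3  # hypothetical executor heap, NOT taken from the thread
    tasks = 8            # e.g. one task per vCPU on an 8-core worker
    cap = per_task_execution_cap(heap, tasks)
    print(f"unified memory: {unified_memory(heap) / 1024**3:.2f} GiB")
    print(f"per-task execution cap: {cap / 1024**2:.0f} MiB")
```

If the broadcast relation has to be built inside a single task, it is this per-task figure, not the free memory reported by the MemoryStore, that is the relevant limit to compare against.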
07-01-2025 04:06 AM
I found a similar issue:
https://kb.databricks.com/python/job-fails-with-not-enough-memory-to-build-the-hash-map-error
It looks like the cause of the error is a bug in a newer Databricks feature called executor-side broadcast (EBJ, executor broadcast join), which was introduced in DBR 11.3.
Unfortunately, I could not find a way to disable this feature, so I am keeping AQE disabled for now.
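For anyone applying the same workaround, AQE can be switched off for just the affected job's session instead of the whole cluster. A minimal sketch, assuming `spark` is an existing SparkSession (it is predefined in Databricks notebooks); `set_aqe` is a hypothetical helper name, while `spark.sql.adaptive.enabled` is the standard Spark SQL setting:

```python
AQE_FLAG = "spark.sql.adaptive.enabled"  # standard Spark SQL configuration key


def set_aqe(spark, enabled: bool) -> str:
    """Turn AQE on or off for this session and return the previous value."""
    previous = spark.conf.get(AQE_FLAG, "true")  # "true" is used if the key is unset
    spark.conf.set(AQE_FLAG, str(enabled).lower())
    return previous
```

Typical use would be `previous = set_aqe(spark, False)` before the failing join and `spark.conf.set(AQE_FLAG, previous)` afterwards, so the rest of the job still benefits from AQE.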