Data Engineering

Relevance of off-heap memory and usage

Klusener
Contributor

I was referring to the doc - https://kb.databricks.com/clusters/spark-executor-memory.

In general, total off-heap memory = spark.executor.memoryOverhead + spark.memory.offHeap.size. The off-heap mode is controlled by the property spark.memory.offHeap.enabled.

Could you please clarify:

  • The difference between spark.executor.memoryOverhead and spark.memory.offHeap.size, and when to use one over the other?
  • In what use cases/scenarios/operations does Spark need off-heap memory?
  • When spark.memory.offHeap.enabled is set to false, does it disable only spark.memory.offHeap.size, or both spark.executor.memoryOverhead and spark.memory.offHeap.size?
6 REPLIES

Vidhi_Khaitan
Databricks Employee

Hi team,

Answering your questions below -
spark.executor.memoryOverhead: This refers to additional memory allocated for each executor beyond the JVM heap (spark.executor.memory). In short, it covers:
1) JVM overhead, including metadata and garbage collection (GC) overheads.
2) Spark's internal data structures, such as task metadata and shuffle buffers.
3) Python interpreter memory in case of PySpark usage.

spark.memory.offHeap.size: This defines the amount of off-heap memory allocated for Spark executors. Off-heap memory lives outside the JVM heap and is often used for storing large, contiguous blocks of data (e.g., shuffle data or intermediate results), avoiding GC overhead.
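
For illustration, here is a minimal sketch (not a recommendation) of how both kinds of settings could be supplied when a Spark application is launched. The values are assumptions for the example, and executor memory settings only take effect at application/cluster start-up (e.g., via spark-submit --conf or the cluster's Spark config):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Illustrative values only; tune for your workload and node size.
    conf = (
        SparkConf()
        .set("spark.executor.memory", "8g")           # JVM heap per executor
        .set("spark.executor.memoryOverhead", "2g")   # non-heap headroom: JVM metadata, shuffle buffers, Python workers
        .set("spark.memory.offHeap.enabled", "true")  # enable Spark-managed off-heap memory
        .set("spark.memory.offHeap.size", "4g")       # size of the off-heap region
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()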

Operations where Spark uses off-heap memory:
Caching large datasets: Spark may store datasets in off-heap memory to reduce JVM heap memory pressure.
Shuffle operations: Off-heap memory can be used to handle large shuffle operations to minimize GC pressure.
Sorting and aggregations: Results of large-scale sorting or aggregation operations may use off-heap memory.
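
As a concrete sketch of the caching case (reusing the illustrative session above; the DataFrame is just a placeholder): once spark.memory.offHeap.enabled is true and spark.memory.offHeap.size is set, a DataFrame can be persisted with the OFF_HEAP storage level:

    from pyspark import StorageLevel

    # Assumes off-heap memory is enabled as above; otherwise there is no
    # off-heap pool for these cached blocks to land in.
    df = spark.range(10_000_000)         # placeholder dataset for illustration
    df.persist(StorageLevel.OFF_HEAP)    # cache blocks in the off-heap region
    df.count()                           # materialize the cache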

If spark.memory.offHeap.enabled is set to false, it disables only the spark.memory.offHeap.size allocation. spark.executor.memoryOverhead remains unaffected, as it is used for JVM-related overheads and other Spark processes.

I hope I have answered your questions!

 

Thanks for the detailed explanation, much appreciated. Given that memory has the three elements below, and both #2 and #3 are part of JVM (on-heap) memory, why do we need #3, and when is #3 used over #2?

  1. off-heap
  2. spark.executor.memory
  3. spark.executor.memoryOverhead

saurabh18cs
Honored Contributor
  • spark.executor.memory is for JVM heap memory, while spark.executor.memoryOverhead is for non-JVM memory. Off-heap memory is outside the ambit of garbage collection.

The total off-heap memory for a Spark executor is controlled by spark.executor.memoryOverhead. The default value is 10% of executor memory, subject to a minimum of 384 MB. This means that even if the user does not explicitly set this parameter, Spark sets aside 10% of executor memory (or 384 MB, whichever is higher) for JVM overheads.
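
That default can be sketched as max(10% of executor memory, 384 MB); the helper below is purely illustrative arithmetic, not a Spark API:

    # Illustrative only: default memoryOverhead = max(10% of executor memory, 384 MB).
    def default_memory_overhead_mb(executor_memory_mb: int) -> int:
        return max(int(executor_memory_mb * 0.10), 384)

    print(default_memory_overhead_mb(8 * 1024))  # 8g executor -> 819 MB overhead
    print(default_memory_overhead_mb(2 * 1024))  # 2g executor -> 384 MB (minimum applies)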

 

Thanks for the response. Given that memory has the three elements below, and both #2 and #3 are part of on-heap memory, why do we need #3, and when is #3 used over #2?

  1. off-heap
  2. spark.executor.memory
  3. spark.executor.memoryOverhead

@Vidhi_Khaitan could you please respond to the above query? Thanks.

Vidhi_Khaitan
Databricks Employee

Hello,

Thanks for the follow up!

The configurations spark.executor.memory and spark.executor.memoryOverhead serve distinct purposes within Spark's memory management:

spark.executor.memory: This controls the allocated memory for each executor's JVM heap. The JVM uses this memory to store application objects and execute tasks. However, as the heap memory usage grows, garbage collection processes can become slow and introduce latency.

spark.executor.memoryOverhead: This parameter accounts for additional memory beyond the JVM heap for handling specific elements:
1) JVM-related overhead, such as garbage collection metadata.
2) Internal Spark structures, including task metadata and shuffle buffers.
3) Other system-level activities, such as Python interpreter memory when using PySpark.

spark.executor.memoryOverhead helps to isolate and manage memory outside of the JVM heap. This ensures that operations requiring memory not directly related to application execution, such as managing task metadata or shuffle data buffers, do not interfere with the JVM heap space. Without this dedicated allocation, JVM heap memory might experience additional pressure, causing increased garbage collection overhead and performance instability.

Use of spark.executor.memory: Prioritized for application objects and task execution when JVM garbage collection overhead is not critical and the workload fits well within the allocated heap memory.

Use of spark.executor.memoryOverhead: Necessary for workloads with frequent shuffle operations or substantial auxiliary memory needs. It ensures operational stability by isolating this overhead.
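
To put rough numbers on it: the memory a cluster manager has to provide per executor is approximately the JVM heap plus the overhead plus, in recent Spark versions, the off-heap size when off-heap is enabled. A back-of-the-envelope sketch using the same illustrative values as earlier:

    # Back-of-the-envelope executor sizing; values match the illustrative config above.
    heap_gb     = 8   # spark.executor.memory
    overhead_gb = 2   # spark.executor.memoryOverhead
    off_heap_gb = 4   # spark.memory.offHeap.size (counted only when off-heap is enabled)

    container_gb = heap_gb + overhead_gb + off_heap_gb
    print(container_gb)  # 14 -> each executor needs roughly 14 GB of node memory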
