01-24-2025 10:06 PM
Your setup (160GB total memory across 10 workers with 16GB per node and 7.6GB per executor) is substantial, but running out of memory during a DISTINCT operation can still occur due to how Spark handles shuffle data and intermediate processing during deduplication.
While 30GB compressed data might seem manageable given your cluster's resources, several factors can impact memory usage during Spark operations:
1. Decompression Overhead:
- Compressed data inflates significantly upon decompression. For example, a 30GB compressed CSV might expand to 100GB+ in memory, depending on the compression ratio and data structure.
2. Shuffle Stage Memory Usage:
- When Spark executes the DISTINCT operation, it shuffles the data so that identical rows land in the same partition, then aggregates each partition to keep one copy of every row. Each task holds as much of its partition in memory as it can and spills the rest to disk. Large shuffle files can overwhelm both memory and I/O bandwidth.
3. Executor Memory Overhead:
- Of the 7.6GB per executor, a portion is reserved for JVM overhead and shuffle memory buffers, leaving less memory available for processing. If a single executor gets overloaded (this may happen when you have skewed data), it may run out of memory even though the cluster has sufficient total memory.
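To put rough numbers on the points above, here is a minimal back-of-the-envelope sketch in plain Python (no Spark required). It assumes the 7.6GB figure is the executor heap, that Spark's documented defaults apply (300MB reserved, spark.memory.fraction = 0.6, spark.sql.shuffle.partitions = 200), and that the data inflates to the 100GB mentioned above. The function names and the 4 cores per executor are hypothetical, chosen only for illustration; your cluster may differ.

```python
# Back-of-the-envelope estimate only; not a substitute for the Spark UI.

RESERVED_MB = 300        # fixed reservation in Spark's unified memory manager
MEMORY_FRACTION = 0.6    # default spark.memory.fraction


def unified_memory_mb(heap_mb, memory_fraction=MEMORY_FRACTION):
    """Memory shared by execution (shuffle, aggregation) and storage (cache)."""
    return (heap_mb - RESERVED_MB) * memory_fraction


def avg_shuffle_partition_mb(uncompressed_gb, shuffle_partitions):
    """Average shuffle partition size, assuming perfectly even keys."""
    return uncompressed_gb * 1024 / shuffle_partitions


heap_mb = 7.6 * 1024     # assumption: 7.6GB is the executor heap
cores = 4                # assumption: hypothetical cores per executor

usable_mb = unified_memory_mb(heap_mb)
share_mb = usable_mb / cores                        # rough share per concurrent task
partition_mb = avg_shuffle_partition_mb(100, 200)   # 100GB over 200 partitions

print(f"unified memory per executor: {usable_mb:.0f} MB")
print(f"rough share per task:        {share_mb:.0f} MB")
print(f"average shuffle partition:   {partition_mb:.0f} MB")
```

Under these assumptions, only about 4.4GB of the 7.6GB is available for execution and storage combined, and an average shuffle partition is already around 512MB before any skew. A skewed key can make one partition several times larger than the average, which is how a single task can exhaust its share while the rest of the cluster sits idle.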