Iterative reads and writes cause java.lang.OutOfMemoryError: GC overhead limit exceeded
07-24-2023 11:22 PM
I have an iterative algorithm that reads and writes a dataframe while iterating through a list of new partitions, like this:
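(Sketch only - the paths and names are placeholders, "partitions" stands for my list of partition descriptors, and spark is the usual Databricks session.)

for p in partitions:
    df = spark.read.parquet("adls_storage")             # read the source table
    # ... per-partition filter and transformations (see my next post) ...
    df.write.mode("overwrite").parquet("adls_target")   # write this partition out
    # df is simply reassigned on the next iteration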
Max partition data size is 2 TB overall. The job very often succeeds only after the 4th rerun of the pipeline; very often it fails due to "GC overhead limit exceeded". I also observe many GC allocation failures in the standard output. Please check the attached screenshot.
It looks like the execution plans of the previous dataframes stay in the memory of the driver. Is that so?
Is there a way to purge them after each iteration?
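By "purge" I mean something along the lines of the sketch below - the calls are only illustrative of the idea, not what the job currently does, and whether any of it would actually help is exactly my question:

for p in partitions:
    df = spark.read.parquet("adls_storage")     # plus the per-partition filter
    df = df.localCheckpoint()                   # truncate the lineage / logical plan
    df.write.mode("overwrite").parquet("adls_target")
    spark.catalog.clearCache()                  # drop anything cached between iterations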
07-24-2023 11:31 PM
I forgot to mention that on the df creation I am using a filter, because p is actually an object: {cntr_id: 12, secure_key: 15, load_dates: [date1, date2, ...]}. The filter looks like:
from pyspark.sql.functions import col
df = spark.read.parquet("adls_storage").where((col("cntr_id") == p["cntr_id"]) & (col("load_date").isin(p["load_dates"])))
07-25-2023 10:45 PM
@Chalki
"GC Allocation Failure" is a little bit confusing - it indicates that the GC kicks in because there's not enough memory left in the heap. That's normal, and you shouldn't worry about GC Allocation Failure.
What's more worrying is "GC overhead limit exceeded": it means the JVM spent too much time doing GC without reclaiming much memory in return.
Without properly debugging your code, I would say: just scale up.
07-26-2023 12:55 AM - edited 07-26-2023 12:55 AM
@daniel_sahal I've attached the wrong snip. Actually it is "Full GC (Ergonomics)" that was bothering me; now I am attaching the correct snip. But as you said, I scaled up a bit. The thing I forgot to mention is that the table is wide - more than 300 columns. I am not creating extra objects inside the loop except the dataframe on each iteration, and it gets overwritten on the next one.
I still can't figure out why the memory builds up so much on the driver node. Could you give me some more details about it, for my own knowledge?

