
Iterative read and writes cause java.lang.OutOfMemoryError: GC overhead limit exceeded

Chalki
New Contributor III

I have an iterative algorithm which reads and writes a dataframe while iterating through a list of partitions, like this:

 

for p in partitions_list:
    df = spark.read.parquet(f"adls_storage/{p}")
    (df.write.format("delta")
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .saveAsTable("schema.my_delta_table"))

 

The maximum partition data size is 2 TB overall. The job very often succeeds only after the 4th rerun of the pipeline; very often it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded. Also, in the standard output I observe many GC allocation failures. Please check the screenshot.

It looks like the execution plans of the previous dataframes stay in the driver's memory. Is that so?
Is there a way to purge them after each iteration?
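
For example, would something along these lines help, i.e. dropping the reference and clearing any cached state at the end of every iteration? (Just a rough sketch of what I have in mind, not what I currently run.)

for p in partitions_list:
    df = spark.read.parquet(f"adls_storage/{p}")
    (df.write.format("delta")
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .saveAsTable("schema.my_delta_table"))

    # drop the reference and clear any cached tables/dataframes before the next iteration
    del df
    spark.catalog.clearCache()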

3 REPLIES

Chalki
New Contributor III

I forgot to mention that when creating the df I am using the filter method, because p is actually an object:

{cntr_id: 12, secure_key: 15, load_dates: [date1, date2, ...]}. The filter looks like:

df = spark.read.parquet("adls_storage").where((col(cntr_id) == p[cntr_id]) & (col(load_date).isin(p[load_dates])

daniel_sahal
Esteemed Contributor

@Chalki 
GC Allocation Failure is a little bit confusing - it simply indicates that a GC cycle kicked in because there wasn't enough free memory left in the heap. That's normal, and you shouldn't worry about GC Allocation Failure.

What is more worrying is "GC overhead limit exceeded" - it means that the JVM spent too much time doing GC with very little memory reclaimed in return.

Without doing proper debugging of your code, I would say: just scale up.
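
On Databricks the simplest route is usually just picking a larger driver node type for the cluster; in plain Spark terms the knobs are along these lines, set in the cluster's Spark config before the driver starts (values are only illustrative, tune them to your workload):

spark.driver.memory 32g
spark.driver.maxResultSize 8g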

Chalki
New Contributor III

@daniel_sahal I've attached the wrong snip. Actually it is Full GC (Ergonomics) that was bothering me; I am attaching the correct snip now. But as you suggested, I scaled up a bit. The thing I forgot to mention is that the table is wide - more than 300 columns. I am not creating extra objects inside the loop except for the dataframe on each iteration, and it gets overwritten on the next one.
I still can't figure out how the memory builds up so much on the driver node. Could you give me some more details about it, for my own knowledge?
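
Is it perhaps the job/stage/SQL execution state that the driver retains for the Spark UI on every iteration? For example, would lowering the retention settings in the cluster's Spark config make a difference (values below are purely illustrative, not something I currently run)?

spark.ui.retainedJobs 100
spark.ui.retainedStages 100
spark.sql.ui.retainedExecutions 50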
