
Iterative read and writes cause java.lang.OutOfMemoryError: GC overhead limit exceeded

Chalki
New Contributor III

I have an iterative algorithm which reads and writes a dataframe while iterating through a list of new partitions, like this:

 

for p in partitions_list:
    df = spark.read.parquet(f"adls_storage/{p}")
    (df.write.format("delta")
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .saveAsTable("schema.my_delta_table"))

 

Max partition data size is 2 TB overall. The job very often succeeds only after the 4th rerun of the pipeline; very often it fails with GC overhead limit exceeded. I also observe many GC allocation failures in the standard output. Please check the screenshot.

It looks like the execution plans of the previous dataframes stay in the driver's memory. Is that so?
Is there a way to purge them after each iteration?
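
For reference, this is the kind of per-iteration cleanup I have in mind (just a sketch using standard PySpark/Python calls - I don't know whether it actually releases whatever is accumulating on the driver):

import gc

for p in partitions_list:
    df = spark.read.parquet(f"adls_storage/{p}")
    (df.write.format("delta")
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .saveAsTable("schema.my_delta_table"))

    # attempted cleanup at the end of each iteration (illustrative only)
    del df                       # drop the Python reference to the dataframe
    gc.collect()                 # force a Python GC cycle on the driver
    spark.catalog.clearCache()   # drop anything held in Spark's cache manager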

3 REPLIES

Chalki
New Contributor III

I forgot to mention that when creating the df I am using the filter method, because p is actually an object:

{cntr_id: 12, secure_key: 15, load_dates: [date1, date2, ...]}. The filter looks like:

df = (spark.read.parquet("adls_storage")
        .where((col("cntr_id") == p["cntr_id"]) & (col("load_date").isin(p["load_dates"]))))

daniel_sahal
Esteemed Contributor

@Chalki 
"GC Allocation Failure" is a little bit confusing as a name - it just indicates that a GC cycle kicked in because there wasn't enough free memory left in the heap. That's normal, and you shouldn't worry about GC Allocation Failure.

What is more worrying is "GC overhead limit exceeded" - it means that the JVM spent too much time doing GC with very little memory reclaimed in return.

Without properly debugging your code, my suggestion would be: just scale up.
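
If it helps, before resizing you can quickly check which driver-side limits the cluster is currently running with (a small sketch using standard PySpark calls; "not set" just means the value comes from the node-type defaults rather than an explicit override):

# inspect the driver-side memory settings currently in effect
conf = spark.sparkContext.getConf()
print(conf.get("spark.driver.memory", "not set"))
print(conf.get("spark.driver.maxResultSize", "not set"))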

Chalki
New Contributor III

@daniel_sahal I've attached the wrong snip - actually it is Full GC (Ergonomics) that was bothering me. Now I am attaching the correct snip. But, as you said, I scaled up a bit. The thing I forgot to mention is that the table is wide - more than 300 columns. I am not creating extra objects inside the loop except for the dataframe on each iteration, and it gets overwritten on the next one.
I still can't figure out how the memory builds up so much on the driver node. Could you give me some more details about it, for my own knowledge?
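
In the meantime, one restructuring I am considering is wrapping the per-partition work in a function, so the dataframe reference goes out of scope as soon as each call returns (again just a sketch - whether that changes anything about the driver-side build-up is exactly what I'm trying to understand):

from pyspark.sql.functions import col

def overwrite_partition(p):
    # read only the slice described by p, then dynamically overwrite that partition
    df = (spark.read.parquet("adls_storage")
            .where((col("cntr_id") == p["cntr_id"]) &
                   (col("load_date").isin(p["load_dates"]))))
    (df.write.format("delta")
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .saveAsTable("schema.my_delta_table"))

for p in partitions_list:
    overwrite_partition(p)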
