I have an iterative algorithm that reads and writes a DataFrame while iterating through a list of new partitions, like this:
for p in partitions_list:
    # read the parquet data for the current partition
    df = spark.read.parquet(f"adls_storage/{p}")
    # overwrite only the matching partitions of the Delta table
    (df.write.format("delta")
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")
       .saveAsTable("schema.my_delta_table"))
The maximum partition data size is about 2 TB overall. The job very often succeeds only after the 4th rerun of the pipeline; it frequently fails with "GC overhead limit exceeded", and in the standard output I also see many GC allocation failures. Please check the screenshot.
It looks like the execution plans of the previous DataFrames stay in driver memory. Is that so?
Is there a way to purge them after each iteration, e.g. something along the lines of the sketch below?
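To clarify what I mean by "purge", this is roughly what I have in mind (a sketch only, assuming df.unpersist() and spark.catalog.clearCache() are the relevant calls here, and using the same schema.my_delta_table placeholder as above):

for p in partitions_list:
    df = spark.read.parquet(f"adls_storage/{p}")
    (df.write.format("delta")
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")
       .saveAsTable("schema.my_delta_table"))

    # try to release per-iteration state on the driver after the write
    df.unpersist()              # no-op unless the DataFrame was cached
    spark.catalog.clearCache()  # drops any cached tables/DataFrames
    del df                      # drop the Python reference so the object can be garbage collected

I'm not sure whether any of these actually free the accumulated query plans on the driver, or whether something else is needed.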