ReKa
New Contributor III

This is a generic problem.

Cheap solution is to increase number of shuffle partitions (in case loads are skewed) or restart the cluster.

Safe solution is to increase cluster size or node sizes (SSD, RAM,…)

Eventually, you have to make sure that you have efficient codes. You read and write (do not keep things in memory, but instead process like a streaming pipeline from source to sink). Things like repartition can break this efficiency.

Also make sure that you are not overwriting a cached variable. For example below code is wrong:

df=…cache()

df=df.withColumn(…..).cache()

Instead put an unpersist between both lines. Otherwise there is an orphan reference to a cached data.