Hello All,
One job was taking more than 7 hours. After we added the configuration below, it completed in under 2:30, but after deployment with the same parameters it is taking 7+ hours again.
1) Increased shuffle partitions from 500 to 20000:
   spark.conf.set("spark.sql.shuffle.partitions", 20000)  # was 500
2) spark.catalog.clearCache()
   for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
       rdd.unpersist()
       print("Unpersisted {} rdd".format(id))
3) from pyspark.sql import functions as F
   DF = DF.withColumn('salt', F.rand())
   DF = DF.repartition(100, 'salt')
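For context on step 3: salting spreads a skewed key across many partitions so that no single task receives all the hot rows. A minimal pure-Python sketch of the idea (no Spark; the sample data, key names, and partition counts are made up for illustration):

```python
import random
from collections import Counter

def partition_counts(rows, num_partitions, key_fn):
    """Count how many rows land in each partition for a given partitioning key."""
    return Counter(key_fn(r) % num_partitions for r in rows)

# Heavily skewed data: one "hot" key dominates (mimics a skewed shuffle key).
rows = [("hot", i) for i in range(9000)] + [(f"k{i}", i) for i in range(1000)]

NUM_PARTITIONS = 100
random.seed(42)

# Without salt: every "hot" row hashes to the same partition.
plain = partition_counts(rows, NUM_PARTITIONS, key_fn=lambda r: hash(r[0]))

# With salt: append a random salt value to each row, then partition on
# (key, salt) -- analogous to F.rand() followed by repartition on 'salt'.
salted_rows = [(k, v, random.randrange(NUM_PARTITIONS)) for (k, v) in rows]
salted = partition_counts(salted_rows, NUM_PARTITIONS,
                          key_fn=lambda r: hash((r[0], r[2])))

print("max rows in one partition, no salt:", max(plain.values()))
print("max rows in one partition, salted :", max(salted.values()))
```

The unsalted run puts all 9000 hot rows in one partition (one straggler task); the salted run spreads them roughly evenly, which is the effect the repartition on 'salt' aims for.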
We also tried a fixed 20-node cluster, but it still takes 7+ hours after deployment (no change in the notebook or cluster configuration).
Before deployment, the same job with autoscaling (1:20) also completed in under 2:30.
Any suggestions are appreciated. Thanks
Vasu