Job performance issue : Configurations

Vasu_Kumar_T — Tue, 27 May 2025 13:21:35 GMT

Hello All,

One job was taking more than 7 hrs. After we added the configuration below it takes <2:30 mins, but after deployment with the same parameters it is again taking 7+ hrs.

1) spark.conf.set("spark.sql.shuffle.partitions", 500) --> spark.conf.set("spark.sql.shuffle.partitions", 20000)
2) spark.catalog.clearCache()
   for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
       rdd.unpersist()
       print("Unpersisted {} rdd".format(id))

3) DF = DF.withColumn('salt', F.rand())
   DF = DF.repartition(100, 'salt')

We also tried a fixed 20-node cluster; it still takes 7+ hrs after deployment (no change in notebook or cluster configuration).

Before deployment, with 1:20 (auto scaling), it also took <2:30 mins.

Any suggestions are appreciated. Thanks

Vasu

Re: Job performance issue : Configurations

lingareddy_Alva — Tue, 27 May 2025 18:00:49 GMT

Hi @Vasu_Kumar_T

This is a classic Spark performance inconsistency issue. The fact that it works fine in your notebook
but degrades after deployment suggests several potential causes. Here are the most likely culprits and solutions:

Primary Suspects
1. Data Skew Variations
Your salt-based repartitioning might not be consistently effective
if the underlying data distribution changes between runs or environments.

2. Cluster Resource Allocation
A fixed 20-node cluster doesn't guarantee the same resource allocation as auto-scaling.

3. Memory and Executor Configuration
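
To check the data-skew suspect (item 1), one option is to compare per-partition row counts from the notebook run and the deployed run. The sketch below is plain Python and assumes the counts were already collected, for example with DF.groupBy(F.spark_partition_id()).count(). The helper name skew_ratio and the sample numbers are hypothetical, not taken from the job in this thread.

```python
from statistics import mean


def skew_ratio(partition_counts):
    """Return max/mean of the per-partition row counts (1.0 means perfectly even)."""
    counts = list(partition_counts)
    if not counts or max(counts) == 0:
        return 0.0
    return max(counts) / mean(counts)


# Hypothetical row counts per partition, for illustration only.
even = [100, 100, 100, 100]
skewed = [10, 10, 10, 370]

print("even run skew ratio:", skew_ratio(even))
print("skewed run skew ratio:", skew_ratio(skewed))
```

A ratio that is close to 1 in the notebook but much larger in the deployed run would point toward the salt not spreading the data the same way in both environments.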

Environment-Specific Solutions
Check these deployment differences:
- Spark version consistency between notebook and deployment
- Network bandwidth between nodes in production vs. development
- Storage type (SSD vs. HDD) and I/O throughput
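
For the first item, one way to compare the two environments is to dump the effective Spark settings in each and diff them (the Spark version itself is available as spark.version). The sketch below is plain Python and assumes each environment's settings were captured as a dict, for example via dict(spark.sparkContext.getConf().getAll()). The helper name diff_conf and the sample values are hypothetical.

```python
def diff_conf(notebook_conf, deployed_conf):
    """Return {key: (notebook_value, deployed_value)} for settings that differ."""
    keys = set(notebook_conf) | set(deployed_conf)
    return {
        k: (notebook_conf.get(k), deployed_conf.get(k))
        for k in sorted(keys)
        if notebook_conf.get(k) != deployed_conf.get(k)
    }


# Hypothetical settings, for illustration only.
notebook = {"spark.sql.shuffle.partitions": "20000", "spark.executor.memory": "8g"}
deployed = {"spark.sql.shuffle.partitions": "500", "spark.executor.memory": "8g"}

for key, (nb_value, dep_value) in diff_conf(notebook, deployed).items():
    print(f"{key}: notebook={nb_value} deployed={dep_value}")
```

A setting that appears only on one side shows up with None on the other, which helps spot values applied in the notebook session but never reaching the deployed job.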
