
Job performance issue: Configurations

Vasu_Kumar_T
New Contributor II

Hello All,

 

One job was taking more than 7 hours. After we added the configuration below, it completed in under 2:30, but after deployment with the same parameters it is again taking 7+ hours.

1) spark.conf.set("spark.sql.shuffle.partitions", 20000)  # increased from the previous 500

2) spark.catalog.clearCache()  # drop cached tables
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()  # also unpersist any cached RDDs
    print("Unpersisted {} rdd".format(id))

3) from pyspark.sql import functions as F   # import needed for F.rand()
DF = DF.withColumn('salt', F.rand())        # add a random salt column
DF = DF.repartition(100, 'salt')            # repartition on the salt to spread skewed keys

Tried with a fixed 20 nodes; it is still taking 7+ hrs after deployment (no change in the notebook or cluster configuration).

Before deployment, with autoscaling from 1 to 20 nodes, it was also completing in under 2:30.

 

Any suggestions are appreciated. Thanks

 

Vasu

1 REPLY

lingareddy_Alva
Honored Contributor III

Hi @Vasu_Kumar_T 

This is a classic Spark performance inconsistency issue. The fact that it works fine in your notebook but degrades after deployment suggests several potential causes. Here are the most likely culprits and solutions:

Primary Suspects
1. Data Skew Variations
Your salt-based repartitioning may not be consistently effective if the underlying data distribution changes between runs or environments; note that F.rand() without a seed produces a different salt on every execution (see the skew-check sketch after this list).

2. Cluster Resource Allocation
A fixed 20-node cluster doesn't guarantee the same resource allocation that the auto-scaling cluster was providing.

3. Memory and Executor Configuration
Executor memory and core settings on the deployed job cluster can differ from those on your interactive cluster even when the node counts match (the diagnostic sketch at the end of this reply prints these values).
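
To confirm whether skew is really the problem, you could profile the distribution of the join/aggregation key, and if it is, switch from F.rand() to a deterministic hash-based salt so the partitioning behaves the same on every run. A minimal sketch, assuming a DataFrame DF and a hypothetical key column customer_id:

from pyspark.sql import functions as F

# Profile the key distribution: a few keys holding most of the rows indicates skew
(DF.groupBy("customer_id")            # hypothetical join/aggregation key
   .count()
   .orderBy(F.desc("count"))
   .show(20, truncate=False))

# Deterministic salt: the same row gets the same salt on every run,
# unlike F.rand(), which generates new values each execution
num_salts = 100
DF = DF.withColumn("salt", F.abs(F.hash("customer_id")) % num_salts)
DF = DF.repartition(num_salts, "salt")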

Environment-Specific Solutions
Check these deployment differences (the sketch below prints the Spark version and key settings at runtime so you can compare the two environments):
- Spark version consistency between notebook and deployment
- Network bandwidth between nodes in production vs. development
- Storage type (SSD vs. HDD) and I/O throughput
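
One way to rule out silent configuration drift is to log the effective Spark version and the settings you rely on at the start of the job, in both the notebook and the deployed run, and diff the output. A minimal sketch; the config keys listed here are assumptions about which settings matter for this job:

# Print the effective runtime settings, then compare notebook vs. deployed output
print("Spark version:", spark.version)

for key in [
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
    "spark.executor.memory",
    "spark.executor.cores",
]:
    # 'unset' means the value comes from a cluster-level or built-in default
    print(key, "=", spark.conf.get(key, "unset"))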

 

LR