Hi @SP_6721,
Likely Causes
1. I/O Wait Indicates Disk or Network Latency
High I/O wait usually means the CPU is sitting idle while it waits on disk or network I/O. Common reasons:
- Slow disk (DBFS / external storage) access (e.g., S3, ADLS Gen2 throttling)
- Data skew causing a few tasks to spill to disk
- Cluster nodes shared across multiple jobs or not warmed up
- Cold cache on cluster startup — your earlier run might have benefited from cached metadata or files
2. Partition Explosion or Skew
You increased `spark.sql.shuffle.partitions` from 500 to 20,000. That can:
- Improve performance if your data is extremely large and evenly distributed
- Slow down execution if partitions are skewed or the per-task scheduling overhead outweighs the work done per partition
Also, `repartition(100, 'salt')` introduces randomness, which can sometimes mask skew but not eliminate it.
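If you want to see what that salted repartition is (and isn't) doing, here's a minimal sketch, assuming a DataFrame `df` with a skewed join column `key` (both hypothetical names): rows get spread evenly across partitions, but the per-key distribution a join would actually shuffle on is unchanged.

```python
from pyspark.sql import functions as F

# Minimal sketch, hypothetical df / key names.
df_salted = df.withColumn("salt", (F.rand() * 100).cast("int")) \
              .repartition(100, "salt")

# Rows per partition are now roughly even...
df_salted.groupBy(F.spark_partition_id().alias("partition")).count().show()

# ...but the per-key counts (what a join on `key` actually shuffles on) are unchanged.
df_salted.groupBy("key").count().orderBy(F.desc("count")).show(10)
```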
3. Sticky Cluster or Deployment Issue
Even with the same config:
- New deployment might hit different underlying compute nodes
- Some clusters have “cold start” penalties (nodes downloading libraries, syncing with workspace, etc.)
- Deployment may trigger different Spark runtime versions or settings (check Spark UI)
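A quick way to rule this out is to print the runtime version and a few key confs in both deployments and diff the output. A minimal sketch (the Databricks-specific tag is an assumption and may not exist on every runtime, hence the default):

```python
# Compare these values between the fast and slow deployments.
print("Spark version:", spark.version)

for k in [
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
    "spark.databricks.clusterUsageTags.sparkVersion",  # Databricks runtime tag (assumption: present on your runtime)
]:
    print(k, "=", spark.conf.get(k, "<not set>"))
```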
What You Can Do
A. Check Spark UI (Stage-Level Analysis)
- Go to Spark UI > Stages
- Look for long tails in task duration (a few tasks taking much longer than the rest)
- Look at Shuffle Read/Write, Task Duration, GC Time, and Skew
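As a rough programmatic companion to the UI, you can poll the status tracker while the slow job is running (e.g., from a separate thread or a second notebook attached to the same cluster). A stage that is nearly complete but keeps a handful of active tasks for a long time is showing the classic straggler / long-tail pattern. A sketch:

```python
# Snapshot of currently running stages and their task progress.
tracker = spark.sparkContext.statusTracker()

for stage_id in tracker.getActiveStageIds():
    info = tracker.getStageInfo(stage_id)
    if info:
        print(f"stage {stage_id} ({info.name}): "
              f"{info.numCompletedTasks}/{info.numTasks} done, "
              f"{info.numActiveTasks} running, {info.numFailedTasks} failed")
```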
B. Practical Fixes
1. Lower `spark.sql.shuffle.partitions` back to 1000–2000 if 20K is too high for your data volume.
`spark.conf.set("spark.sql.shuffle.partitions", 1000)`
2. Persist at the right stages: don't unpersist or clear the cache while a DataFrame is still being reused. Use `.persist()` (or `.checkpoint()` to truncate a long lineage) when a DataFrame feeds multiple downstream actions.
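A minimal sketch of that pattern, with hypothetical table and column names: persist the expensive intermediate, run every downstream action, and only then unpersist.

```python
from pyspark import StorageLevel

# Hypothetical names: an expensive intermediate reused by two aggregations.
enriched = raw_df.join(dim_df, "id")
enriched.persist(StorageLevel.MEMORY_AND_DISK)

enriched.groupBy("region").count().write.mode("overwrite").saveAsTable("agg_by_region")
enriched.groupBy("product").count().write.mode("overwrite").saveAsTable("agg_by_product")

enriched.unpersist()  # only after every downstream action has run

# checkpoint() truncates a long lineage instead; it needs a checkpoint dir first:
# spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")  # hypothetical path
# enriched = enriched.checkpoint()
```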
3. Monitor I/O Metrics:
- Use Ganglia / Databricks metrics / CloudWatch (if on AWS) to observe:
  - Disk IOPS
  - Network throughput
  - CPU I/O wait %
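If you just want a quick spot check of I/O wait on the driver itself (workers still need the cluster metrics UI / Ganglia / CloudWatch), this sketch reads the Linux `/proc/stat` counters twice and reports the iowait share of the interval:

```python
import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

t1 = cpu_times()
time.sleep(5)
t2 = cpu_times()
deltas = [b - a for a, b in zip(t1, t2)]
print(f"driver CPU iowait over 5s: {100.0 * deltas[4] / sum(deltas):.1f}%")  # index 4 = iowait
```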
4. Skew Mitigation:
- Apply salting to the skewed join itself (salt both sides of the join key), not just to a blanket repartition of the DataFrame (see the sketch after this list).
- Inspect the per-key count distribution (e.g., `rdd.countByKey()` or `df.groupBy(key).count()`) to detect skew.
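Here's a sketch of the salted-join pattern, assuming a large DataFrame `facts` skewed on column `key` and a smaller `dims` side (all names hypothetical): the skewed side gets a random salt, and the other side is replicated across every salt value so each salted key still finds its match.

```python
from pyspark.sql import functions as F

N_SALTS = 32  # tune to the severity of the skew

# Skewed side: add a random salt 0..N_SALTS-1
facts_salted = facts.withColumn("salt", (F.rand() * N_SALTS).cast("int"))

# Other side: replicate each row across all salt values so every salted key matches
dims_salted = dims.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N_SALTS)])))

joined = facts_salted.join(dims_salted, ["key", "salt"]).drop("salt")
```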
5. Data locality: set `spark.locality.wait=0s` (in the cluster's Spark config) if tasks are stuck waiting for preferred nodes.
6. Try autoscaling again: since the fixed 20-node cluster isn't helping, try autoscaling from 10 to 30 workers.
LR