
What is iowait, and will it impact the performance of my job?

db_eswar
New Contributor
One job was taking more than 7 hrs. After I added the configuration below, it finished in under 2:30, but after deployment with the same parameters it is taking 7+ hrs again.
 
1) Increased spark.sql.shuffle.partitions from 500 to 20000:
spark.conf.set("spark.sql.shuffle.partitions", 20000)
2) Cleared the SQL cache and unpersisted any cached RDDs:
spark.catalog.clearCache()
# Unpersist every RDD still pinned in the JVM context (internal _jsc handle)
for (rdd_id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted {} rdd".format(rdd_id))
 
3) Added a random salt column and repartitioned on it:
from pyspark.sql import functions as F  # needed for F.rand()
DF = DF.withColumn('salt', F.rand())
DF = DF.repartition(100, 'salt')
 
I also tried a fixed 20-node cluster, but it still takes 7+ hrs after deployment (no change in the notebook or cluster configuration).
 
Before deployment, with autoscaling (1:20 nodes), it also completed in under 2:30.

Whenever iowait is high, my job takes longer to complete.

2 REPLIES

SP_6721
Contributor

Hi @db_eswar 

High iowait in your Spark jobs is probably caused by storage or disk bottlenecks, not CPU or memory issues. The slowdown you're seeing could be due to a cold cache, slower disks, or increased resource usage.

To troubleshoot, use the Spark UI and your cloud provider's monitoring tools to watch iowait, disk, and network activity while the job runs. It also helps to avoid shared or overloaded disk and network resources; for critical jobs, dedicated clusters with high-throughput storage work best. And check that Delta caching is actually being used, since it can make a big difference for repeated reads.
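If you want to verify the caching part from a notebook, a minimal sketch (Databricks disk cache, formerly Delta cache; only instance types that support it will benefit, and on cache-accelerated instances it is on by default):

# Minimal sketch: enable the Databricks disk cache for the session and
# confirm its current setting.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
print(spark.conf.get("spark.databricks.io.cache.enabled"))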

lingareddy_Alva
Honored Contributor II

Hi @SP_6721,

Likely Causes
1. I/O Wait Indicates Disk or Network Latency
High IO wait usually means the CPU is idle waiting for disk or network I/O. Common reasons:
- Slow disk (DBFS / external storage) access (e.g., S3, ADLS Gen2 throttling)
- Data skew causing a few tasks to spill to disk
- Cluster nodes shared across multiple jobs or not warmed up
- Cold cache on cluster startup — your earlier run might have benefited from cached metadata or files


2. Partition Explosion or Skew
You increased `spark.sql.shuffle.partitions` from 500 to 20,000. That can:
- Improve performance if you have extremely large data evenly distributed
- Slow down execution if partitions are skewed or task scheduling overhead grows
Also, repartition(100, 'salt') introduces randomness, which can sometimes mask skew but not eliminate it.
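To see whether 20,000 shuffle partitions is producing lots of tiny or empty tasks, you can inspect the partition count and row distribution directly. A minimal sketch, where DF stands in for the DataFrame from the original post:

# Minimal sketch: check how many partitions DF actually has and how evenly
# rows are spread across them. The glom() pass runs a job over the data, so
# use it for diagnostics, not in the production path.
print("partitions:", DF.rdd.getNumPartitions())

sizes = DF.rdd.glom().map(len).collect()
print("rows/partition: min={}, max={}, empty={}".format(
    min(sizes), max(sizes), sizes.count(0)))

A few huge partitions among many empty ones points to skew; thousands of near-empty partitions points to scheduling overhead from over-partitioning.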


3. Sticky Cluster or Deployment Issue
Even with the same config:
- New deployment might hit different underlying compute nodes
- Some clusters have “cold start” penalties (nodes downloading libraries, syncing with workspace, etc.)
- Deployment may trigger different Spark runtime versions or settings (check Spark UI)
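To rule out runtime or config drift between deployments, dump the settings that matter from both the fast and the slow run and diff them. A minimal sketch (the Databricks tag key is an assumption about your cluster):

# Minimal sketch: print the runtime details that commonly differ between
# deployments, so the fast and slow runs can be compared side by side.
print("Spark version:", spark.version)
for key in [
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
    "spark.databricks.clusterUsageTags.sparkVersion",  # assumption: present on Databricks clusters
]:
    print(key, "=", spark.conf.get(key, "<not set>"))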

What You Can Do
A. Check Spark UI (Stage-Level Analysis)
- Go to Spark UI > Stages
- Look for long tails in tasks (some tasks taking much longer)
- Look at Shuffle Read/Write, Task Duration, GC Time, and Skew
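If you'd rather watch for long tails programmatically (say, from a monitoring cell or thread), PySpark's status tracker exposes per-stage task counts. A minimal sketch:

# Minimal sketch: poll active stages and print task progress, as a
# programmatic complement to the Spark UI's Stages page.
tracker = spark.sparkContext.statusTracker()
for stage_id in tracker.getActiveStageIds():
    info = tracker.getStageInfo(stage_id)
    if info is not None:
        print("stage {}: {}/{} tasks done, {} active, {} failed".format(
            stage_id, info.numCompletedTasks, info.numTasks,
            info.numActiveTasks, info.numFailedTasks))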

B. Practical Fixes

1. Lower `spark.sql.shuffle.partitions` back to 1000–2000 if 20K is too high for your data volume.
spark.conf.set("spark.sql.shuffle.partitions", 1000)
2. Persist at the right stages: don't clear the cache while a DataFrame is still being reused. Use `.persist()` (or `.checkpoint()` to cut long lineages) when a DF feeds multiple downstream actions.
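The pattern is to materialize once and release only after the last reuse. A minimal sketch, where DF and the column name are placeholders:

# Minimal sketch: cache a DataFrame reused by several actions, materialize
# it once, and unpersist only after the last consumer has run.
from pyspark import StorageLevel

DF = DF.persist(StorageLevel.MEMORY_AND_DISK)
DF.count()                                    # materialize the cache once
agg = DF.groupBy("some_col").count()          # hypothetical reuse #1
filtered = DF.filter("some_col IS NOT NULL")  # hypothetical reuse #2
# ... run the actions that consume agg / filtered ...
DF.unpersist()                                # release only when done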
3. Monitor I/O metrics: use Ganglia / Databricks metrics / CloudWatch (if on AWS) to observe:
- Disk IOPS
- Network throughput
- CPU iowait %
4. Skew Mitigation:
- Use salting on skewed joins, not just on DF.
- Inspect .countByKey() distribution to detect skew.
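Concretely, that means checking the key distribution first and then salting the join itself rather than doing a blanket repartition. A minimal sketch; fact_df, dim_df, and "join_key" are hypothetical names:

# Minimal sketch: detect skew on a join key, then salt the skewed (large)
# side and replicate the small side across the same salt range.
from pyspark.sql import functions as F

# 1) Key distribution: a handful of keys with huge counts indicates skew
fact_df.groupBy("join_key").count().orderBy(F.desc("count")).show(20)

# 2) Salted join: spread hot keys over N buckets
N = 32
fact_salted = fact_df.withColumn("salt", (F.rand() * N).cast("int"))
dim_salted = dim_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(N)])))
joined = fact_salted.join(dim_salted, ["join_key", "salt"])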

5. Data locality: set spark.locality.wait=0s if tasks sit waiting for preferred nodes (see the config sketch after this list).
6. Try autoscaling again: since the fixed 20-node cluster isn't helping, try autoscaling from 10–30 nodes.
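One caveat on item 5: spark.locality.wait is a core scheduler setting, and on recent Spark versions it generally can't be changed with spark.conf.set() from a running notebook, so set it in the cluster's Spark config instead (Advanced options > Spark config in the cluster UI). A sketch of the cluster-config entry:

# Cluster Spark config (one setting per line, key and value separated by a space)
spark.locality.wait 0s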

LR
