<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: what is iowait, will it impact performance of my job in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/what-is-iowait-will-it-impact-performance-of-my-job/m-p/117177#M45443</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/162847"&gt;@db_eswar&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;High iowait in your Spark jobs is probably caused by storage or disk bottlenecks, not CPU or memory issues. The slowdown you're seeing could be due to a cold cache, slower disks, or increased resource usage.&lt;/P&gt;&lt;P&gt;To troubleshoot, you can use the Spark UI and your cloud provider's monitoring tools to keep an eye on iowait, disk, and network activity during the job. It's also a good idea to avoid shared or overloaded disk and network resources. For critical jobs, dedicated clusters with high-throughput storage work best. And don't forget to make sure Delta caching is being used properly for better performance.&lt;/P&gt;</description>
    <pubDate>Wed, 30 Apr 2025 15:21:18 GMT</pubDate>
    <dc:creator>SP_6721</dc:creator>
    <dc:date>2025-04-30T15:21:18Z</dc:date>
    <item>
      <title>what is iowait, will it impact performance of my job</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-iowait-will-it-impact-performance-of-my-job/m-p/117169#M45442</link>
      <description>&lt;DIV&gt;One job takes more than 7 hours. After I added the configuration below, it took &amp;lt;2:30 mins, but after deployment with the same parameters it takes 7+ hours again.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;1) spark.conf.set("spark.sql.shuffle.partitions", 500) --&amp;gt; spark.conf.set("spark.sql.shuffle.partitions", 20000)&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;2) spark.catalog.clearCache()&lt;/DIV&gt;&lt;DIV&gt;for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; rdd.unpersist()&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; print("Unpersisted {} rdd".format(id))&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;3) DF = DF.withColumn('salt', F.rand())&lt;/DIV&gt;&lt;DIV&gt;DF = DF.repartition(100, 'salt')&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I also tried a fixed 20-node cluster, but the job still takes 7+ hours after deployment (no change in notebook or cluster configuration).&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Before deployment, with 1:20 (auto scaling), it also took &amp;lt;2:30 mins.&lt;/DIV&gt;&lt;P&gt;Whenever time is spent in iowait, my job takes longer to complete.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Apr 2025 14:42:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-iowait-will-it-impact-performance-of-my-job/m-p/117169#M45442</guid>
      <dc:creator>db_eswar</dc:creator>
      <dc:date>2025-04-30T14:42:37Z</dc:date>
    </item>
    <item>
      <title>Re: what is iowait, will it impact performance of my job</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-iowait-will-it-impact-performance-of-my-job/m-p/117177#M45443</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/162847"&gt;@db_eswar&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;High iowait in your Spark jobs is probably caused by storage or disk bottlenecks, not CPU or memory issues. The slowdown you're seeing could be due to a cold cache, slower disks, or increased resource usage.&lt;/P&gt;&lt;P&gt;To troubleshoot, you can use the Spark UI and your cloud provider's monitoring tools to keep an eye on iowait, disk, and network activity during the job. It's also a good idea to avoid shared or overloaded disk and network resources. For critical jobs, dedicated clusters with high-throughput storage work best. And don't forget to make sure Delta caching is being used properly for better performance.&lt;/P&gt;</description>
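To make the iowait figure mentioned in this reply concrete, here is a minimal sketch of how the percentage can be derived on a Linux node from two samples of /proc/stat. It assumes the standard field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice). The function names and the sample counter values are made up for illustration and are not part of any Spark or Databricks API.

```python
def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-core lines are "cpu0", "cpu1", ...
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before_text, after_text):
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [new - old for old, new in zip(before, after)]
    # Sum only the first eight counters (user..steal); guest time is
    # already included in the user and nice counters.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    # Field order is user, nice, system, idle, iowait, ... so iowait is index 4.
    return 100.0 * deltas[4] / total


# Hypothetical samples taken a few seconds apart (values are illustrative only).
sample_before = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0"
sample_after = "cpu  200 0 100 1300 400 0 0 0 0 0\ncpu0 200 0 100 1300 400 0 0 0 0 0"
print(iowait_percent(sample_before, sample_after))
```

On a real node the two texts would come from reading /proc/stat twice with a pause in between. A persistently high value during the slow runs would be consistent with the storage bottleneck described above, though it does not by itself say which disk or remote store is responsible.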
      <pubDate>Wed, 30 Apr 2025 15:21:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-iowait-will-it-impact-performance-of-my-job/m-p/117177#M45443</guid>
      <dc:creator>SP_6721</dc:creator>
      <dc:date>2025-04-30T15:21:18Z</dc:date>
    </item>
    <item>
      <title>Re: what is iowait, will it impact performance of my job</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-iowait-will-it-impact-performance-of-my-job/m-p/117181#M45445</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/156441"&gt;@SP_6721&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Likely Causes&lt;/STRONG&gt;&lt;BR /&gt;1. I/O Wait Indicates Disk or Network Latency&lt;BR /&gt;High I/O wait usually means the CPU is idle while waiting for disk or network I/O to complete. Common reasons:&lt;BR /&gt;- Slow access to DBFS or external storage (e.g., S3 or ADLS Gen2 throttling)&lt;BR /&gt;- Data skew causing a few tasks to spill to disk&lt;BR /&gt;- Cluster nodes shared across multiple jobs or not warmed up&lt;BR /&gt;- Cold cache on cluster startup; your earlier run might have benefited from cached metadata or files&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;2. Partition Explosion or Skew&lt;BR /&gt;You increased `spark.sql.shuffle.partitions` from 500 to 20,000. That can:&lt;BR /&gt;- Improve performance if you have a very large, evenly distributed dataset&lt;BR /&gt;- Slow down execution if partitions are skewed or task scheduling overhead grows&lt;BR /&gt;Also, repartition(100, 'salt') introduces randomness, which can sometimes mask skew but not eliminate it.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;3. Sticky Cluster or Deployment Issue&lt;BR /&gt;Even with the same config:&lt;BR /&gt;- A new deployment might land on different underlying compute nodes&lt;BR /&gt;- Some clusters have “cold start” penalties (nodes downloading libraries, syncing with workspace, etc.)&lt;BR /&gt;- The deployment may run with a different Spark runtime version or settings (check the Spark UI)&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What You Can Do&lt;/STRONG&gt;&lt;BR /&gt;A. Check Spark UI (Stage-Level Analysis)&lt;BR /&gt;- Go to Spark UI &amp;gt; Stages&lt;BR /&gt;- Look for long tails in tasks (some tasks taking much longer than the rest)&lt;BR /&gt;- Look at Shuffle Read/Write, Task Duration, GC Time, and Skew&lt;/P&gt;&lt;P&gt;B. Practical Fixes&lt;/P&gt;&lt;P&gt;1. Lower `spark.sql.shuffle.partitions` to 1000–2000 if 20K is too high for your data volume:&lt;BR /&gt;spark.conf.set("spark.sql.shuffle.partitions", 1000)&lt;BR /&gt;2. Persist at the right stages: don’t clear the cache immediately if the data is reused. Use `.checkpoint()` or `.persist()` when a DataFrame is used multiple times.&lt;BR /&gt;3. Monitor I/O metrics: use Ganglia, Databricks Metrics, or CloudWatch (if on AWS) to observe disk IOPS, network throughput, and CPU iowait %.&lt;BR /&gt;4. Skew mitigation:&lt;BR /&gt;- Apply salting to the skewed join keys, not just to the DataFrame as a whole.&lt;BR /&gt;- Inspect the .countByKey() distribution to detect skew.&lt;/P&gt;&lt;P&gt;5. Data locality: set spark.locality.wait=0s if tasks are stuck waiting for preferred nodes.&lt;BR /&gt;6. Try autoscaling again: since the fixed 20-node cluster isn’t helping, try autoscaling from 10–30.&lt;/P&gt;</description>
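The skew check suggested in this reply can be sketched as follows. This is only an illustration: skew_report is a hypothetical helper, the threshold you act on is a judgment call, and the column name join_key in the comment is a placeholder. The helper works on any mapping of key to row count, which is the shape that countByKey() returns for a pair RDD.

```python
def skew_report(key_counts, top_n=3):
    """Summarise how unevenly rows are spread across keys.

    key_counts maps each key to its row count, e.g. the result of
        df.rdd.map(lambda row: (row["join_key"], 1)).countByKey()
    where "join_key" is a placeholder column name.
    """
    counts = list(key_counts.values())
    if not counts:
        raise ValueError("key_counts is empty")
    average = sum(counts) / len(counts)
    # Sort keys by descending count so the heaviest ones come first.
    heaviest = sorted(key_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {
        "keys": len(counts),
        "max_over_mean": max(counts) / average,
        "heaviest": heaviest,
    }


# Illustrative counts only: one key holds almost all of the rows.
example_counts = {"a": 970, "b": 10, "c": 10, "d": 10}
report = skew_report(example_counts)
print(report["max_over_mean"], report["heaviest"][0])
```

A max_over_mean close to 1 suggests an even spread, while a large value points at the keys worth salting. Note that countByKey() collects the counts to the driver, so for high-cardinality keys an aggregation such as df.groupBy("join_key").count() is likely the safer way to obtain the same numbers.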
      <pubDate>Wed, 30 Apr 2025 16:27:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-iowait-will-it-impact-performance-of-my-job/m-p/117181#M45445</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-04-30T16:27:56Z</dc:date>
    </item>
  </channel>
</rss>

